Closed DavidHuber-NOAA closed 2 weeks ago
@DavidHuber-NOAA First and foremost THANK YOU!!!! I'm about to run some tests w/and w/out this change to compare. Would it be helpful if I ran this on a particular machine to take a look at certain aspects, or is there output results I can just double check there instead?
@JessicaMeixner-NOAA With the updates to the UFS coming in yesterday, I think my baseline dataset will be out of date for comparisons, so I think you will want to generate a new baseline from develop. I believe a test on any system would be fine for comparing outputs from develop and the feature/opt_pnt branch. But if you would like to see timing differences, then I would recommend running on Hercules since it has had the biggest issues with the gfswavepostpnt job.
@DavidHuber-NOAA Happy to run a new test and thanks again for trying to make everyone's life easier while we work on the longer term solution!
@DavidHuber-NOAA wanted to let you know that I added the boundary point jobs in my test and they timed out, which isn't too surprising to be honest. I've extended the length and am re-running. Hopefully will have results by the end of the day. I will keep you posted!
@DavidHuber-NOAA I'm still having issues with jobs failing and timing out. I did get the develop branch tests to at least run the main point job and am now working on the boundary jobs.
For the test running this PR, I got a failure in the job:
Executing the boundary point cat script at : Thu Jun 6 15:12:07 CDT 2024
------------------------------------
+ exgfs_wave_post_pnt.sh[562]: '[' 200 -gt 1 ']'
+ exgfs_wave_post_pnt.sh[564]: '[' YES = YES ']'
+ exgfs_wave_post_pnt.sh[565]: srun -l --export=ALL -n 200 '--multi-prog --output=mpmd.%j.%t.out' cmdmprogbuoy
srun: unrecognized option '--multi-prog --output=mpmd.%j.%t.out'
Try "srun --help" for more information
+ exgfs_wave_post_pnt.sh[1]: postamble exgfs_wave_post_pnt.sh 1717692662 255
+ preamble.sh[70]: set +x
End exgfs_wave_post_pnt.sh at 20:12:08 with error code 255 (time elapsed: 03:21:06)
+ JGLOBAL_WAVE_POST_PNT[1]: postamble JGLOBAL_WAVE_POST_PNT 1717692626 255
+ preamble.sh[70]: set +x
End JGLOBAL_WAVE_POST_PNT at 20:12:08 with error code 255 (time elapsed: 03:21:42)
+ wavepostpnt.sh[1]: postamble wavepostpnt.sh 1717692622 255
+ preamble.sh[70]: set +x
and another error:
+ exgfs_wave_post_pnt.sh[476]: srun -l --export=ALL -n 448 --multi-prog --output=mpmd.%j.%t.out cmdmprog
srun: warning: can't honor --ntasks-per-node set to 40 which doesn't match the requested tasks 448 with the number of requested nodes 12. Ignoring --ntasks-per-node.
srun: error: hercules-01-31: task 381: Exited with exit code 1
srun: Terminating StepId=1420346.0
srun: error: hercules-01-26: tasks 189-225: Terminated
srun: error: hercules-01-31: tasks 374-380,382-410: Terminated
srun: error: hercules-01-27: tasks 226-262: Terminated
srun: error: hercules-01-28: tasks 263-299: Terminated
srun: error: hercules-01-23: tasks 76-113: Terminated
srun: error: hercules-01-29: tasks 300-336: Terminated
srun: error: hercules-01-30: tasks 337-373: Terminated
srun: error: hercules-01-24: tasks 114-151: Terminated
srun: error: hercules-01-25: tasks 152-188: Terminated
srun: error: hercules-01-21: tasks 0-37: Terminated
srun: error: hercules-01-32: tasks 411-447: Terminated
srun: error: hercules-01-22: tasks 38-75: Terminated
srun: Force Terminated StepId=1420346.0
+ exgfs_wave_post_pnt.sh[1]: postamble exgfs_wave_post_pnt.sh 1717692640 143
+ preamble.sh[70]: set +x
End exgfs_wave_post_pnt.sh at 17:21:01 with error code 143 (time elapsed: 00:30:21)
+ JGLOBAL_WAVE_POST_BNDPNTBLL[1]: postamble JGLOBAL_WAVE_POST_BNDPNTBLL 1717692632 143
+ preamble.sh[70]: set +x
Log files are here: /work2/noaa/marine/jmeixner/hercules/testpointupdate/update/test01/COMROOT/test01/logs/2021032312
I'm going to look into this more. I'm curious whether this is because of changes to the line calling `wavempexec`. I vaguely remember struggling a lot with the `""` quoting on that line.
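For reference, the quoting pitfall that produces an "unrecognized option" error like the one in the log above can be sketched as follows (the `count_args` helper and variable names are illustrative, not from the actual scripts):

```shell
#!/bin/bash
# Illustrative only -- count_args and the opts variables are hypothetical,
# not taken from exgfs_wave_post_pnt.sh.
count_args() { echo $#; }

# Flags stored in one string and expanded with quotes collapse into a
# single argument, which srun then reports as one unrecognized option:
opts_str="--multi-prog --output=mpmd.%j.%t.out"
count_args "${opts_str}"    # prints 1

# Storing the flags in an array keeps them as separate arguments:
opts_arr=(--multi-prog --output=mpmd.%j.%t.out)
count_args "${opts_arr[@]}" # prints 2
```

This matches the failure signature above, where srun received the literal string `'--multi-prog --output=mpmd.%j.%t.out'` as a single option.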
@JessicaMeixner-NOAA Thanks for testing. I'll dig in and see what the issue is.
@DavidHuber-NOAA I can help more next week if I need to as well. Need to do a couple other things today first though. I'll check in with you early next week.
@JessicaMeixner-NOAA thanks! Hopefully I will be able to get somewhere today.
I did have a couple of ideas on the timing out jobs. Perhaps if we run a shorter forecast (say 48 hours), the pnt and bullpnt jobs would be able to finish.
If instead we should be testing out to 120 hours, then we may be able to scale back the number of tasks/node to say 20.
Which one would you prefer? I'll test that way today.
I've just been testing the C48_S2SW and then turning on the boundary points (lowering the forecast hours will help speed up testing/checking, so probably a good idea for faster turn-around). That being said, I think we'll want whatever is supposed to be running in the CI to work. The only job that's usually run w/the CI is the point (not the boundaries) and I think that should run okay as is.
We're testing out the full 16 days, but only on WCOSS with the C96_atm3DVar_extended test, so no wave test.
@JessicaMeixner-NOAA I fixed a few bugs in my optimizations and all jobs have run to completion for a 24-hour forecast on Hercules. I did have to change the layout to get the current develop bll jobs to run in a reasonable period of time. I did this by limiting the number of jobs run per node. These are the layouts and runtimes for the 24 hour forecasts of the optimized and develop versions of these jobs:
develop:

Job Name | Tasks/Node | # Nodes | Runtime (s)
---|---|---|---
gfswavepostbndpnt | 10 | 24 | 8400
gfswavepostbndpntbll | 20 | 24 | 2269
gfswavepostpnt | 40 | 5 | 6171

feature/opt_pnt:

Job Name | Tasks/Node | # Nodes | Runtime (s)
---|---|---|---
gfswavepostbndpnt | 40 | 6 | 6186
gfswavepostbndpntbll | 40 | 12 | 1554
gfswavepostpnt | 40 | 5 | 4177
Based on these runtimes, I adjusted the `config.resources` `wtime` upward for the `wavepostbndpnt` job.
I also had to manually adjust the data dependency for the `gfswavepostbndpntbll` job, which required the history file `gfs.t12z.atm.logf180.txt`; in my XMLs this was changed to `...024.txt`. I made changes to `workflow/rocoto/gfs_tasks.py` and `workflow/rocoto/gefs_tasks.py` to adjust the forecast hour to look for when the max is less than 180 hours.
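A minimal sketch of that dependency logic (the `dep_hour` helper is hypothetical; the real change lives in the rocoto task generators): the job waits on the `atm.logf<FFF>.txt` history file for the lesser of the experiment's max forecast hour and 180.

```shell
#!/bin/bash
# Hypothetical helper: pick the forecast hour whose history file the
# bndpntbll job should wait for, capped at 180 hours.
dep_hour() {
  local fhmax=$1
  echo $(( fhmax < 180 ? fhmax : 180 ))
}

printf 'gfs.t12z.atm.logf%03d.txt\n' "$(dep_hour 24)"   # short test forecast -> logf024
printf 'gfs.t12z.atm.logf%03d.txt\n' "$(dep_hour 384)"  # full-length forecast -> logf180
```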
The `expdir`s and `comroot`s for these test cases are located in:

develop:
/work2/noaa/global/dhuber/para/EXPDIR/dev_bullpnt
/work2/noaa/global/dhuber/para/COMROOT/dev_bullpnt

feature/opt_pnt:
/work2/noaa/global/dhuber/para/EXPDIR/opt_bullpnt
/work2/noaa/global/dhuber/para/COMROOT/opt_bullpnt
The contents of the `spec`, `bull`, and `cbull` tarballs (i.e. `gfswave.t12z.bull_tar`, `gfswave.t12z.cbull_tar`, `gfswave.t12z.ibpbull_tar`, `gfswave.t12z.ibpcbull_tar`, `gfswave.t12z.ibp_tar`, and `gfswave.t12z.spec_tar.gz` in `${COMROOT}/gfs.20210323/12/products/wave/station`) were extracted to `/work2/noaa/global/dhuber/station_data/opt` and `/work2/noaa/global/dhuber/station_data/dev` and compared file by file. There were no differences between the experiments.
Marking this PR ready for review.
Thanks @DavidHuber-NOAA ! I'll get this reviewed today
@DavidHuber-NOAA apologies for not getting to this yesterday, trying to look at your output this morning and do not have permissions to see: /work2/noaa/global/dhuber/para/COMROOT/dev_bullpnt/gfs.20210323/12/products
@JessicaMeixner-NOAA Thanks for taking a look. It should be opened up now.
@DavidHuber-NOAA thanks! i can see the directories now
Starting CI on Hera and Hercules.
Experiment C48mx500_3DVarAOWCDA FAILED on Hera with error logs:
/scratch1/NCEPDEV/global/CI/2657/RUNTESTS/COMROOT/C48mx500_3DVarAOWCDA_8f40ef76/logs/2021032412/gdasfcst.log
Follow link here to view the contents of the above file(s): (link)
Experiment C96_atm3DVar FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C96_atm3DVar_8f40ef76
Experiment C48_S2SW FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C48_S2SW_8f40ef76
Experiment C96_atmaerosnowDA FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C96_atmaerosnowDA_8f40ef76
Experiment C96C48_hybatmDA FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C96C48_hybatmDA_8f40ef76
Experiment C48mx500_3DVarAOWCDA FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C48mx500_3DVarAOWCDA_8f40ef76
Experiment C48_ATM FAILED on Hera with error logs:
/scratch1/NCEPDEV/global/CI/2657/RUNTESTS/COMROOT/C48_ATM_8f40ef76/logs/2021032312/gfsfcst.log
Experiment C48_ATM FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C48_ATM_8f40ef76
Experiment C48_S2SWA_gefs FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C48_S2SWA_gefs_8f40ef76
Hera failed due to stmp being full.
Experiment C96C48_hybatmDA FAILED on Hercules with error logs:
/work2/noaa/stmp/CI/HERCULES/2657/RUNTESTS/COMROOT/C96C48_hybatmDA_8f40ef76/logs/2021122106/enkfgdaseupd.log
Experiment C96C48_hybatmDA FAILED on Hercules in
/work2/noaa/stmp/CI/HERCULES/2657/RUNTESTS/C96C48_hybatmDA_8f40ef76
`enkf.x` failed with an `integer divide by zero` error during the `eupd` job at line 724 of letkf.f90. This error does not make sense to me at that line, as it is a variable declaration without any math:

`real(r_kind), dimension(:), allocatable :: work1`

Perhaps an earlier job produced some invalid data.
Since this PR only deals with the `wavepostpnt` job, I think this failure could be ignored since all `S2SW*` tests pass, but will defer to @aerorahul.
I think that is fine. @CatherineThomas-NOAA has anyone you know experienced this on Hercules?
@aerorahul: I haven't heard of this problem on Hercules yet, but I also don't know of anyone really cycling over there. We run the GSI ctests there so the enkf in general is exercised, but it's not exactly testing a wide range of cases. Thanks for the heads up.
Description
Optimizes the gfswavepostpnt job.
This is done by:
1) reducing the number of calls to `sed`, `awk`, `grep`, and `cat` by
   - eliminating unnecessary `cat` calls (e.g. `cat <file> | sed 'something'`)
   - combining `sed` and `grep` calls when possible
   - using `awk` calls instead of handling that logic in bash
2) minimizing as much as possible the amount of data on disk that has to be read in (e.g. limiting `sed` to read only the line numbers it needs)

Type of change
Change characteristics
How has this been tested?
C48_S2SW test case on Hercules. Outputs were identical to a non-optimized run. The optimized run took 80 minutes while the non-optimized version took 191 minutes.
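As a rough illustration of the call-reduction described in the Description (the input data and line numbers below are stand-ins, not the actual station-file processing):

```shell
#!/bin/bash
# Stand-in input; the real job reads station/bulletin files.
tmp=$(mktemp)
printf 'a\nb\nc\nd\ne\n' > "$tmp"

# Before: a useless cat spawns an extra process per invocation.
cat "$tmp" | sed -n '2p'

# After: sed reads the file directly...
sed -n '2p' "$tmp"

# ...and quits as soon as it has the line it needs, instead of
# scanning the rest of the file.
sed -n '2{p;q;}' "$tmp"

rm -f "$tmp"
```

All three commands print the same line; the latter forms simply do less work, which matters when the pattern is repeated thousands of times per job.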
Checklist