NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Optimize wavepostpnt #2657

Closed: DavidHuber-NOAA closed this pull request 2 weeks ago

DavidHuber-NOAA commented 4 weeks ago

Description

Optimizes the gfswavepostpnt job.

This is done chiefly by reducing the number of calls to sed, awk, grep, and cat.
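To illustrate the flavor of the change (a contrived sketch, not an excerpt from the actual diff), per-buoy text processing that previously spawned several utilities inside a loop can be collapsed into a single pass:

    # Hypothetical "before": one grep and one sed process spawned per buoy
    while read -r buoy; do
      grep "^${buoy} " allpoints.txt | sed 's/  */ /g' >> selected.txt
    done < buoy_list.txt

    # Hypothetical "after": one awk invocation does the selection and cleanup
    awk 'NR==FNR {want[$1]=1; next} ($1 in want) {gsub(/  */, " "); print}' \
        buoy_list.txt allpoints.txt > selected.txt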

How has this been tested?

C48_S2SW test case on Hercules. Outputs were identical to a non-optimized run. The optimized run took 80 minutes while the non-optimized version took 191 minutes.

JessicaMeixner-NOAA commented 3 weeks ago

@DavidHuber-NOAA First and foremost THANK YOU!!!! I'm about to run some tests with and without this change to compare. Would it be helpful if I ran this on a particular machine to take a look at certain aspects, or are there output results I can just double check instead?

DavidHuber-NOAA commented 3 weeks ago

@JessicaMeixner-NOAA With the UFS updates that came in yesterday, my baseline dataset will be out of date for comparisons, so I think you will want to generate a new baseline from develop. I believe a test on any system would be fine for comparing outputs between develop and the feature/opt_pnt branch. But if you would like to see timing differences, then I would recommend running on Hercules since it has had the biggest issues with the gfswavepostpnt job.

JessicaMeixner-NOAA commented 3 weeks ago

@DavidHuber-NOAA Happy to run a new test and thanks again for trying to make everyone's life easier while we work on the longer term solution!

JessicaMeixner-NOAA commented 3 weeks ago

@DavidHuber-NOAA wanted to let you know that I added the boundary point jobs in my test and they timed out, which isn't too surprising, to be honest. I've extended the wall-clock limit and am re-running. Hopefully I will have results by the end of the day. I will keep you posted!

JessicaMeixner-NOAA commented 3 weeks ago

@DavidHuber-NOAA I'm still having issues with jobs failing and timing out. I did get the develop branch tests to at least run the main point job, working on the boundary jobs.

For the test running this PR, I got a failure in the job:

   Executing the boundary point cat script at : Thu Jun  6 15:12:07 CDT 2024
   ------------------------------------

+ exgfs_wave_post_pnt.sh[562]: '[' 200 -gt 1 ']'
+ exgfs_wave_post_pnt.sh[564]: '[' YES = YES ']'
+ exgfs_wave_post_pnt.sh[565]: srun -l --export=ALL -n 200 '--multi-prog --output=mpmd.%j.%t.out' cmdmprogbuoy
srun: unrecognized option '--multi-prog --output=mpmd.%j.%t.out'
Try "srun --help" for more information
+ exgfs_wave_post_pnt.sh[1]: postamble exgfs_wave_post_pnt.sh 1717692662 255
+ preamble.sh[70]: set +x
End exgfs_wave_post_pnt.sh at 20:12:08 with error code 255 (time elapsed: 03:21:06)
+ JGLOBAL_WAVE_POST_PNT[1]: postamble JGLOBAL_WAVE_POST_PNT 1717692626 255
+ preamble.sh[70]: set +x
End JGLOBAL_WAVE_POST_PNT at 20:12:08 with error code 255 (time elapsed: 03:21:42)
+ wavepostpnt.sh[1]: postamble wavepostpnt.sh 1717692622 255
+ preamble.sh[70]: set +x

and another error:

+ exgfs_wave_post_pnt.sh[476]: srun -l --export=ALL -n 448 --multi-prog --output=mpmd.%j.%t.out cmdmprog
srun: warning: can't honor --ntasks-per-node set to 40 which doesn't match the requested tasks 448 with the number of requested nodes 12. Ignoring --ntasks-per-node.
srun: error: hercules-01-31: task 381: Exited with exit code 1
srun: Terminating StepId=1420346.0
srun: error: hercules-01-26: tasks 189-225: Terminated
srun: error: hercules-01-31: tasks 374-380,382-410: Terminated
srun: error: hercules-01-27: tasks 226-262: Terminated
srun: error: hercules-01-28: tasks 263-299: Terminated
srun: error: hercules-01-23: tasks 76-113: Terminated
srun: error: hercules-01-29: tasks 300-336: Terminated
srun: error: hercules-01-30: tasks 337-373: Terminated
srun: error: hercules-01-24: tasks 114-151: Terminated
srun: error: hercules-01-25: tasks 152-188: Terminated
srun: error: hercules-01-21: tasks 0-37: Terminated
srun: error: hercules-01-32: tasks 411-447: Terminated
srun: error: hercules-01-22: tasks 38-75: Terminated
srun: Force Terminated StepId=1420346.0
+ exgfs_wave_post_pnt.sh[1]: postamble exgfs_wave_post_pnt.sh 1717692640 143
+ preamble.sh[70]: set +x
End exgfs_wave_post_pnt.sh at 17:21:01 with error code 143 (time elapsed: 00:30:21)
+ JGLOBAL_WAVE_POST_BNDPNTBLL[1]: postamble JGLOBAL_WAVE_POST_BNDPNTBLL 1717692632 143
+ preamble.sh[70]: set +x

Log files are here: /work2/noaa/marine/jmeixner/hercules/testpointupdate/update/test01/COMROOT/test01/logs/2021032312

I'm going to look into this more. I'm curious if this is because of changes to the line calling wavempexec. I vaguely remember struggling a lot with the quoting ("") on that line.
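If it helps narrow things down, the first traceback does look like a quoting problem: srun received '--multi-prog --output=mpmd.%j.%t.out' as a single argument. A minimal reproduction and one common fix, with illustrative definitions rather than the actual lines from the script:

    # Fails: both flags live in one string, and the quoted expansion hands
    # them to srun as a single word, hence "unrecognized option".
    wavempexec="--multi-prog --output=mpmd.%j.%t.out"
    srun -l --export=ALL -n 200 "${wavempexec}" cmdmprogbuoy

    # Works: keep the flags as separate words (e.g. a bash array) so each
    # one reaches srun as its own argument.
    wavempexec=(--multi-prog --output="mpmd.%j.%t.out")
    srun -l --export=ALL -n 200 "${wavempexec[@]}" cmdmprogbuoy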

DavidHuber-NOAA commented 3 weeks ago

@JessicaMeixner-NOAA Thanks for testing. I'll dig in and see what the issue is.

JessicaMeixner-NOAA commented 3 weeks ago

@DavidHuber-NOAA I can help more next week if I need to as well. Need to do a couple other things today first though. I'll check in with you early next week.

DavidHuber-NOAA commented 3 weeks ago

@JessicaMeixner-NOAA thanks! Hopefully I will be able to get somewhere today.

I did have a couple of ideas on the timing out jobs. Perhaps if we run a shorter forecast (say 48 hours), the pnt and bullpnt jobs would be able to finish.

If instead we should be testing out to 120 hours, then we may be able to scale back the number of tasks/node to say 20.

Which one would you prefer? I'll test that way today.

JessicaMeixner-NOAA commented 3 weeks ago

I've just been testing C48_S2SW and then turning on the boundary points (lowering the forecast hours will help speed up testing/checking, so probably a good idea for faster turn-around). That being said, I think we'll want whatever is supposed to be running in the CI to work. The only job that's usually run with the CI is the point job (not the boundaries), and I think that should run okay as is.

WalterKolczynski-NOAA commented 3 weeks ago

> @JessicaMeixner-NOAA thanks! Hopefully I will be able to get somewhere today.
>
> I did have a couple of ideas on the timing out jobs. Perhaps if we run a shorter forecast (say 48 hours), the pnt and bullpnt jobs would be able to finish.
>
> If instead we should be testing out to 120 hours, then we may be able to scale back the number of tasks/node to say 20.
>
> Which one would you prefer? I'll test that way today.

We're testing out the full 16 days, but only on WCOSS with the C96_atm3DVar_extended test, so no wave test.

DavidHuber-NOAA commented 3 weeks ago

@JessicaMeixner-NOAA I fixed a few bugs in my optimizations and all jobs have run to completion for a 24-hour forecast on Hercules. I did have to change the layout to get the current develop bndpntbll jobs to run in a reasonable period of time; I did this by limiting the number of tasks run per node. These are the layouts and runtimes for the 24-hour forecasts of the optimized and develop versions of these jobs:


Develop

    Job Name                Tasks/Node   # Nodes   Runtime (s)
    gfswavepostbndpnt           10          24         8400
    gfswavepostbndpntbll        20          24         2269
    gfswavepostpnt              40           5         6171

Optimized

    Job Name                Tasks/Node   # Nodes   Runtime (s)
    gfswavepostbndpnt           40           6         6186
    gfswavepostbndpntbll        40          12         1554
    gfswavepostpnt              40           5         4177

Based on these runtimes, I adjusted the config.resources wtime for the wavepostbndpnt job upward.
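Roughly, the kind of config.resources entry involved looks like this (sketch only: the variable names assume the wtime_*/npe_node_* naming convention and the values are illustrative, not the committed ones):

    # Sketch only: assumed variable names, illustrative values.
    # The develop bndpnt job needed ~8400 s (about 2 h 20 min), so the
    # wall-clock limit is bumped to leave some margin.
    export wtime_wavepostbndpnt="03:00:00"
    export npe_node_wavepostbndpnt=40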

I also had to manually adjust the data dependency for the gfswavepostbndpntbll job, which expects the history file gfs.t12z.atm.logf180.txt; in my XMLs this was changed to ...024.txt. I made changes to workflow/rocoto/gfs_tasks.py and workflow/rocoto/gefs_tasks.py so the dependency looks for an earlier forecast hour when the maximum forecast length is less than 180 hours.
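In simplified form, the capping logic is along these lines (the helper name below is made up for illustration; the real change lives in the dependency-building code of gfs_tasks.py/gefs_tasks.py):

    # Simplified sketch: cap the history-file hour used by the bndpntbll
    # data dependency at the configured maximum forecast length.
    def bndpntbll_dep_fhr(fhmax_gfs: int, default_fhr: int = 180) -> int:
        return min(default_fhr, fhmax_gfs)

    # With a 24-hour forecast the dependency becomes gfs.t12z.atm.logf024.txt
    fhr = bndpntbll_dep_fhr(24)
    dep_file = f"gfs.t12z.atm.logf{fhr:03d}.txt"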

The expdirs and comroots for these test cases are located here:

develop

/work2/noaa/global/dhuber/para/EXPDIR/dev_bullpnt
/work2/noaa/global/dhuber/para/COMROOT/dev_bullpnt

feature/opt_pnt

/work2/noaa/global/dhuber/para/EXPDIR/opt_bullpnt
/work2/noaa/global/dhuber/para/COMROOT/opt_bullpnt

The contents of the spec, bull, and cbull tarballs (i.e. gfswave.t12z.bull_tar, gfswave.t12z.cbull_tar, gfswave.t12z.ibpbull_tar, gfswave.t12z.ibpcbull_tar, gfswave.t12z.ibp_tar, and gfswave.t12z.spec_tar.gz in ${COMROOT}/gfs.20210323/12/products/wave/station) were extracted to /work2/noaa/global/dhuber/station_data/opt and /work2/noaa/global/dhuber/station_data/dev and compared file by file. There were no differences between the experiments.
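For anyone repeating the comparison, the procedure was roughly along these lines (the paths are the ones above; the exact commands may have differed):

    # Approximate recipe, not the exact commands used.
    cd /work2/noaa/global/dhuber/station_data/opt
    for f in /work2/noaa/global/dhuber/para/COMROOT/opt_bullpnt/gfs.20210323/12/products/wave/station/gfswave.t12z.*tar*; do
      tar -xf "${f}"
    done
    cd /work2/noaa/global/dhuber/station_data/dev
    for f in /work2/noaa/global/dhuber/para/COMROOT/dev_bullpnt/gfs.20210323/12/products/wave/station/gfswave.t12z.*tar*; do
      tar -xf "${f}"
    done
    # File-by-file comparison of the two extractions
    diff -r /work2/noaa/global/dhuber/station_data/opt /work2/noaa/global/dhuber/station_data/dev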

DavidHuber-NOAA commented 3 weeks ago

Marking this PR ready for review.

JessicaMeixner-NOAA commented 3 weeks ago

Thanks @DavidHuber-NOAA ! I'll get this reviewed today

JessicaMeixner-NOAA commented 3 weeks ago

@DavidHuber-NOAA apologies for not getting to this yesterday. I'm trying to look at your output this morning and do not have permission to see: /work2/noaa/global/dhuber/para/COMROOT/dev_bullpnt/gfs.20210323/12/products

DavidHuber-NOAA commented 3 weeks ago

@JessicaMeixner-NOAA Thanks for taking a look. It should be opened up now.

JessicaMeixner-NOAA commented 3 weeks ago

@DavidHuber-NOAA thanks! i can see the directories now

DavidHuber-NOAA commented 2 weeks ago

Starting CI on Hera and Hercules.

emcbot commented 2 weeks ago

Experiment C48mx500_3DVarAOWCDA FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2657/RUNTESTS/COMROOT/C48mx500_3DVarAOWCDA_8f40ef76/logs/2021032412/gdasfcst.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 2 weeks ago

Experiment C96_atm3DVar FAILED on Hera in /scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C96_atm3DVar_8f40ef76

emcbot commented 2 weeks ago

Experiment C48_S2SW FAILED on Hera in /scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C48_S2SW_8f40ef76

emcbot commented 2 weeks ago

Experiment C96_atmaerosnowDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C96_atmaerosnowDA_8f40ef76

emcbot commented 2 weeks ago

Experiment C96C48_hybatmDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C96C48_hybatmDA_8f40ef76

emcbot commented 2 weeks ago

Experiment C48mx500_3DVarAOWCDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C48mx500_3DVarAOWCDA_8f40ef76

emcbot commented 2 weeks ago

Experiment C48_ATM FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2657/RUNTESTS/COMROOT/C48_ATM_8f40ef76/logs/2021032312/gfsfcst.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 2 weeks ago

Experiment C48_ATM FAILED on Hera in /scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C48_ATM_8f40ef76

emcbot commented 2 weeks ago

Experiment C48_S2SWA_gefs FAILED on Hera in /scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C48_S2SWA_gefs_8f40ef76

DavidHuber-NOAA commented 2 weeks ago

Hera failed due to stmp being full.

emcbot commented 2 weeks ago

Experiment C96C48_hybatmDA FAILED on Hercules with error logs:

/work2/noaa/stmp/CI/HERCULES/2657/RUNTESTS/COMROOT/C96C48_hybatmDA_8f40ef76/logs/2021122106/enkfgdaseupd.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 2 weeks ago

Experiment C96C48_hybatmDA FAILED on Hercules in /work2/noaa/stmp/CI/HERCULES/2657/RUNTESTS/C96C48_hybatmDA_8f40ef76

DavidHuber-NOAA commented 2 weeks ago

enkf.x failed with an integer divide by zero error during the eupd job at line 724 of letkf.f90. This error does not make sense to me at that line as it is a variable declaration without any math:

real(r_kind), dimension(:), allocatable :: work1

Perhaps an earlier job produced some invalid data.

Since this PR only deals with the wavepostpnt job, I think this failure could be ignored since all S2SW* tests pass, but will defer to @aerorahul.

aerorahul commented 2 weeks ago

> enkf.x failed with an integer divide by zero error during the eupd job at line 724 of letkf.f90. This error does not make sense to me at that line as it is a variable declaration without any math:
>
> real(r_kind), dimension(:), allocatable :: work1
>
> Perhaps an earlier job produced some invalid data.
>
> Since this PR only deals with the wavepostpnt job, I think this failure could be ignored since all S2SW* tests pass, but will defer to @aerorahul.

I think that is fine. @CatherineThomas-NOAA has anyone you know experienced this on Hercules?

CatherineThomas-NOAA commented 2 weeks ago

@aerorahul: I haven't heard of this problem on Hercules yet, but I also don't know of anyone really cycling over there. We run the GSI ctests there so the enkf in general is exercised, but it's not exactly testing a wide range of cases. Thanks for the heads up.