NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Optimize wavepostpnt #2657

Closed: DavidHuber-NOAA closed this pull request 2 weeks ago

DavidHuber-NOAA commented 4 weeks ago

Description

Optimizes the gfswavepostpnt job.

This is done chiefly by reducing the number of calls to sed, awk, grep, and cat.
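To illustrate the flavor of the change (a contrived sketch, not an excerpt from the actual diff), per-buoy text processing that previously spawned several utilities inside a loop can be collapsed into a single pass:

    # Hypothetical "before": one grep and one sed process spawned per buoy
    while read -r buoy; do
      grep "^${buoy} " allpoints.txt | sed 's/  */ /g' >> selected.txt
    done < buoy_list.txt

    # Hypothetical "after": one awk invocation does the selection and cleanup
    awk 'NR==FNR {want[$1]=1; next} ($1 in want) {gsub(/  */, " "); print}' \
        buoy_list.txt allpoints.txt > selected.txt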

How has this been tested?

C48_S2SW test case on Hercules. Outputs were identical to a non-optimized run. The optimized run took 80 minutes while the non-optimized version took 191 minutes.

JessicaMeixner-NOAA commented 3 weeks ago

@DavidHuber-NOAA First and foremost THANK YOU!!!! I'm about to run some tests with and without this change to compare. Would it be helpful if I ran this on a particular machine to take a look at certain aspects, or are there output results I can just double check instead?

DavidHuber-NOAA commented 3 weeks ago

@JessicaMeixner-NOAA With the UFS updates that came in yesterday, my baseline dataset will be out of date for comparisons, so I think you will want to generate a new baseline from develop. I believe a test on any system would be fine for comparing outputs between develop and the feature/opt_pnt branch. But if you would like to see timing differences, then I would recommend running on Hercules since it has had the biggest issues with the gfswavepostpnt job.

JessicaMeixner-NOAA commented 3 weeks ago

@DavidHuber-NOAA Happy to run a new test and thanks again for trying to make everyone's life easier while we work on the longer term solution!

JessicaMeixner-NOAA commented 3 weeks ago

@DavidHuber-NOAA wanted to let you know that I added the boundary point jobs in my test and they timed out, which isn't too surprising, to be honest. I've extended the wall-clock limit and am re-running. Hopefully I will have results by the end of the day. I will keep you posted!

JessicaMeixner-NOAA commented 3 weeks ago

@DavidHuber-NOAA I'm still having issues with jobs failing and timing out. I did get the develop branch tests to at least run the main point job, working on the boundary jobs.

For the test running this PR, I got a failure in the job:

   Executing the boundary point cat script at : Thu Jun  6 15:12:07 CDT 2024
   ------------------------------------

+ exgfs_wave_post_pnt.sh[562]: '[' 200 -gt 1 ']'
+ exgfs_wave_post_pnt.sh[564]: '[' YES = YES ']'
+ exgfs_wave_post_pnt.sh[565]: srun -l --export=ALL -n 200 '--multi-prog --output=mpmd.%j.%t.out' cmdmprogbuoy
srun: unrecognized option '--multi-prog --output=mpmd.%j.%t.out'
Try "srun --help" for more information
+ exgfs_wave_post_pnt.sh[1]: postamble exgfs_wave_post_pnt.sh 1717692662 255
+ preamble.sh[70]: set +x
End exgfs_wave_post_pnt.sh at 20:12:08 with error code 255 (time elapsed: 03:21:06)
+ JGLOBAL_WAVE_POST_PNT[1]: postamble JGLOBAL_WAVE_POST_PNT 1717692626 255
+ preamble.sh[70]: set +x
End JGLOBAL_WAVE_POST_PNT at 20:12:08 with error code 255 (time elapsed: 03:21:42)
+ wavepostpnt.sh[1]: postamble wavepostpnt.sh 1717692622 255
+ preamble.sh[70]: set +x

and another error:

+ exgfs_wave_post_pnt.sh[476]: srun -l --export=ALL -n 448 --multi-prog --output=mpmd.%j.%t.out cmdmprog
srun: warning: can't honor --ntasks-per-node set to 40 which doesn't match the requested tasks 448 with the number of requested nodes 12. Ignoring --ntasks-per-node.
srun: error: hercules-01-31: task 381: Exited with exit code 1
srun: Terminating StepId=1420346.0
srun: error: hercules-01-26: tasks 189-225: Terminated
srun: error: hercules-01-31: tasks 374-380,382-410: Terminated
srun: error: hercules-01-27: tasks 226-262: Terminated
srun: error: hercules-01-28: tasks 263-299: Terminated
srun: error: hercules-01-23: tasks 76-113: Terminated
srun: error: hercules-01-29: tasks 300-336: Terminated
srun: error: hercules-01-30: tasks 337-373: Terminated
srun: error: hercules-01-24: tasks 114-151: Terminated
srun: error: hercules-01-25: tasks 152-188: Terminated
srun: error: hercules-01-21: tasks 0-37: Terminated
srun: error: hercules-01-32: tasks 411-447: Terminated
srun: error: hercules-01-22: tasks 38-75: Terminated
srun: Force Terminated StepId=1420346.0
+ exgfs_wave_post_pnt.sh[1]: postamble exgfs_wave_post_pnt.sh 1717692640 143
+ preamble.sh[70]: set +x
End exgfs_wave_post_pnt.sh at 17:21:01 with error code 143 (time elapsed: 00:30:21)
+ JGLOBAL_WAVE_POST_BNDPNTBLL[1]: postamble JGLOBAL_WAVE_POST_BNDPNTBLL 1717692632 143
+ preamble.sh[70]: set +x

Log files are here: /work2/noaa/marine/jmeixner/hercules/testpointupdate/update/test01/COMROOT/test01/logs/2021032312

I'm going to look into this more. I'm curious if this is because of changes to the line calling wavempexec. I vaguely remember struggling a lot with the quoting ("") on that line.
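If it helps narrow things down, the first traceback does look like a quoting problem: srun received '--multi-prog --output=mpmd.%j.%t.out' as a single argument. A minimal reproduction and one common fix, with illustrative definitions rather than the actual lines from the script:

    # Fails: both flags live in one string, and the quoted expansion hands
    # them to srun as a single word, hence "unrecognized option".
    wavempexec="--multi-prog --output=mpmd.%j.%t.out"
    srun -l --export=ALL -n 200 "${wavempexec}" cmdmprogbuoy

    # Works: keep the flags as separate words (e.g. a bash array) so each
    # one reaches srun as its own argument.
    wavempexec=(--multi-prog --output="mpmd.%j.%t.out")
    srun -l --export=ALL -n 200 "${wavempexec[@]}" cmdmprogbuoy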

DavidHuber-NOAA commented 3 weeks ago

@JessicaMeixner-NOAA Thanks for testing. I'll dig in and see what the issue is.

JessicaMeixner-NOAA commented 3 weeks ago

@DavidHuber-NOAA I can help more next week if I need to as well. Need to do a couple other things today first though. I'll check in with you early next week.

DavidHuber-NOAA commented 3 weeks ago

@JessicaMeixner-NOAA thanks! Hopefully I will be able to get somewhere today.

I did have a couple of ideas on the timing out jobs. Perhaps if we run a shorter forecast (say 48 hours), the pnt and bullpnt jobs would be able to finish.

If instead we should be testing out to 120 hours, then we may be able to scale back the number of tasks/node to say 20.

Which one would you prefer? I'll test that way today.

JessicaMeixner-NOAA commented 3 weeks ago

I've just been testing C48_S2SW and then turning on the boundary points (lowering the forecast hours will help speed up testing/checking, so probably a good idea for faster turn-around). That being said, I think we'll want whatever is supposed to be running in the CI to work. The only job that's usually run with the CI is the point job (not the boundaries), and I think that should run okay as is.

WalterKolczynski-NOAA commented 3 weeks ago

> @JessicaMeixner-NOAA thanks! Hopefully I will be able to get somewhere today.
>
> I did have a couple of ideas on the timing out jobs. Perhaps if we run a shorter forecast (say 48 hours), the pnt and bullpnt jobs would be able to finish.
>
> If instead we should be testing out to 120 hours, then we may be able to scale back the number of tasks/node to say 20.
>
> Which one would you prefer? I'll test that way today.

We're testing out the full 16 days, but only on WCOSS with the C96_atm3DVar_extended test, so no wave test.

DavidHuber-NOAA commented 3 weeks ago

@JessicaMeixner-NOAA I fixed a few bugs in my optimizations and all jobs have run to completion for a 24-hour forecast on Hercules. I did have to change the layout to get the current develop bndpntbll jobs to run in a reasonable period of time; I did this by limiting the number of tasks run per node. These are the layouts and runtimes for the 24-hour forecasts of the optimized and develop versions of these jobs:


Develop

    Job Name                Tasks/Node   # Nodes   Runtime (s)
    gfswavepostbndpnt           10          24         8400
    gfswavepostbndpntbll        20          24         2269
    gfswavepostpnt              40           5         6171

Optimized

    Job Name                Tasks/Node   # Nodes   Runtime (s)
    gfswavepostbndpnt           40           6         6186
    gfswavepostbndpntbll        40          12         1554
    gfswavepostpnt              40           5         4177

Based on these runtimes, I adjusted the config.resources wtime for the wavepostbndpnt job upward.
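Roughly, the kind of config.resources entry involved looks like this (sketch only: the variable names assume the wtime_*/npe_node_* naming convention and the values are illustrative, not the committed ones):

    # Sketch only: assumed variable names, illustrative values.
    # The develop bndpnt job needed ~8400 s (about 2 h 20 min), so the
    # wall-clock limit is bumped to leave some margin.
    export wtime_wavepostbndpnt="03:00:00"
    export npe_node_wavepostbndpnt=40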

I also had to manually adjust the data dependency for the gfswavepostbndpntbll job, which expects the history file gfs.t12z.atm.logf180.txt; in my XMLs this was changed to ...024.txt. I made changes to workflow/rocoto/gfs_tasks.py and workflow/rocoto/gefs_tasks.py so the dependency looks for an earlier forecast hour when the maximum forecast length is less than 180 hours.
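In simplified form, the capping logic is along these lines (the helper name below is made up for illustration; the real change lives in the dependency-building code of gfs_tasks.py/gefs_tasks.py):

    # Simplified sketch: cap the history-file hour used by the bndpntbll
    # data dependency at the configured maximum forecast length.
    def bndpntbll_dep_fhr(fhmax_gfs: int, default_fhr: int = 180) -> int:
        return min(default_fhr, fhmax_gfs)

    # With a 24-hour forecast the dependency becomes gfs.t12z.atm.logf024.txt
    fhr = bndpntbll_dep_fhr(24)
    dep_file = f"gfs.t12z.atm.logf{fhr:03d}.txt"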

The expdirs and comroots for these test cases are located here:

develop

/work2/noaa/global/dhuber/para/EXPDIR/dev_bullpnt
/work2/noaa/global/dhuber/para/COMROOT/dev_bullpnt

feature/opt_pnt

/work2/noaa/global/dhuber/para/EXPDIR/opt_bullpnt
/work2/noaa/global/dhuber/para/COMROOT/opt_bullpnt

The contents of the spec, bull, and cbull tarballs (i.e. gfswave.t12z.bull_tar, gfswave.t12z.cbull_tar, gfswave.t12z.ibpbull_tar, gfswave.t12z.ibpcbull_tar, gfswave.t12z.ibp_tar, and gfswave.t12z.spec_tar.gz in ${COMROOT}/gfs.20210323/12/products/wave/station) were extracted to /work2/noaa/global/dhuber/station_data/opt and /work2/noaa/global/dhuber/station_data/dev and compared file by file. There were no differences between the experiments.
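For anyone repeating the comparison, the procedure was roughly along these lines (the paths are the ones above; the exact commands may have differed):

    # Approximate recipe, not the exact commands used.
    cd /work2/noaa/global/dhuber/station_data/opt
    for f in /work2/noaa/global/dhuber/para/COMROOT/opt_bullpnt/gfs.20210323/12/products/wave/station/gfswave.t12z.*tar*; do
      tar -xf "${f}"
    done
    cd /work2/noaa/global/dhuber/station_data/dev
    for f in /work2/noaa/global/dhuber/para/COMROOT/dev_bullpnt/gfs.20210323/12/products/wave/station/gfswave.t12z.*tar*; do
      tar -xf "${f}"
    done
    # File-by-file comparison of the two extractions
    diff -r /work2/noaa/global/dhuber/station_data/opt /work2/noaa/global/dhuber/station_data/dev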

DavidHuber-NOAA commented 3 weeks ago

Marking this PR ready for review.

JessicaMeixner-NOAA commented 3 weeks ago

Thanks @DavidHuber-NOAA ! I'll get this reviewed today

JessicaMeixner-NOAA commented 3 weeks ago

@DavidHuber-NOAA apologies for not getting to this yesterday. I'm trying to look at your output this morning and do not have permission to see: /work2/noaa/global/dhuber/para/COMROOT/dev_bullpnt/gfs.20210323/12/products

DavidHuber-NOAA commented 3 weeks ago

@JessicaMeixner-NOAA Thanks for taking a look. It should be opened up now.

JessicaMeixner-NOAA commented 3 weeks ago

@DavidHuber-NOAA thanks! i can see the directories now

DavidHuber-NOAA commented 2 weeks ago

Starting CI on Hera and Hercules.

emcbot commented 2 weeks ago

Experiment C48mx500_3DVarAOWCDA FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2657/RUNTESTS/COMROOT/C48mx500_3DVarAOWCDA_8f40ef76/logs/2021032412/gdasfcst.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 2 weeks ago

Experiment C96_atm3DVar FAILED on Hera in /scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C96_atm3DVar_8f40ef76

emcbot commented 2 weeks ago

Experiment C48_S2SW FAILED on Hera in /scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C48_S2SW_8f40ef76

emcbot commented 2 weeks ago

Experiment C96_atmaerosnowDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C96_atmaerosnowDA_8f40ef76

emcbot commented 2 weeks ago

Experiment C96C48_hybatmDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C96C48_hybatmDA_8f40ef76

emcbot commented 2 weeks ago

Experiment C48mx500_3DVarAOWCDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C48mx500_3DVarAOWCDA_8f40ef76

emcbot commented 2 weeks ago

Experiment C48_ATM FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2657/RUNTESTS/COMROOT/C48_ATM_8f40ef76/logs/2021032312/gfsfcst.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 2 weeks ago

Experiment C48_ATM FAILED on Hera in /scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C48_ATM_8f40ef76

emcbot commented 2 weeks ago

Experiment C48_S2SWA_gefs FAILED on Hera in /scratch1/NCEPDEV/global/CI/2657/RUNTESTS/C48_S2SWA_gefs_8f40ef76

DavidHuber-NOAA commented 2 weeks ago

Hera failed due to stmp being full.

emcbot commented 2 weeks ago

Experiment C96C48_hybatmDA FAILED on Hercules with error logs:

/work2/noaa/stmp/CI/HERCULES/2657/RUNTESTS/COMROOT/C96C48_hybatmDA_8f40ef76/logs/2021122106/enkfgdaseupd.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 2 weeks ago

Experiment C96C48_hybatmDA FAILED on Hercules in /work2/noaa/stmp/CI/HERCULES/2657/RUNTESTS/C96C48_hybatmDA_8f40ef76

DavidHuber-NOAA commented 2 weeks ago

enkf.x failed with an integer divide by zero error during the eupd job at line 724 of letkf.f90. This error does not make sense to me at that line as it is a variable declaration without any math:

real(r_kind), dimension(:), allocatable :: work1

Perhaps an earlier job produced some invalid data.

Since this PR only deals with the wavepostpnt job, I think this failure could be ignored since all S2SW* tests pass, but will defer to @aerorahul.

aerorahul commented 2 weeks ago

> enkf.x failed with an integer divide by zero error during the eupd job at line 724 of letkf.f90. This error does not make sense to me at that line as it is a variable declaration without any math:
>
> real(r_kind), dimension(:), allocatable :: work1
>
> Perhaps an earlier job produced some invalid data.
>
> Since this PR only deals with the wavepostpnt job, I think this failure could be ignored since all S2SW* tests pass, but will defer to @aerorahul.

I think that is fine. @CatherineThomas-NOAA has anyone you know experienced this on Hercules?

CatherineThomas-NOAA commented 2 weeks ago

@aerorahul: I haven't heard of this problem on Hercules yet, but I also don't know of anyone really cycling over there. We run the GSI ctests there so the enkf in general is exercised, but it's not exactly testing a wide range of cases. Thanks for the heads up.