The wave_post_bndpnt job walltime is extended in production. (9/11/23 update) An "ldebug" flag was added to the PBS statements in the wave post ecf scripts to resolve long runtimes with those jobs. Initially the walltimes were extended, but once the debug flag was added the runtimes came back down and the walltimes were reverted. This is a temporary measure while the cause is determined and resolved. Plan to revert walltime change eventually.
Initial email from NCO:
Andrew
We schedule the GFS changes on coming Monday 9/11 at 1430z. It will be GFS v16.3.9 (from original para
gfs v16.3.8) since we implemented an ARFC v16.3.8 (based on prod v16.3.7) on 9/6 to add "ldebug" in
wave_post*pnt ecf scripts to resolve the long runtime issue.
Please let me know if you have any questions or need more information.
Thanks,
/Simon
SPA Office
Mentions of issue from SDM logs:
9/5 log:
CONTINUED...GFS WAVE WALLTIME EXTENDED/FAILURES
NWPS PROD JOB FAILURES
2a. 1104Z - SOS Fred extended the wallclock of 06Z
gfs_wave_post_bndpnt and gfs_wave_post_bndpntbll
2b. 1258Z -
/prod/primary/cron/nwps/v1.4/regions/ER/gyx/jnwps_prep failed
due to the missing 06Z GFS data.
2c. 1346Z - GFS job finished. Fred reran the nwps prep job to
completion.
2d. 1710Z - Same for 12Z jgfs_wave_post_bndpntbll.
2e. 1808Z - Fred extended the wallclock of 12Z jgfs_wave_postpnt
job.
2f. 2318Z - SOS Houmin reported that aborted:
/prod/primary/18/gfs/v16.3/gfs/wave/post/jgfs_wave_post_bndpnt
failed due to walltime limit. Reran with increased time. (RJS)
2g. 0056Z - Houmin reports job
/prod/primary/cron/nwps/v1.4/regions/ER/gyx/jnwps_prep failed as
waiting gfs post jobs. Houmin will rerun when the gfs wave
completes. (RJS)
2h. 0247Z - SOS Kevin reports that the NWPS rerun completed.
The 18Z gfs wave is still running. 0158Z - GFS wave job
completed. (RJS)
2i. 0658Z - Job
/prod/primary/cron/nwps/v1.4/regions/ER/gyx/jnwps_prep failed
due to missing 00Z GFS wave data. 0741Z - SOS Kevin reports
that the nwps prep job rerun completed. 0825Z - SS Kevin
reports that the 00Z GFS wave post job runs completed. (RLR)
9/6 log:
CONTINUED...GFS WAVE WALLTIME EXTENDED
3a. 1212Z - SOS Ying noted the gfs_wave_post_bndpnt job was
running long and wall time for the job has been extended. 1340Z
- Complete.
3b. 2105Z - SPA Simon reports testing continues and requested
permission to increase the wallclock of the jobs that are
failing to allow ops to run smoother overnight. Approved. (KAL)
9/11 email from Simon:
The "ldebug" was added into gfs_wave post*pnt* ecf scripts to resolve the long runtime issue.
Per GDIT, it only remounts /apps; lustre debugging actions are skipped. Since the
"ldebug" option works, we have used the original walltime for these gfs_wave post*pnt* ecf
scripts in prod gfs.v16.3.8 and v16.3.9.
Target version
v16.3.8
Expected workflow changes
Walltimes in relevant ecf scripts.
Add additional debug PBS statements into wave post ecf scripts.
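To check which wave post ecf scripts carry the changed directives, a minimal sketch along these lines could list every #PBS line that mentions walltime or debug. The sample script header, the walltime value, and the `debug=true` spelling of the flag are assumptions for illustration only; this issue refers to the flag simply as "ldebug" and does not quote the exact PBS statement.

```python
import re

# Matches PBS directive lines such as "#PBS -l walltime=00:30:00"
# and captures everything after the "#PBS" prefix.
PBS_LINE = re.compile(r"^\s*#PBS\s+(.*\S)\s*$")


def pbs_directives(script_text, keywords=("walltime", "debug")):
    """Return the PBS directive bodies that mention any of the keywords."""
    found = []
    for line in script_text.splitlines():
        match = PBS_LINE.match(line)
        if match and any(k in match.group(1) for k in keywords):
            found.append(match.group(1))
    return found


# Hypothetical ecf header: the values are placeholders, not production settings,
# and "debug=true" is an assumed spelling of the "ldebug" flag.
SAMPLE_ECF = """#PBS -N gfs_wave_post_bndpnt_%CYC%
#PBS -l walltime=01:00:00
#PBS -l debug=true
export cyc=%CYC%
"""

if __name__ == "__main__":
    for directive in pbs_directives(SAMPLE_ECF):
        print(directive)
```

Running the same function over the text of each gfs_wave post*pnt* ecf script would show at a glance whether a script still has an extended walltime or is missing the debug statement.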
FYI @JessicaMeixner-NOAA