NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

GFSv16.3.8 - add debug flag to resolve wave post job runtime issues #1843

Closed KateFriedman-NOAA closed 1 year ago

KateFriedman-NOAA commented 1 year ago

Description

The wave_post_bndpnt job walltime is extended in production. (9/11/23 update) An ldebug flag was added to the PBS statements in the wave post ecf scripts to resolve the long runtimes of those jobs. Initially the walltimes were extended, but once the debug flag was added the runtimes came back down and the walltimes were reverted. This is a temporary measure while the cause is determined and resolved. Plan to revert walltime change eventually.

Initial email from NCO:

Andrew
We schedule the GFS changes on coming Monday 9/11 at 1430z. It will be GFS v16.3.9 (from original para 
gfs v16.3.8) since we implemented an ARFC v16.3.8 (based on prod v16.3.7) on 9/6 to add ldebug in
wave_post*pnt ecf scripts to resolve the long runtime issue.

Please let me know if you have any questions or need more information.

Thanks,

/Simon
SPA Office

Mentions of issue from SDM logs:

9/5 log:

CONTINUED...GFS WAVE WALLTIME EXTENDED/FAILURES
NWPS PROD JOB FAILURES

2a. 1104Z - SOS Fred extended the wallclock of 06Z
gfs_wave_post_bndpnt and gfs_wave_post_bndpntbll

2b. 1258Z -
/prod/primary/cron/nwps/v1.4/regions/ER/gyx/jnwps_prep failed
due to the missing 06Z GFS data.

2c. 1346Z - GFS job finished. Fred reran the nwps prep job to
completion.

2d. 1710Z - Same for 12Z jgfs_wave_post_bndpntbll.

2e. 1808Z - Fred extended the wallclock of 12Z jgfs_wave_postpnt
job.

2f. 2318Z - SOS Houmin reported that aborted:
/prod/primary/18/gfs/v16.3/gfs/wave/post/jgfs_wave_post_bndpnt
failed due to walltime limit. Reran with increased time. (RJS)

2g. 0056Z - Houmin reports job
/prod/primary/cron/nwps/v1.4/regions/ER/gyx/jnwps_prep failed as
waiting gfs post jobs.  Houmin will rerun when the gfs wave
completes. (RJS)

2h. 0247Z - SOS Kevin reports that the NWPS rerun completed.
The 18Z gfs wave is still running. 0158Z - GFS wave job
completed. (RJS)

2i. 0658Z - Job
/prod/primary/cron/nwps/v1.4/regions/ER/gyx/jnwps_prep failed
due to missing 00Z GFS wave data.  0741Z - SOS Kevin reports
that the nwps prep job rerun completed.  0825Z - SS Kevin
reports that the 00Z GFS wave post job runs completed. (RLR)

9/6 log:

CONTINUED...GFS WAVE WALLTIME EXTENDED

3a. 1212Z - SOS Ying noted the gfs_wave_post_bndpnt job was
running long and wall time for the job has been extended. 1340Z
- Complete.

3b. 2105Z - SPA Simon reports testing continues and requested
permission to increase the wallclock of the jobs that are
failing to allow ops to run smoother overnight. Approved. (KAL)

9/11 email from Simon:

The "ldebug" was added into gfs_wave post*pnt* ecf scripts to resolve the long runtime issue.
According to GDIT, it only remounts /apps; Lustre debugging actions are skipped. Since the
"ldebug" option works, we have used the original walltime for these gfs_wave post*pnt* ecf
scripts in prod gfs.v16.3.8 and v16.3.9.

Target version

v16.3.8

Expected workflow changes

Walltimes in relevant ecf scripts. Add debug PBS statements to the wave post ecf scripts.
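
As a rough illustration of the change described above, the sketch below checks an ecf script header for a debug PBS statement and adds one after the last `#PBS` line if it is missing. The issue does not show the literal directive, so the form `#PBS -l debug=true` is an assumption about what "ldebug" refers to. The helper names and the sample header are hypothetical and are not taken from the actual ecf scripts.

```python
import re

# Assumption: the "ldebug" flag corresponds to a PBS resource request written
# as "#PBS -l debug=true"; the issue text does not show the literal line.
DEBUG_DIRECTIVE = "#PBS -l debug=true"
_DEBUG_RE = re.compile(r"^#PBS\s+-l\s+debug=true\s*$")


def has_debug_directive(ecf_text: str) -> bool:
    """Return True if any line of the script is the assumed debug directive."""
    return any(_DEBUG_RE.match(line) for line in ecf_text.splitlines())


def add_debug_directive(ecf_text: str) -> str:
    """Insert the debug directive after the last #PBS line, if not present."""
    if has_debug_directive(ecf_text):
        return ecf_text
    lines = ecf_text.splitlines()
    pbs_idx = [i for i, line in enumerate(lines) if line.startswith("#PBS")]
    # Fall back to the top of the file when there are no #PBS lines at all.
    insert_at = pbs_idx[-1] + 1 if pbs_idx else 0
    lines.insert(insert_at, DEBUG_DIRECTIVE)
    return "\n".join(lines) + "\n"


# Hypothetical header for illustration only, not copied from a real ecf script.
sample = "#PBS -N example_wave_post\n#PBS -l walltime=00:30:00\n\necho run\n"
patched = add_debug_directive(sample)
print(patched)
```

Running `add_debug_directive` a second time on its own output leaves the text unchanged, so the same check could be applied to several `*pnt*` ecf scripts without duplicating the statement.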

FYI @JessicaMeixner-NOAA

KateFriedman-NOAA commented 1 year ago

Created release branch off of dev/gfs.v16 branch for this ARFC (release/gfs.v16.3.8).

KateFriedman-NOAA commented 1 year ago

Will open separate issue to make similar change for wave post jobs in develop branch. Completing this issue.