NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Incorrect gfs job dependencies caused gfsmetpg2g1 failure #2899

Open RussTreadon-NOAA opened 1 week ago

RussTreadon-NOAA commented 1 week ago

What is wrong?

Jobs gfscleanup and gfsmetpg2g1 both depend only upon the completion of gfsarch, so the two jobs can run concurrently. This is problematic because gfscleanup removes the directory in which the gfsmetpg2g1 job is running.

This behavior was observed on Hera but likely impacts all machines.

What should have happened?

The gfsmetp suite of jobs should run to completion before gfscleanup removes the run directory.

What machines are impacted?

All or N/A, Hera

Steps to reproduce

  1. clone g-w develop
  2. set up g-w CI for GSI or JEDI ATM based DA
  3. cycle to gfsmetp jobs

Additional information

A test of g-w CI _C96C48_ufshybatmDA on Hera encountered the following scenario.

The 2024022400 gfsarch job completed. rocotorun then submitted gfscleanup and gfsmetpg2g1. Both of these jobs have a single XML dependency: completion of gfsarch.

        <dependency>
                <and>
                        <taskdep task="gfsarch"/>
                </and>
        </dependency>
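A quick way to check a workflow for this pattern is to list every task whose only dependency is gfsarch. The following is a minimal sketch, not part of g-w: the inline XML is a trimmed-down, hypothetical stand-in for the generated Rocoto XML (only the task names and the gfsarch dependency come from this report), and the helper name `tasks_depending_only_on` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified workflow XML. A real Rocoto file has many more
# attributes and elements; only the dependency structure matters here.
WORKFLOW_XML = """
<workflow>
  <task name="gfsarch"/>
  <task name="gfscleanup">
    <dependency><and><taskdep task="gfsarch"/></and></dependency>
  </task>
  <task name="gfsmetpg2g1">
    <dependency><and><taskdep task="gfsarch"/></and></dependency>
  </task>
</workflow>
"""


def tasks_depending_only_on(xml_text, upstream):
    """Return names of tasks whose sole dependency is a taskdep on `upstream`."""
    root = ET.fromstring(xml_text)
    matches = []
    for task in root.iter("task"):
        dep = task.find("dependency")
        if dep is None:
            continue
        # Leaf elements are the actual conditions; <and>/<or> wrappers have children.
        leaves = [el for el in dep.iter() if len(el) == 0]
        if (
            len(leaves) == 1
            and leaves[0].tag == "taskdep"
            and leaves[0].get("task") == upstream
        ):
            matches.append(task.get("name"))
    return matches


print(tasks_depending_only_on(WORKFLOW_XML, "gfsarch"))
```

Any two tasks returned by this check can be submitted in the same rocotorun pass, which is the situation described here.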

Jobs gfsmetpg2g1 and gfscleanup started at the same time, Mon Sep 9 17:03:20 UTC 2024. gfscleanup finished at Mon Sep 9 17:03:48 UTC 2024. One of the last actions gfscleanup performs is to remove the top-level gfs run directory for the cycle:

+ exglobal_cleanup.sh[118]: rm -rf /scratch1/NCEPDEV/stmp2/Russ.Treadon/RUNDIRS/prtest/gfs.2024022400
+ exglobal_cleanup.sh[120]: echo 'Cleanup /scratch1/NCEPDEV/stmp2/Russ.Treadon/RUNDIRS/prtest/gfs.2024022400 completed!'

Unfortunately, gfsmetpg2g1 was running in /scratch1/NCEPDEV/stmp2/Russ.Treadon/RUNDIRS/prtest/gfs.2024022400/metpg2g1.2502384. Removal of /scratch1/NCEPDEV/stmp2/Russ.Treadon/RUNDIRS/prtest/gfs.2024022400 deleted the gfsmetpg2g1 run directory. Job gfsmetpg2g1 aborted at Mon Sep 9 17:03:52 UTC 2024 with the error messages

OSError: [Errno 116] Stale file handle: 'python_gen_env_vars.sh'
+ exgrid2grid_step1.sh[46]: status=1
+ exgrid2grid_step1.sh[47]: [[ 1 -ne 0 ]]
+ exgrid2grid_step1.sh[47]: exit 1
+ JGFS_ATMOS_VERIFICATION[1]: postamble JGFS_ATMOS_VERIFICATION 1725901403 1
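The failure mode can be imitated outside the workflow. The sketch below is only an illustration under stated assumptions: the directory names are hypothetical temporary paths, and the "cleanup" and "job" steps run sequentially in one process rather than as two batch jobs. On a local filesystem the write is expected to fail with a "no such file or directory" error; the log above shows Errno 116 (stale file handle) instead, presumably because of the filesystem the run directory lives on.

```python
import os
import shutil
import tempfile


def write_after_cleanup():
    """Return the OSError raised when a job writes into a removed run directory."""
    # Hypothetical stand-ins for the cycle run directory and the job's subdirectory.
    cycle_dir = tempfile.mkdtemp(prefix="gfs.cycle.")
    job_dir = os.path.join(cycle_dir, "metpg2g1.example")
    os.makedirs(job_dir)

    # "Cleanup" removes the top-level cycle directory, as rm -rf does.
    shutil.rmtree(cycle_dir)

    # The "verification job" then tries to write a file in its working directory.
    try:
        with open(os.path.join(job_dir, "python_gen_env_vars.sh"), "w") as fh:
            fh.write("export EXAMPLE=1\n")
    except OSError as err:
        return err
    return None


err = write_after_cleanup()
print(type(err).__name__)
```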

Do you have a proposed solution?

No response

DavidHuber-NOAA commented 5 days ago

I will fix this as part of #2907