NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Cleanup of stale RUNDIRS from an experiment #2719

Closed aerorahul closed 4 days ago

aerorahul commented 5 days ago

Description

This PR:

Type of change

Change characteristics

How has this been tested?

Checklist

emcbot commented 5 days ago

CI Update on Wcoss2 at 06/25/24 10:27:08 PM
============================================
Cloning and Building global-workflow PR: 2719
with PID: 222260 on host: dlogin08

emcbot commented 5 days ago

Automated global-workflow Testing Results:


Machine: Wcoss2
Start: Tue Jun 25 22:31:13 UTC 2024 on dlogin08
---------------------------------------------------
Build: Completed at 06/25/24 11:10:12 PM
Case setup: Completed for experiment C48_ATM_923039db
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_923039db
Case setup: Skipped for experiment C48_S2SWA_gefs_923039db
Case setup: Completed for experiment C48_S2SW_923039db
Case setup: Completed for experiment C96_atm3DVar_extended_923039db
Case setup: Skipped for experiment C96_atm3DVar_923039db
Case setup: Skipped for experiment C96_atmaerosnowDA_923039db
Case setup: Completed for experiment C96C48_hybatmDA_923039db
Case setup: Completed for experiment C96C48_ufs_hybatmDA_923039db

emcbot commented 5 days ago

Experiment C48_ATM_923039db SUCCESS on Wcoss2 at 06/26/24 12:32:19 AM

emcbot commented 5 days ago

Experiment C48_S2SW_923039db SUCCESS on Wcoss2 at 06/26/24 12:36:22 AM

emcbot commented 5 days ago

Experiment C96C48_hybatmDA_923039db SUCCESS on Wcoss2 at 06/26/24 02:12:23 AM

emcbot commented 5 days ago

Experiment C96C48_ufs_hybatmDA_923039db SUCCESS on Wcoss2 at 06/26/24 02:12:27 AM

emcbot commented 4 days ago

Experiment C96_atm3DVar_extended_923039db SUCCESS on Wcoss2 at 06/26/24 10:16:33 AM

emcbot commented 4 days ago

All CI Test Cases Passed on Wcoss2:


Experiment C48_ATM_923039db *** SUCCESS *** at 06/26/24 12:32:19 AM
Experiment C48_S2SW_923039db *** SUCCESS *** at 06/26/24 12:36:22 AM
Experiment C96C48_hybatmDA_923039db *** SUCCESS *** at 06/26/24 02:12:23 AM
Experiment C96C48_ufs_hybatmDA_923039db *** SUCCESS *** at 06/26/24 02:12:27 AM
Experiment C96_atm3DVar_extended_923039db *** SUCCESS *** at 06/26/24 10:16:33 AM

CoryMartin-NOAA commented 3 days ago

FYI I'm not sure this works properly. A brand new experiment today is failing in the enkfgdascleanup job. My hypothesis is that files are being deleted by other jobs running concurrently and causing the cleanup job to fail. I fear that the find command finds all the files first, then it goes through them to check the dates, and if the file is subsequently deleted (by a successful job), the cleanup will crash. But this is purely a hypothesis.
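The hypothesized failure mode can be reproduced in isolation. The sketch below is illustrative only: it is Python rather than the actual `find` invocation in scripts/exglobal_cleanup.sh, and the names `list_then_stat`, `vanish`, and `demo` are hypothetical. It takes a directory listing first, removes one entry as a stand-in for a concurrently finishing job, and then tries to read the modification time of every listed entry.

```python
import os
import tempfile


def list_then_stat(root, vanish=None):
    """Two-phase scan: snapshot the directory, then stat each listed entry."""
    names = sorted(os.listdir(root))  # phase 1: snapshot of the directory
    if vanish is not None:
        # Stand-in for another job deleting a file after the snapshot was taken.
        os.remove(os.path.join(root, vanish))
    mtimes = {}
    for name in names:  # phase 2: inspect every entry from the snapshot
        mtimes[name] = os.stat(os.path.join(root, name)).st_mtime
    return mtimes


def demo():
    """Return "crashed" if the scan fails on a vanished file, else "ok"."""
    with tempfile.TemporaryDirectory() as root:
        for name in ("a.tmp", "b.tmp"):
            open(os.path.join(root, name), "w").close()
        try:
            list_then_stat(root, vanish="b.tmp")
        except FileNotFoundError:
            return "crashed"
        return "ok"
```

Under these assumptions, the second phase raises `FileNotFoundError` for the entry that disappeared, which is the kind of error that would abort a cleanup script that treats any failure as fatal.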

aerorahul commented 3 days ago

> FYI I'm not sure this works properly. A brand new experiment today is failing in the enkfgdascleanup job. My hypothesis is that files are being deleted by other jobs running concurrently and causing the cleanup job to fail. I fear that the find command finds all the files first, then it goes through them to check the dates, and if the file is subsequently deleted (by a successful job), the cleanup will crash. But this is purely a hypothesis.

hmm .. race conditions are the best.

CoryMartin-NOAA commented 3 days ago

With no other jobs running, a rocotoboot causes it to run to completion. So yeah, I am 95% sure this is the root cause of the issue. The cleanup cannot run while any other job is working in the RUNDIRS.
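One possible way to make a stale-file sweep tolerate concurrent deletions is to treat a vanished entry as already cleaned up instead of as an error. The following is a minimal sketch under stated assumptions, not the code in scripts/exglobal_cleanup.sh: the function name `remove_stale`, its parameters, and the age threshold are all hypothetical.

```python
import os
import shutil
import time


def remove_stale(root, max_age_seconds, now=None):
    """Remove entries directly under root that are older than max_age_seconds.

    Entries that disappear while the sweep is running are skipped silently.
    Returns the sorted names of the entries this call removed.
    """
    now = time.time() if now is None else now
    removed = []
    try:
        names = os.listdir(root)
    except FileNotFoundError:
        return removed  # the whole directory is already gone
    for name in names:
        path = os.path.join(root, name)
        try:
            if now - os.lstat(path).st_mtime <= max_age_seconds:
                continue  # recent enough to keep
            if os.path.isdir(path) and not os.path.islink(path):
                # ignore_errors=True also covers files vanishing mid-removal
                shutil.rmtree(path, ignore_errors=True)
            else:
                os.remove(path)
        except FileNotFoundError:
            continue  # another job removed it first
        removed.append(name)
    return sorted(removed)
```

This only addresses the crash. It does not answer whether it is safe to delete from a run directory that another job is still using, so it is not a substitute for keeping the cleanup from overlapping with active jobs.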

aerorahul commented 2 days ago

@CoryMartin-NOAA I will open a PR to revert this.

@RussTreadon-NOAA Would you be willing to comment out the two find-and-remove lines in scripts/exglobal_cleanup.sh in your PR #2700 so we can kick off the testing in it again? Thanks!

RussTreadon-NOAA commented 2 days ago

@aerorahul, an updated exglobal_cleanup.sh has been committed to feature/rename_atm. Done at c1ef4b30.

aerorahul commented 2 days ago

The CI has been kicked off in #2700 on Hera.