NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)

https://global-workflow.readthedocs.io/en/latest

GNU Lesser General Public License v3.0

Cleanup of stale RUNDIRS from an experiment #2719

aerorahul closed 4 days ago

aerorahul commented 5 days ago

Description

This PR:

- removes stale temporary scratch run directories from $DATAROOT/ every 3 days, which should help scrub failed attempts.
- removes an unused variable RUNDIR defined in config.base.
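
For readers following along, the kind of cleanup described above can be sketched roughly as follows. This is a minimal illustration and not the code added to the repository: the function name `remove_stale_rundirs`, its arguments, and the choice to look only at top-level directories are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; NOT the actual change made in this PR.
# remove_stale_rundirs and its arguments are illustrative names.

# Remove immediate subdirectories of a scratch root whose modification
# time is older than a given number of days (default 3).
remove_stale_rundirs() {
  local dataroot="$1"
  local max_age_days="${2:-3}"

  # Nothing to do if the scratch root does not exist.
  [[ -d "${dataroot}" ]] || return 0

  # -mindepth 1 protects the root itself; -maxdepth 1 limits the search
  # to its immediate children. find ignores fractional days, so
  # "-mtime +3" only matches entries that are at least 4 full days old.
  find "${dataroot}" -mindepth 1 -maxdepth 1 -type d \
    -mtime "+${max_age_days}" -exec rm -rf {} +
}

# Example call (assumes DATAROOT is set by the experiment configuration):
#   remove_stale_rundirs "${DATAROOT}" 3
```

Only the directory's own modification time is checked here, so a directory that a running job is still writing into could, in principle, look stale.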

Type of change

Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

Is this a breaking change (a change in existing functionality)? NO
Does this change require a documentation update? NO

How has this been tested?

The relevant segment of the change was tested offline, in a standalone shell script, against older RUNDIRs from previous experiments.

Checklist

[ ] Any dependent changes have been merged and published
[ ] My code follows the style guidelines of this project
[ ] I have performed a self-review of my own code
[ ] I have commented my code, particularly in hard-to-understand areas
[ ] My changes generate no new warnings
[ ] New and existing tests pass with my changes
[ ] I have made corresponding changes to the documentation if necessary

emcbot commented 5 days ago


CI Update on Wcoss2 at 06/25/24 10:27:08 PM
============================================
Cloning and Building global-workflow PR: 2719
with PID: 222260 on host: dlogin08

emcbot commented 5 days ago

Automated global-workflow Testing Results:


Machine: Wcoss2
Start: Tue Jun 25 22:31:13 UTC 2024 on dlogin08
---------------------------------------------------
Build: Completed at 06/25/24 11:10:12 PM
Case setup: Completed for experiment C48_ATM_923039db
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_923039db
Case setup: Skipped for experiment C48_S2SWA_gefs_923039db
Case setup: Completed for experiment C48_S2SW_923039db
Case setup: Completed for experiment C96_atm3DVar_extended_923039db
Case setup: Skipped for experiment C96_atm3DVar_923039db
Case setup: Skipped for experiment C96_atmaerosnowDA_923039db
Case setup: Completed for experiment C96C48_hybatmDA_923039db
Case setup: Completed for experiment C96C48_ufs_hybatmDA_923039db

emcbot commented 5 days ago

Experiment C48_ATM_923039db SUCCESS on Wcoss2 at 06/26/24 12:32:19 AM

emcbot commented 5 days ago

Experiment C48_S2SW_923039db SUCCESS on Wcoss2 at 06/26/24 12:36:22 AM

emcbot commented 5 days ago

Experiment C96C48_hybatmDA_923039db SUCCESS on Wcoss2 at 06/26/24 02:12:23 AM

emcbot commented 5 days ago

Experiment C96C48_ufs_hybatmDA_923039db SUCCESS on Wcoss2 at 06/26/24 02:12:27 AM

emcbot commented 4 days ago

Experiment C96_atm3DVar_extended_923039db SUCCESS on Wcoss2 at 06/26/24 10:16:33 AM

emcbot commented 4 days ago

All CI Test Cases Passed on Wcoss2:


Experiment C48_ATM_923039db *** SUCCESS *** at 06/26/24 12:32:19 AM
Experiment C48_S2SW_923039db *** SUCCESS *** at 06/26/24 12:36:22 AM
Experiment C96C48_hybatmDA_923039db *** SUCCESS *** at 06/26/24 02:12:23 AM
Experiment C96C48_ufs_hybatmDA_923039db *** SUCCESS *** at 06/26/24 02:12:27 AM
Experiment C96_atm3DVar_extended_923039db *** SUCCESS *** at 06/26/24 10:16:33 AM

CoryMartin-NOAA commented 3 days ago

FYI I'm not sure this works properly. A brand new experiment today is failing in the enkfgdascleanup job. My hypothesis is that files are being deleted by other jobs running concurrently and causing the cleanup job to fail. I fear that the find command finds all the files first, then it goes through them to check the dates, and if the file is subsequently deleted (by a successful job), the cleanup will crash. But this is purely a hypothesis.
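
If that hypothesis holds, one possible mitigation is to make the scan tolerant of entries that vanish while it runs. The sketch below is an assumption-laden illustration, not a proposed patch: `remove_stale_entries` is a hypothetical name, and it relies on GNU find's `-ignore_readdir_race` option, which is documented to suppress the error for a file that disappears after its name is read from a directory but before find examines it.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; remove_stale_entries is an illustrative name and
# this is not the content of scripts/exglobal_cleanup.sh.

# Delete entries under a root that are older than a given number of days,
# without failing when another process removes an entry mid-scan.
remove_stale_entries() {
  local root="$1"
  local max_age_days="${2:-3}"

  # -ignore_readdir_race is a GNU find option (an assumption about the
  # platform). "|| true" keeps any remaining nonzero exit status from
  # find from aborting a caller that runs under "set -e".
  # rm -f does not complain about paths that are already gone, and
  # xargs -r skips the rm call entirely when nothing matched.
  {
    find "${root}" -ignore_readdir_race -mindepth 1 \
      -mtime "+${max_age_days}" -print0 2>/dev/null || true
  } | xargs -0 -r rm -rf --
}
```

This only addresses errors raised while scanning; it does nothing to stop the cleanup from deleting old files that a concurrent job still needs.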

aerorahul commented 3 days ago

hmm .. race conditions are the best.

CoryMartin-NOAA commented 3 days ago

With no other jobs running, a rocotoboot causes it to run to completion. So yeah, I am 95% sure this is the root cause of the issue. The cleanup cannot run while any other job is working in the RUNDIRS.
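
One way to respect that constraint without serializing the whole workflow would be to skip any run directory that shows recent activity. The following is a rough sketch under stated assumptions: the names `is_idle` and `cleanup_idle_rundirs`, the 60-minute quiet window, and the use of file modification times as a proxy for "a job is working here" are all hypothetical, and a job that starts between the check and the removal would still race.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; is_idle and cleanup_idle_rundirs are illustrative
# names, not functions from global-workflow.

# Succeeds if nothing under the directory was modified in the last N minutes.
is_idle() {
  local dir="$1"
  local quiet_minutes="$2"
  local recent
  # -mmin -N matches entries modified less than N minutes ago;
  # -print -quit stops at the first match.
  recent="$(find "${dir}" -mmin "-${quiet_minutes}" -print -quit 2>/dev/null)"
  [[ -z "${recent}" ]]
}

# Remove top-level directories that are both old and currently idle.
cleanup_idle_rundirs() {
  local dataroot="$1"
  local max_age_days="${2:-3}"
  local quiet_minutes="${3:-60}"
  local dir

  while IFS= read -r -d '' dir; do
    # Leave directories with recent writes alone; a job may be active.
    is_idle "${dir}" "${quiet_minutes}" || continue
    # A failed removal is reported but does not fail the cleanup.
    rm -rf "${dir}" || echo "WARNING: could not remove ${dir}" >&2
  done < <(find "${dataroot}" -mindepth 1 -maxdepth 1 -type d \
             -mtime "+${max_age_days}" -print0 2>/dev/null)
  return 0
}
```

The function always returns 0 by design, so a partially failed scrub would not mark the cleanup job as failed.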

aerorahul commented 2 days ago

@CoryMartin-NOAA I will open a PR to revert this.

@RussTreadon-NOAA Would you be willing to comment out the two find-and-remove lines in scripts/exglobal_cleanup.sh in your PR #2700, so we can kick off the testing in it again? Thanks!

RussTreadon-NOAA commented 2 days ago

@aerorahul, an updated exglobal_cleanup.sh has been committed to feature/rename_atm. Done at c1ef4b30.

aerorahul commented 2 days ago

The CI has been kicked off in #2700 on Hera.