Closed aerorahul closed 4 days ago
CI Update on Wcoss2 at 06/25/24 10:27:08 PM
============================================
Cloning and Building global-workflow PR: 2719
with PID: 222260 on host: dlogin08
Automated global-workflow Testing Results:
Machine: Wcoss2
Start: Tue Jun 25 22:31:13 UTC 2024 on dlogin08
---------------------------------------------------
Build: Completed at 06/25/24 11:10:12 PM
Case setup: Completed for experiment C48_ATM_923039db
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_923039db
Case setup: Skipped for experiment C48_S2SWA_gefs_923039db
Case setup: Completed for experiment C48_S2SW_923039db
Case setup: Completed for experiment C96_atm3DVar_extended_923039db
Case setup: Skipped for experiment C96_atm3DVar_923039db
Case setup: Skipped for experiment C96_atmaerosnowDA_923039db
Case setup: Completed for experiment C96C48_hybatmDA_923039db
Case setup: Completed for experiment C96C48_ufs_hybatmDA_923039db
Experiment C48_ATM_923039db SUCCESS on Wcoss2 at 06/26/24 12:32:19 AM
Experiment C48_S2SW_923039db SUCCESS on Wcoss2 at 06/26/24 12:36:22 AM
Experiment C96C48_hybatmDA_923039db SUCCESS on Wcoss2 at 06/26/24 02:12:23 AM
Experiment C96C48_ufs_hybatmDA_923039db SUCCESS on Wcoss2 at 06/26/24 02:12:27 AM
Experiment C96_atm3DVar_extended_923039db SUCCESS on Wcoss2 at 06/26/24 10:16:33 AM
All CI Test Cases Passed on Wcoss2:
Experiment C48_ATM_923039db *** SUCCESS *** at 06/26/24 12:32:19 AM
Experiment C48_S2SW_923039db *** SUCCESS *** at 06/26/24 12:36:22 AM
Experiment C96C48_hybatmDA_923039db *** SUCCESS *** at 06/26/24 02:12:23 AM
Experiment C96C48_ufs_hybatmDA_923039db *** SUCCESS *** at 06/26/24 02:12:27 AM
Experiment C96_atm3DVar_extended_923039db *** SUCCESS *** at 06/26/24 10:16:33 AM
FYI I'm not sure this works properly. A brand new experiment today is failing in the enkfgdascleanup job. My hypothesis is that files are being deleted by other jobs running concurrently and causing the cleanup job to fail. I fear that the find command finds all the files first, then it goes through them to check the dates, and if the file is subsequently deleted (by a successful job), the cleanup will crash. But this is purely a hypothesis.
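If the hypothesis holds, one possible mitigation is to make the find-and-remove step tolerate files that vanish mid-scan. The sketch below is only an illustration, not the code in scripts/exglobal_cleanup.sh: the function name remove_old_files and its arguments are hypothetical, and it assumes GNU find, whose -ignore_readdir_race option suppresses the error when a file disappears between being listed and being examined. It may not cover every race (for example, a whole directory tree removed while find is descending into it).

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the PR): remove regular files older than
# a given number of days under a directory, tolerating files that are
# deleted by other jobs while the scan is in progress.
remove_old_files() {
  local dir="$1"   # top-level directory to scan
  local days="$2"  # files with age strictly greater than this many days are removed

  # Nothing to do if the directory does not exist (or is already gone).
  [[ -d "${dir}" ]] || return 0

  # -ignore_readdir_race (GNU find): do not report an error when a file
  #   disappears after its name was read but before find could stat it.
  # rm -f: do not fail if the file is already gone when rm runs.
  find "${dir}" -ignore_readdir_race -type f -mtime "+${days}" -exec rm -f {} +
}
```

A call such as remove_old_files "${DATAROOT}" 3 would then scan the data root for files older than 3 days; whether that is safe while other jobs are active in the same tree is a separate question.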
hmm .. race conditions are the best.
With no other jobs running, a rocotoboot causes it to run to completion. So yeah, I am 95% sure this is the root cause of the issue. The cleanup cannot run while any other job is working in the RUNDIRS.
@CoryMartin-NOAA I will open a PR to revert this.
@RussTreadon-NOAA Would you be willing to comment out the 2 find and remove lines in scripts/exglobal_cleanup.sh in your PR #2700 and we can kick off the testing in it again? Thanks!
@aerorahul, an updated exglobal_cleanup.sh has been committed to feature/rename_atm. Done at c1ef4b30.
The CI has been kicked off in #2700 on Hera.
Description
This PR:
$DATAROOT/ every 3 days.
RUNDIR defined in config.base