gdascleanup and enkfgdascleanup failures

XuanliLi-NOAA commented 2 weeks ago

What is wrong?

The gdascleanup and enkfgdascleanup jobs fail with a recent build of the global workflow. In every cycle, neither the gdas nor the enkfgdas directories are scrubbed correctly.

Here are the error messages:

gdascleanup.log:

exglobal_cleanup.sh[49]: find_exclude_string=' -name prepbufr -or -name prepbufr -or -name cnvstat -or -name prepbufr -or -name prepbufr -or -name cnvstat -or -name *atmanl.nc '
exglobal_cleanup.sh[52]: find /scratch1/NCEPDEV/stmp2/Xuanli.Li/ROTDIRS/romex_noro_newv17/gdas.20220824/18 -type f -not '(' -name 'prepbufr' -or -name 'prepbufr' -or -name 'cnvstat' -or -name 'prepbufr' -or -name 'prepbufr' -or -name 'cnvstat' -or -name '*atmanl.nc' ')' -delete
find: Failed to save initial working directory: No such file or directory
exglobal_cleanup.sh[1]: postamble exglobal_cleanup.sh 1724969492 1

enkfgdascleanup.log:

exglobal_cleanup.sh[52]: find /scratch1/NCEPDEV/stmp2/Xuanli.Li/ROTDIRS/romex_noro_newv17/enkfgdas.20220824/18 -type f -not '(' -name 'f006.ens' ')' -delete
find: Failed to save initial working directory: No such file or directory
exglobal_cleanup.sh[1]: postamble exglobal_cleanup.sh 1724972736 1

What should have happened?

The gdas and enkfgdas directories should be cleaned.

What machines are impacted?

Hera

Steps to reproduce

Run the global workflow (commit ea22a737ee9a815f1f294141abf85e0d1515868f) on Hera at C384+C192 resolution.

Additional information

N/A

Do you have a proposed solution?

No response

WalterKolczynski-NOAA commented 1 week ago

Is 20220824/18 a date that was actually run, or is it just trying to delete data from before the start of the experiment period?

XuanliLi-NOAA commented 1 week ago

Cycle 20220824/18 did actually run; I couldn't go any further because of the disk quota. I manually deleted the directories under 00, 06, and 12 to free up space, but kept mem001 in each of them in case you need to see which files were stored.

DavidHuber-NOAA commented 1 week ago

I (unintentionally) replicated this issue on WCOSS2 just now. The issue is that exglobal_cleanup.sh deletes its own working directory (${DATAROOT}/cleanup.${jobid}) when it deletes ${DATAROOT}, so the subsequent find call runs from a directory that no longer exists: https://github.com/NOAA-EMC/global-workflow/blob/2e4f4b7671cfec83331ec26de39218543d7b9a6d/scripts/exglobal_cleanup.sh#L19
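A minimal sketch of one possible mitigation, not the fix adopted upstream: step out of ${DATAROOT} before removing it, so find starts from a working directory that still exists. The layout below is assumed for illustration only; the job id 12345, the COMDIR variable, and the file names are hypothetical stand-ins for the real gdas.YYYYMMDD/HH directory and its contents. Per the logs above, leaving out the second `cd` and running find from the deleted directory produces the same "Failed to save initial working directory" error.

```shell
#!/usr/bin/env bash
# Sketch only. Hypothetical layout: the job's working dir lives inside
# DATAROOT (cleanup.${jobid}); COMDIR stands in for gdas.YYYYMMDD/HH.
set -eu

scratch=$(mktemp -d)
trap 'rm -rf "${scratch}"' EXIT

DATAROOT="${scratch}/DATAROOT"
DATA="${DATAROOT}/cleanup.12345"          # hypothetical job id
COMDIR="${scratch}/gdas.20220824/18"
mkdir -p "${DATA}" "${COMDIR}"
touch "${COMDIR}/prepbufr" "${COMDIR}/gdas.t18z.atmanl.nc" "${COMDIR}/scratch.txt"

cd "${DATA}"                              # the job runs inside DATAROOT

# Mitigation: leave DATAROOT before deleting it, so the working
# directory find records at startup still exists.
cd "${scratch}"
rm -rf "${DATAROOT}"

# Delete everything except the protected patterns, as in the log's find call.
find "${COMDIR}" -type f -not \( -name 'prepbufr' -or -name '*atmanl.nc' \) -delete

ls "${COMDIR}"
```

An alternative under the same assumptions would be to exclude the job's own cleanup.${jobid} directory from the ${DATAROOT} removal; changing directory first is simpler because it does not depend on how ${DATAROOT} is scrubbed.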
