NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Ocean analysis prep is failing #2723

Open guillaumevernieres opened 4 days ago

guillaumevernieres commented 4 days ago

What is wrong?

Wrong job dependency.

What should have happened?

The job should wait for all the necessary files to be present on disk before starting, rather than checking for just one.
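As a rough illustration of "wait for all the necessary files", here is a minimal Python sketch. It is not part of global-workflow or wxflow: the helper names `missing_inputs` and `ready_to_start` and the `restart` directory are hypothetical, and the file names are only borrowed from the listing in this issue. Note also that a file existing on disk does not prove the model has finished writing it, which is one reason a task-level dependency may be preferable.

```python
from pathlib import Path


def missing_inputs(required):
    """Return the required paths that are absent or empty on disk."""
    missing = []
    for path in map(Path, required):
        # Treat a zero-byte file as "not there yet"; this is an assumption,
        # not a guarantee that a non-empty file is completely written.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(str(path))
    return missing


def ready_to_start(required):
    """True only when every required input is present (not just one)."""
    return not missing_inputs(required)


if __name__ == "__main__":
    # Hypothetical directory; file names taken from the listing in this issue.
    restart_dir = Path("restart")
    required = [
        restart_dir / "20210802.150000.MOM.res.nc",
        restart_dir / "20210802.150000.MOM.res_1.nc",
        restart_dir / "20210802.150000.cice_model.res.nc",
    ]
    print("missing:", missing_inputs(required))
    print("ready:", ready_to_start(required))
```

The point of the sketch is only that the readiness check covers the whole input list; checking a single early file can pass while later files are still missing.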

What machines are impacted?

All or N/A

Steps to reproduce

A combination of cycling and bad luck.

Additional information

Here's the message from @JessicaMeixner-NOAA :

We had a failure in Jiande's runs:
/scratch1/NCEPDEV/climate/Jiande.Wang/working/g-w-cycle/cycle/C03/COMROOT/C03/logs/2021080218/gdasocnanalprep.log.0

The issue is:

File "/scratch1/NCEPDEV/da/python/gdasapp/wxflow/20240307/src/wxflow/fsutils.py", line 85, in cp
raise OSError(f"unable to copy {source} to {target}")
OSError: unable to copy /scratch1/NCEPDEV/climate/Jiande.Wang/working/g-w-cycle/cycle/C03/COMROOT/C03/gdas.20210802/12//model_data/ice/restart/20210802.150000.cice_model.res.nc to /scratch1/NCEPDEV/climate/Jiande.Wang/working/g-w-cycle/cycle/TMP/RUNDIRS/C03/gdasocnanal_18/Data/20210802.150000.cice_model.res.nc

Cathy's detective work suggests that this is a dependency issue (copying Cathy's chat from another thread):

ocnanalprep depends on prepoceanobs. prepoceanobs depends on the existence of the previous cycle's ocean history f009 file. Looking at the ocean history and ocean restart timestamps, it looks like the ocean history files finish writing before the restarts do.

[Catherine.Thomas@hfe07 history]$ ls -l ../restart/
total 12357700
-rw-r--r-- 1 Jiande.Wang climate 3855258159 Jun 26 01:57 20210802.150000.MOM.res.nc
-rw-r--r-- 1 Jiande.Wang climate 3892610186 Jun 26 01:57 20210802.150000.MOM.res_1.nc
-rw-r--r-- 1 Jiande.Wang climate 3905057565 Jun 26 01:57 20210802.150000.MOM.res_2.nc
-rw-r--r-- 1 Jiande.Wang climate 1001184700 Jun 26 01:47 20210802.150000.MOM.res_3.nc
[Catherine.Thomas@hfe07 history]$ ls -l
total 9210488
-rw-r--r-- 1 Jiande.Wang climate 2357833989 Jun 26 01:46 gdas.ocean.t12z.inst.f000.nc
-rw-r--r-- 1 Jiande.Wang climate 2357833989 Jun 26 01:48 gdas.ocean.t12z.inst.f003.nc
-rw-r--r-- 1 Jiande.Wang climate 2357833989 Jun 26 01:50 gdas.ocean.t12z.inst.f006.nc
-rw-r--r-- 1 Jiande.Wang climate 2357833989 Jun 26 01:51 gdas.ocean.t12z.inst.f009.nc

Do you have a proposed solution?

No response

AndrewEichmann-NOAA commented 3 days ago

Two questions:

guillaumevernieres commented 3 days ago

@AndrewEichmann-NOAA , prepoceananal needs the backgrounds and the cice restart. ~We only need to add a dependency to the cice restart.~ Add a dependency to the forecast.

AndrewEichmann-NOAA commented 3 days ago

The reasoning here (worked out in live discussion) is that prepoceanobs, the task ocnanalprep depends on, itself depends on the f009 forecast file, which is produced towards the end of fcst. The concern was that having ocnanalprep wait for fcst to complete, rather than for the presence of the files, would add a few wallclock minutes to the cycle time. But since fcst should wrap up in the time it takes prepoceanobs to complete, making fcst the dependency should be both safe and efficient.
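The timing argument can be sketched with a few lines of Python. This is only an illustration under stated assumptions: the function name `ocnanalprep_start` is hypothetical, times are minutes measured from when the f009 history file appears, fcst is assumed to finish when the last restart files are written (about 6 minutes after f009, going by the 01:51 and 01:57 timestamps in the listing above), and the prepoceanobs runtimes of 10 and 3 minutes are made up.

```python
def ocnanalprep_start(f009_written, fcst_done, prepoceanobs_runtime, wait_for_fcst):
    """Earliest time (minutes) at which ocnanalprep can start.

    prepoceanobs is triggered by the f009 file; ocnanalprep always waits for
    prepoceanobs and, optionally, for the fcst task itself to complete.
    """
    prepoceanobs_done = f009_written + prepoceanobs_runtime
    if wait_for_fcst:
        return max(prepoceanobs_done, fcst_done)
    return prepoceanobs_done


if __name__ == "__main__":
    # Assumed numbers: f009 at t=0, fcst (and its restarts) done at t=6.
    for runtime in (10, 3):  # hypothetical prepoceanobs runtimes
        file_based = ocnanalprep_start(0, 6, runtime, wait_for_fcst=False)
        task_based = ocnanalprep_start(0, 6, runtime, wait_for_fcst=True)
        print(f"prepoceanobs runtime {runtime}: "
              f"file-based start {file_based}, fcst-based start {task_based}")
```

With the longer assumed runtime both variants start at the same time, so the fcst dependency costs nothing. With the shorter one, the file-based variant would start before the restarts are on disk, which is the failure mode reported in this issue.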