NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Delay gdas cleanup until gfs cycle completes #2750

Status: Open · CatherineThomas-NOAA opened this issue 3 months ago

CatherineThomas-NOAA commented 3 months ago

What new functionality do you need?

When running cycling experiments, the long forecast (gfs) cycle uses the gdas cycle as initial conditions. If the gfs cycle fails and the experiment keeps running, the gdas cycle may be removed by the cleanup step before the gfs can be rerun. For example, when cycling with 3DVar and gfs_cyc=1, the gfsfcst can fail and the needed gdas cycle can be removed, all within a single overnight stretch.

What are the requirements for the new functionality?

Have the cleanup step check whether any cycle that depends on a directory has completed before removing it.

Acceptance Criteria

When cycling and running the long forecast, retain the needed gdas directories until the long forecast completes.

Suggest a solution (optional)

I understand that the dependencies here might be tricky, but I'm hoping to at least start a conversation about how we can add a check for this while also not letting the ROTDIR get bloated.
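As a starting point for that conversation, here is a minimal bash sketch of the kind of guard the cleanup job could apply before pruning a gdas directory. The ROTDIR/EXPDIR/PSLOT/PDY/cyc variables follow the usual workflow conventions, but the `gfsfcst` task name, the rocotostat output parsing, and the cycle mapping are illustrative assumptions, not the actual cleanup implementation.

```bash
#!/usr/bin/env bash
# Hypothetical guard for the cleanup job: keep gdas.${PDY}/${cyc} until the
# gfs forecast that needs it has succeeded.  ROTDIR, EXPDIR, PSLOT, PDY, and
# cyc follow workflow conventions; the task name "gfsfcst", the rocotostat
# output parsing, and the assumption that the dependent gfs forecast is in
# the same cycle are all simplifications for illustration.

gdas_dir="${ROTDIR}/gdas.${PDY}/${cyc}"
cycle="${PDY}${cyc}00"   # rocoto cycles are YYYYMMDDHHMM

state=$(rocotostat -d "${EXPDIR}/${PSLOT}.db" -w "${EXPDIR}/${PSLOT}.xml" \
                   -c "${cycle}" -t gfsfcst | awk '$2 == "gfsfcst" {print $4}')

if [[ "${state}" == "SUCCEEDED" ]]; then
  echo "gfsfcst for ${cycle} succeeded; removing ${gdas_dir}"
  rm -rf "${gdas_dir}"
else
  echo "gfsfcst for ${cycle} is '${state:-unknown}'; retaining ${gdas_dir}"
fi
```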

aerorahul commented 3 months ago

As you note, the dependencies are going to be difficult to code for such an edge case. Since arch.sh runs before cleanup, perhaps the ICs could be brought down from HPSS in the case where the gfs cycle has failed, rather than keeping the data on disk for longer periods (we already have disk issues causing all kinds of problems). Also, rocoto has a cycle throttle of 3, which means a failed gfs forecast knocks the cycling down to 2 allowable cycles. This slows the throughput of the experiment, forcing the user to either boot the failed cycle or take action (save the ICs for evaluation elsewhere, download them from HPSS, etc.).

The long gfs forecast (RUN=gfs) depends on ICs from RUN=gdas. What we could possibly try is: if this forecast fails, copy the ICs it needs into gfs.PDY/cyc. This would duplicate logic from forecast_postdet.sh to determine exactly which configuration is in play (IAU, warm start, cold start, coupled, uncoupled, etc.).
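For illustration only, a rough sketch of that copy might look like the following. The restart path under model_data/atmos/restart is an assumption about the ROTDIR layout, and the real file set depends on the configuration that forecast_postdet.sh already works out.

```bash
#!/usr/bin/env bash
# Hypothetical fallback: after a failed gfs forecast, stage the gdas ICs it
# needs into gfs.${PDY}/${cyc} so the gdas directory can still be cleaned up.
# The model_data/atmos/restart path is an assumption; the actual file set
# depends on the configuration (IAU, warm/cold start, coupled, ...), which is
# the logic forecast_postdet.sh contains and would have to be duplicated here.

gdas_restart="${ROTDIR}/gdas.${PDY}/${cyc}/model_data/atmos/restart"
gfs_restart="${ROTDIR}/gfs.${PDY}/${cyc}/model_data/atmos/restart"

mkdir -p "${gfs_restart}"
cp -p "${gdas_restart}"/* "${gfs_restart}/"
```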

We can think about it and discuss further, but the best/quickest workaround (IMO) is to get the ICs from HPSS, since they are already archived there in the gdasrestarta/b (I think) tarballs.
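As a sketch of that workaround, the restarts could be pulled back from HPSS with htar. The ATARDIR path and the tarball names below mirror the "gdasrestarta/b" mentioned above and are assumptions, not necessarily the exact names written by arch.sh.

```bash
#!/usr/bin/env bash
# Hypothetical recovery of gdas ICs from HPSS after they have been scrubbed
# from ROTDIR.  ATARDIR follows the workflow convention for the HPSS archive
# location; the tarball names are assumptions, per the discussion above.

cycle="${PDY}${cyc}"
hpss_dir="${ATARDIR}/${cycle}"

cd "${ROTDIR}" || exit 1

for tarball in gdasrestarta.tar gdasrestartb.tar; do
  # Verify the tarball exists on HPSS, then extract it into ROTDIR.
  htar -tf "${hpss_dir}/${tarball}" > /dev/null || continue
  htar -xvf "${hpss_dir}/${tarball}"
done
```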

CatherineThomas-NOAA commented 2 months ago

Thanks @aerorahul. Copying the ICs into the gfs directory on a failure is a good option that I hadn't considered.

As for the HPSS option, the restart a's and b's are not currently being saved properly (#2722), but @DavidHuber-NOAA is working on that (#2735). Once that PR has been merged, we should verify that we can reproduce the gfs forecast from those restarts, and then go from there.