NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

The Archive job does not do online archive correctly for both GDAS and GFS cycles #2673

Closed emilyhcliu closed 1 week ago

emilyhcliu commented 3 weeks ago

What is wrong?

I am running two experiments using two different global-workflow versions:

  1. Exp1 uses hash 59cdc0ee81926ee8dc7b8e544337bfc85130ad18 (last updated on April 5, 2024)
  2. Exp2 uses hash acf3aaa2b1d3e3024b0b5d2fe23eee8c317a980b (last updated on June 6, 2024)

For both runs (Exp1 and Exp2), the pgb files were copied from RUNDIR to ROTDIR under the product directory for GDAS and GFS cycles without any problems. Exp1 did not have an online archive problem. However, the online archive job for Exp2 has missing files for both GDAS and GFS runs.

The archive job has two parts: one is the HPSS archive, and the other is the online archive. There were no problems with the HPSS archive. However, the online archive job has issues:

  1. For the GDAS cycle: gsistat and pgbanl files were not archived; other files were OK
  2. For the GFS cycle: No files were archived.

The archive job (develop version) is processed using exglobal_archive.py with arcdir.yaml as input.
There was a PR #2621 related to the archive job merged on June 1.
There was a refactoring of the arcdir.yaml.j2, which may be related to the problem with the online archive job reported in this issue.

What should have happened?

For GDAS and GFS cycles, both analysis and forecast pgb files should be archived on disk (online archive) along with gsistat files.
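A quick way to confirm which file types are missing from the online archive is to scan ARCDIR for each expected file-name prefix. The sketch below is only an illustration: the helper name `missing_prefixes` is hypothetical, and the prefixes `gsistat`, `pgbanl`, and `pgbf` are assumptions about how the archived files are named, so adjust them to match the actual ARCDIR contents.

```python
from pathlib import Path


def missing_prefixes(arcdir, prefixes):
    """Return the prefixes for which no entry in arcdir starts with that prefix.

    Hypothetical helper, not part of global-workflow.
    """
    root = Path(arcdir)
    # Treat a missing directory the same as an empty one.
    names = [p.name for p in root.iterdir()] if root.is_dir() else []
    return [pre for pre in prefixes if not any(n.startswith(pre) for n in names)]


# Assumed prefixes for the files this issue expects in the online archive.
EXPECTED_PREFIXES = ["gsistat", "pgbanl", "pgbf"]
```

Calling `missing_prefixes(arcdir, EXPECTED_PREFIXES)` with the experiment's ARCDIR returns the prefixes that have no matching file, which would be expected to include `gsistat` and `pgbanl` for the GDAS symptom described above.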

What machines are impacted?

All or N/A

Steps to reproduce

  1. Check out the latest global-workflow from develop
  2. Configure to run both GDAS and GFS for one cycle.

Additional information

My Exp2 run:

HOMEgfs: /scratch1/NCEPDEV/da/Emily.Liu/git/Global-Workflow/global-workflow-thompson-enkffix
EXPDIR: /scratch1/NCEPDEV/da/Emily.Liu/para/v17/v17allskyens
ROTDIR: /scratch2/NCEPDEV/stmp3/Emily.Liu/ROTDIRS/v17allskyens
ARCDIR: /scratch1/NCEPDEV/da/Emily.Liu/archive/v17allskyens

Related log files:
/scratch2/NCEPDEV/stmp3/Emily.Liu/ROTDIRS/v17allskyens/logs/2023040300/gdasarch.log
/scratch2/NCEPDEV/stmp3/Emily.Liu/ROTDIRS/v17allskyens/logs/2023040300/gfsarch.log

Do you have a proposed solution?

Debug exglobal_archive.py and its related scripts and yaml files (e.g. arcdir.yaml.j2)

emilyhcliu commented 3 weeks ago

Tagging @azadeh-gh for awareness.

DavidHuber-NOAA commented 3 weeks ago

Thanks for letting me know about this, @emilyhcliu. I will take a look today and see what's going on.

emilyhcliu commented 3 weeks ago

@DavidHuber-NOAA Do you have a timeline for fixing the online archive issue? We have three experiments running with the latest global workflow, which includes the archive refactoring work merged on June 1. Knowing the timeline for fixing the issue will help us decide whether to wait for the fix or rebuild the global workflow with an earlier version (before June 1) for the experiments.
Thanks!

DavidHuber-NOAA commented 2 weeks ago

@emilyhcliu I expect to have a fix in by mid next week at the latest. I did some exploratory work yesterday and have an idea of the root cause, but there's still some more debugging work to do.