NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

The enkfgdaseobs job can fail to collect all necessary data #2092

Open DavidHuber-NOAA opened 7 months ago

DavidHuber-NOAA commented 7 months ago

What is wrong?

If the enkfgdaseobs job is run with more processors than (MPI tasks) x (threads), some observation data will be left on the floor, resulting in an incomplete analysis. Kludges have been placed for S4 and Jet, but new systems with different core/node counts will need similar kludges.
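The mismatch can be sketched with some hypothetical resource arithmetic (the variable names and values below are illustrative, not taken from config.resources): when a node's core count is not evenly consumed by (MPI tasks) x (threads), the job is allocated more cores than it uses.

```shell
#!/usr/bin/env bash
# Illustrative only: hypothetical eobs resource settings showing how
# allocated cores can exceed (MPI tasks) x (threads).
npe_eobs=16        # MPI tasks requested (hypothetical)
nth_eobs=2         # threads per task (hypothetical)
cores_per_node=40  # cores on one node of a hypothetical system

tasks_per_node=$(( cores_per_node / nth_eobs ))                 # 20
nodes=$(( (npe_eobs + tasks_per_node - 1) / tasks_per_node ))   # 1 (round up)
allocated_cores=$(( nodes * cores_per_node ))                   # 40
used_cores=$(( npe_eobs * nth_eobs ))                           # 32

# The 8-core gap is where per-PE output can be missed by a loop
# that assumes allocated cores == used cores.
echo "allocated=${allocated_cores} used=${used_cores}"
```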

What should have happened?

The enkfgdaseobs job should be able to collect all necessary data regardless of how many cores are used.

What machines are impacted?

All or N/A

Steps to reproduce

  1. Set up a cycled experiment and modify config.resources to use a different number of PEs for the eobs job
  2. Run the enkfgdaseobs job and plot the resulting ingested data points

An example pair of plots from @CoryMartin-NOAA is below:

[Image: "MissedData" — a pair of plots comparing the ingested data points]

Additional information

This was first captured in #154.

Do you have a proposed solution?

I'm not sure whether this requires a scripting change in the global-workflow or a code change in the GSI. Once it is fixed, the config.resources file should be simplified to use the same number of processes across all systems.

DavidHuber-NOAA commented 6 days ago

I believe that the problematic code is located here: https://github.com/NOAA-EMC/global-workflow/blob/de8706702ead0630beb54d868f83aa2cb23f8f79/scripts/exglobal_atmos_analysis.sh#L576-L593

Looping from 0 to npe_gsi - 1 will not create all of the necessary links if npe_gsi does not equal ncpus = npe_node * nnodes. To fix this, the loop should instead run from 0 to npe_node * nnodes - 1.
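A minimal sketch of the proposed change (the link commands and variable values are stand-ins, not quoted from exglobal_atmos_analysis.sh):

```shell
#!/usr/bin/env bash
# Hypothetical settings: 4 MPI tasks on a single 8-core node.
npe_gsi=4
npe_node=8
nnodes=1
ncpus=$(( npe_node * nnodes ))

# Broken: links are created only for the first npe_gsi PEs, even though
# output files may exist for every core up to ncpus - 1.
broken_links=0
for (( pe = 0; pe < npe_gsi; pe++ )); do
    broken_links=$(( broken_links + 1 ))  # stand-in for: ln -sf pe${pe}.* ...
done

# Fixed: loop over every core on the allocated nodes so no per-PE
# output file is left without a link.
fixed_links=0
for (( pe = 0; pe < ncpus; pe++ )); do
    fixed_links=$(( fixed_links + 1 ))    # stand-in for: ln -sf pe${pe}.* ...
done

echo "broken=${broken_links} fixed=${fixed_links}"
```

With these example values the broken loop creates 4 links while the fixed loop creates 8, covering every core that could have produced output.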