NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Possible bug on GEFS fcst segment #3001

Open · weihuang-jedi opened this issue 4 days ago

weihuang-jedi commented 4 days ago

What is wrong?

When GEFS is run in segments, the forecast hours of the segments appear to overlap, as shown below:

[Wei.Huang@hfe03 2021032312]$ grep cfhour fcst_mem00*

fcst_mem000_seg0.log: 6: in wrt run, nfhour= 0.333333333333333 cfhour=000
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 6.00000000000000 cfhour=006
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 12.0000000000000 cfhour=012
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 18.0000000000000 cfhour=018
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 24.0000000000000 cfhour=024
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 30.0000000000000 cfhour=030
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 36.0000000000000 cfhour=036
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 42.0000000000000 cfhour=042
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 48.0000000000000 cfhour=048

fcst_mem000_seg1.log: 6: in wrt run, nfhour= 12.0000000000000 cfhour=012
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 24.0000000000000 cfhour=024
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 36.0000000000000 cfhour=036
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 48.0000000000000 cfhour=048
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 54.0000000000000 cfhour=054
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 60.0000000000000 cfhour=060
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 66.0000000000000 cfhour=066
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 72.0000000000000 cfhour=072
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 78.0000000000000 cfhour=078
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 84.0000000000000 cfhour=084
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 90.0000000000000 cfhour=090
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 96.0000000000000 cfhour=096
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 102.000000000000 cfhour=102
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 108.000000000000 cfhour=108
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 114.000000000000 cfhour=114
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 120.000000000000 cfhour=120

fcst_mem001_seg1.log: 6: in wrt run, nfhour= 12.0000000000000 cfhour=012
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 24.0000000000000 cfhour=024
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 36.0000000000000 cfhour=036
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 48.0000000000000 cfhour=048
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 54.0000000000000 cfhour=054
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 60.0000000000000 cfhour=060
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 66.0000000000000 cfhour=066
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 72.0000000000000 cfhour=072
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 78.0000000000000 cfhour=078
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 84.0000000000000 cfhour=084
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 90.0000000000000 cfhour=090
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 96.0000000000000 cfhour=096
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 102.000000000000 cfhour=102
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 108.000000000000 cfhour=108
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 114.000000000000 cfhour=114
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 120.000000000000 cfhour=120

For mem000, seg 0 forecasts hours 00 to 48, but seg 1 then runs from 12 to 120; shouldn't seg 1 run from 48 to 120? For mem001 and mem002, seg 0 runs from 00 to 120 and seg 1 again runs from 12 to 120, so seg 1 is not needed at all, right?
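One way to quantify the overlap directly from the logs is to compare the last forecast hour of seg 0 with the first forecast hour of seg 1. A minimal sketch, assuming the fcst_mem*_seg*.log naming shown above and run in the log directory:

for mem in 000 001 002; do
  # Last forecast hour written by segment 0 of this member ...
  end0=$(grep -o 'cfhour=[0-9]*' "fcst_mem${mem}_seg0.log" | tail -n 1 | cut -d= -f2)
  # ... and the first forecast hour written by segment 1.
  start1=$(grep -o 'cfhour=[0-9]*' "fcst_mem${mem}_seg1.log" | head -n 1 | cut -d= -f2)
  # Segment 1 should resume exactly where segment 0 ended (10# forces
  # base-10 so leading zeros are not read as octal).
  if (( 10#${start1} < 10#${end0} )); then
    echo "mem${mem}: seg0 ends at hour ${end0}, seg1 starts at hour ${start1} -> overlap"
  fi
done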

rocotostat shows that mem000 used roughly an extra third of CPU time, and mem001 and mem002 doubled their CPU cost.

What should have happened?

For all members, when the forecast is split into two segments, we expect seg 0 to forecast from hour 00 to 48 and seg 1 to forecast from hour 48 to 120.
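In other words, each segment should pick up at the previous segment's end hour. A minimal sketch of that arithmetic (the variable names here are illustrative, not the workflow's own):

fhmax=120
seg_ends=(48 ${fhmax})   # end hour of each segment
fhstart=0
seg=0
for fhend in "${seg_ends[@]}"; do
  printf 'seg %d: fcst from %03d to %03d\n' "${seg}" "${fhstart}" "${fhend}"
  fhstart=${fhend}       # the next segment resumes where this one ended
  seg=$((seg + 1))
done

This prints "seg 0: fcst from 000 to 048" and "seg 1: fcst from 048 to 120", which is the behavior we expect from the workflow.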

What machines are impacted?

All or N/A

What global-workflow hash are you using?

The test uses EPIC's fork of global-workflow, which points to the current develop branch.

Steps to reproduce

To reproduce on Hera:

  1. Compile with: build_all.sh -w
  2. Configure with:
     HPC_ACCOUNT=epic \
     pslot=c48gefs \
     RUNTESTS=/scratch1/NCEPDEV/stmp2/$USER/GEFSTESTS \
     ./workflow/create_experiment.py \
     --yaml ci/cases/pr/C48_S2SWA_gefs.yaml
  3. Start the crontab (see the sketch after this list).
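For reference, the crontab entry that drives the experiment is typically of this form (a sketch only: the c48gefs.db and c48gefs.xml names are assumed to follow the pslot, and rocotorun is assumed to be on PATH):

*/5 * * * * rocotorun -d /scratch1/NCEPDEV/stmp2/Wei.Huang/GEFSTESTS/EXPDIR/c48gefs/c48gefs.db -w /scratch1/NCEPDEV/stmp2/Wei.Huang/GEFSTESTS/EXPDIR/c48gefs/c48gefs.xml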

Additional information

COMROOT and EXPDIR are on Hera at:

[Wei.Huang@hfe03 GEFSTESTS]$ pwd
/scratch1/NCEPDEV/stmp2/Wei.Huang/GEFSTESTS
[Wei.Huang@hfe03 GEFSTESTS]$ ls -l
total 8
drwxr-sr-x 3 Wei.Huang stmp 4096 Oct 10 22:56 COMROOT
drwxr-sr-x 3 Wei.Huang stmp 4096 Oct 10 22:56 EXPDIR
[Wei.Huang@hfe03 GEFSTESTS]$ ls -l *
COMROOT:
total 4
drwxr-sr-x 4 Wei.Huang stmp 4096 Oct 10 22:57 c48gefs

EXPDIR:
total 4
drwxr-sr-x 3 Wei.Huang stmp 4096 Oct 11 14:10 c48gefs

Do you have a proposed solution?

No

WalterKolczynski-NOAA commented 3 days ago

I just checked and this is definitely working correctly for gfs atm-only. Will try again with coupled, then gefs.

WalterKolczynski-NOAA commented 3 days ago

Looks like the WW3 restart files are not being written to the correct directory. There is a restart_wave directory in $DATA that is linked to $DATA_RESTART, but the restart files are being written directly to the root of $DATA. So when waves are on, the next segment will never find the wave restart files.
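A minimal sketch of the mismatch being described, using the $DATA and $DATA_RESTART names from above (the directory values and file name are illustrative):

DATA=/tmp/fcst_run              # illustrative run directory
DATA_RESTART=/tmp/fcst_restart  # illustrative restart destination
mkdir -p "${DATA}" "${DATA_RESTART}"

# The workflow links restart_wave inside $DATA to $DATA_RESTART ...
ln -s "${DATA_RESTART}" "${DATA}/restart_wave"

# ... but WW3 writes its restart files to the root of $DATA instead:
touch "${DATA}/ww3.restart"     # illustrative restart file name

# So anything that looks for wave restarts under the link finds nothing:
ls "${DATA}/restart_wave"       # empty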

CC: @aerorahul