NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Possible bug on GEFS fcst segment #3001

Open · weihuang-jedi opened this issue 4 days ago

weihuang-jedi commented 4 days ago

What is wrong?

When GEFS is run in segments, the forecast hours of the segments appear to overlap, as shown below:

[Wei.Huang@hfe03 2021032312]$ grep cfhour fcst_mem00*

fcst_mem000_seg0.log: 6: in wrt run, nfhour= 0.333333333333333 cfhour=000
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 6.00000000000000 cfhour=006
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 12.0000000000000 cfhour=012
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 18.0000000000000 cfhour=018
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 24.0000000000000 cfhour=024
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 30.0000000000000 cfhour=030
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 36.0000000000000 cfhour=036
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 42.0000000000000 cfhour=042
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 48.0000000000000 cfhour=048

fcst_mem000_seg1.log: 6: in wrt run, nfhour= 12.0000000000000 cfhour=012
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 24.0000000000000 cfhour=024
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 36.0000000000000 cfhour=036
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 48.0000000000000 cfhour=048
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 54.0000000000000 cfhour=054
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 60.0000000000000 cfhour=060
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 66.0000000000000 cfhour=066
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 72.0000000000000 cfhour=072
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 78.0000000000000 cfhour=078
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 84.0000000000000 cfhour=084
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 90.0000000000000 cfhour=090
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 96.0000000000000 cfhour=096
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 102.000000000000 cfhour=102
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 108.000000000000 cfhour=108
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 114.000000000000 cfhour=114
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 120.000000000000 cfhour=120

fcst_mem001_seg1.log: 6: in wrt run, nfhour= 12.0000000000000 cfhour=012
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 24.0000000000000 cfhour=024
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 36.0000000000000 cfhour=036
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 48.0000000000000 cfhour=048
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 54.0000000000000 cfhour=054
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 60.0000000000000 cfhour=060
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 66.0000000000000 cfhour=066
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 72.0000000000000 cfhour=072
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 78.0000000000000 cfhour=078
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 84.0000000000000 cfhour=084
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 90.0000000000000 cfhour=090
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 96.0000000000000 cfhour=096
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 102.000000000000 cfhour=102
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 108.000000000000 cfhour=108
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 114.000000000000 cfhour=114
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 120.000000000000 cfhour=120

For mem000, seg 0 forecasts hours 00 to 48, but seg 1 then runs from 12 to 120; shouldn't seg 1 run from 48 to 120? For mem001 and mem002, seg 0 runs from 00 to 120 and seg 1 again runs from 12 to 120, so seg 1 is not needed at all, right?
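One way to quantify the overlap directly from the logs is to compare the last forecast hour of seg 0 with the first forecast hour of seg 1. A minimal sketch, assuming the fcst_mem*_seg*.log naming shown above and run in the log directory:

for mem in 000 001 002; do
  # Last forecast hour written by segment 0 of this member ...
  end0=$(grep -o 'cfhour=[0-9]*' "fcst_mem${mem}_seg0.log" | tail -n 1 | cut -d= -f2)
  # ... and the first forecast hour written by segment 1.
  start1=$(grep -o 'cfhour=[0-9]*' "fcst_mem${mem}_seg1.log" | head -n 1 | cut -d= -f2)
  # Segment 1 should resume exactly where segment 0 ended (10# forces
  # base-10 so leading zeros are not read as octal).
  if (( 10#${start1} < 10#${end0} )); then
    echo "mem${mem}: seg0 ends at hour ${end0}, seg1 starts at hour ${start1} -> overlap"
  fi
done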

rocotostat shows that mem000 used roughly an extra third of CPU time, and mem001 and mem002 doubled their CPU cost.

What should have happened?

For all members, when the forecast is split into two segments, we expect seg 0 to forecast from hour 00 to 48 and seg 1 to forecast from hour 48 to 120.
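In other words, each segment should pick up at the previous segment's end hour. A minimal sketch of that arithmetic (the variable names here are illustrative, not the workflow's own):

fhmax=120
seg_ends=(48 ${fhmax})   # end hour of each segment
fhstart=0
seg=0
for fhend in "${seg_ends[@]}"; do
  printf 'seg %d: fcst from %03d to %03d\n' "${seg}" "${fhstart}" "${fhend}"
  fhstart=${fhend}       # the next segment resumes where this one ended
  seg=$((seg + 1))
done

This prints "seg 0: fcst from 000 to 048" and "seg 1: fcst from 048 to 120", which is the behavior we expect from the workflow.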

What machines are impacted?

All or N/A

What global-workflow hash are you using?

The test uses EPIC's fork of global-workflow, which points to the current develop branch.

Steps to reproduce

To reproduce on Hera:

  1. Compile with: build_all.sh -w
  2. Configure with:
     HPC_ACCOUNT=epic \
     pslot=c48gefs \
     RUNTESTS=/scratch1/NCEPDEV/stmp2/$USER/GEFSTESTS \
     ./workflow/create_experiment.py \
     --yaml ci/cases/pr/C48_S2SWA_gefs.yaml
  3. Start the crontab (see the sketch after this list).
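For reference, the crontab entry that drives the experiment is typically of this form (a sketch only: the c48gefs.db and c48gefs.xml names are assumed to follow the pslot, and rocotorun is assumed to be on PATH):

*/5 * * * * rocotorun -d /scratch1/NCEPDEV/stmp2/Wei.Huang/GEFSTESTS/EXPDIR/c48gefs/c48gefs.db -w /scratch1/NCEPDEV/stmp2/Wei.Huang/GEFSTESTS/EXPDIR/c48gefs/c48gefs.xml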

Additional information

COMROOT and EXPDIR are on Hera at:

[Wei.Huang@hfe03 GEFSTESTS]$ pwd
/scratch1/NCEPDEV/stmp2/Wei.Huang/GEFSTESTS
[Wei.Huang@hfe03 GEFSTESTS]$ ls -l
total 8
drwxr-sr-x 3 Wei.Huang stmp 4096 Oct 10 22:56 COMROOT
drwxr-sr-x 3 Wei.Huang stmp 4096 Oct 10 22:56 EXPDIR
[Wei.Huang@hfe03 GEFSTESTS]$ ls -l *
COMROOT:
total 4
drwxr-sr-x 4 Wei.Huang stmp 4096 Oct 10 22:57 c48gefs

EXPDIR:
total 4
drwxr-sr-x 3 Wei.Huang stmp 4096 Oct 11 14:10 c48gefs

Do you have a proposed solution?

No

WalterKolczynski-NOAA commented 3 days ago

I just checked and this is definitely working correctly for gfs atm-only. Will try again with coupled, then gefs.

WalterKolczynski-NOAA commented 3 days ago

Looks like the WW3 restart files are not being written to the correct directory. There is a restart_wave directory in $DATA that is linked to $DATA_RESTART, but the restart files are being written directly to the root of $DATA. So when waves are on, the next segment will never find the wave restart files.
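A minimal sketch of the mismatch being described, using the $DATA and $DATA_RESTART names from above (the directory values and file name are illustrative):

DATA=/tmp/fcst_run              # illustrative run directory
DATA_RESTART=/tmp/fcst_restart  # illustrative restart destination
mkdir -p "${DATA}" "${DATA_RESTART}"

# The workflow links restart_wave inside $DATA to $DATA_RESTART ...
ln -s "${DATA_RESTART}" "${DATA}/restart_wave"

# ... but WW3 writes its restart files to the root of $DATA instead:
touch "${DATA}/ww3.restart"     # illustrative restart file name

# So anything that looks for wave restarts under the link finds nothing:
ls "${DATA}/restart_wave"       # empty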

CC: @aerorahul