Open weihuang-jedi opened 4 days ago
I just checked and this is definitely working correctly for gfs atm-only. Will try again with coupled, then gefs.
Looks like the WW3 restart files are not being written to the correct directory. There is a restart_wave directory in $DATA that is linked to $DATA_RESTART, but the restart files are getting written directly to the root $DATA. So when waves are on, it will never find wave restart files.
CC: @aerorahul
What is wrong?
When running GEFS in segments, the forecast hours of the segments appear to overlap, as shown below:
[Wei.Huang@hfe03 2021032312]$ grep cfhour fcst_mem00*
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 12.0000000000000 cfhour=012
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 24.0000000000000 cfhour=024
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 36.0000000000000 cfhour=036
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 48.0000000000000 cfhour=048
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 54.0000000000000 cfhour=054
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 60.0000000000000 cfhour=060
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 66.0000000000000 cfhour=066
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 72.0000000000000 cfhour=072
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 78.0000000000000 cfhour=078
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 84.0000000000000 cfhour=084
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 90.0000000000000 cfhour=090
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 96.0000000000000 cfhour=096
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 102.000000000000 cfhour=102
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 108.000000000000 cfhour=108
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 114.000000000000 cfhour=114
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 120.000000000000 cfhour=120
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 12.0000000000000 cfhour=012
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 24.0000000000000 cfhour=024
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 36.0000000000000 cfhour=036
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 48.0000000000000 cfhour=048
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 54.0000000000000 cfhour=054
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 60.0000000000000 cfhour=060
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 66.0000000000000 cfhour=066
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 72.0000000000000 cfhour=072
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 78.0000000000000 cfhour=078
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 84.0000000000000 cfhour=084
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 90.0000000000000 cfhour=090
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 96.0000000000000 cfhour=096
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 102.000000000000 cfhour=102
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 108.000000000000 cfhour=108
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 114.000000000000 cfhour=114
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 120.000000000000 cfhour=120
For mem000, seg 0 forecasts hours 00 to 48, but seg 1 then runs from 12 to 120. Shouldn't seg 1 run from 48 to 120? For mem001 and mem002, seg 0 runs from 00 to 120 and seg 1 then runs from 12 to 120, so seg 1 is not needed at all here, right?
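To make the overlap easier to spot across many members, the grep output above could be summarized per log file. The following is only a minimal sketch: the function names are hypothetical, and it assumes every relevant line has the `<logname>.log: ... cfhour=NNN` shape shown above.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   fcst_mem000_seg1.log: 6: in wrt run, nfhour= 12.0000000000000 cfhour=012
LINE_RE = re.compile(r"^(?P<log>\S+\.log):.*\bcfhour=(?P<hour>\d+)")


def hours_by_log(grep_output):
    """Collect the cfhour values written by each log file."""
    hours = defaultdict(list)
    for line in grep_output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            hours[match.group("log")].append(int(match.group("hour")))
    return dict(hours)


def summarize(grep_output):
    """Return {log name: (first output hour, last output hour)}."""
    return {log: (min(h), max(h)) for log, h in hours_by_log(grep_output).items()}


sample = """\
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 12.0000000000000 cfhour=012
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 120.000000000000 cfhour=120
"""
print(summarize(sample))  # {'fcst_mem000_seg1.log': (12, 120)}
```

A seg 1 log whose first output hour is 12 rather than something after 48 would indicate that the segment restarted from the beginning of the forecast.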
rocotostat shows this:
/apps/rocoto/1.3.7/bin/rocotostat -d c48gefs.db -w c48gefs.xml
CYCLE         TASK              JOBID   STATE      EXIT STATUS  TRIES  DURATION
202103231200  stage_ic          803100  SUCCEEDED  0            1      21.0
202103231200  wave_init         803099  SUCCEEDED  0            1      28.0
202103231200  prep_emissions    803098  SUCCEEDED  0            1      17.0
202103231200  fcst_mem000_seg0  803195  SUCCEEDED  0            1      1164.0
202103231200  fcst_mem000_seg1  803964  SUCCEEDED  0            1      2812.0
202103231200  fcst_mem001_seg0  803196  SUCCEEDED  0            1      2859.0
202103231200  fcst_mem001_seg1  805320  SUCCEEDED  0            1      2890.0
202103231200  fcst_mem002_seg0  803197  SUCCEEDED  0            1      2850.0
202103231200  fcst_mem002_seg1  805321  SUCCEEDED  0            1      2884.0
It seems mem000 used about 1/3 more CPU than necessary, and mem001 and mem002 each doubled their CPU cost.
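As a rough check of that estimate, the DURATION values from the rocotostat output can be compared against a single full-length run. This is only a sketch under two assumptions: wall-clock DURATION is treated as a proxy for CPU cost, and a complete 00 to 120 forecast is assumed to cost about as much as mem001's seg 0 (2859.0 s), since that segment reportedly ran the whole range.

```python
# DURATION values (seconds) copied from the rocotostat output.
durations = {
    "mem000": {"seg0": 1164.0, "seg1": 2812.0},
    "mem001": {"seg0": 2859.0, "seg1": 2890.0},
    "mem002": {"seg0": 2850.0, "seg1": 2884.0},
}

# Assumption: one full 00-120 forecast costs about what mem001 seg0 took.
baseline = durations["mem001"]["seg0"]

# Ratio of each member's total duration to the assumed single-run cost.
ratios = {
    member: sum(segments.values()) / baseline
    for member, segments in durations.items()
}

for member, ratio in ratios.items():
    print(f"{member}: {ratio:.2f}x the assumed single-run cost")
```

Under these assumptions mem000 comes out near 1.39x and mem001 and mem002 near 2.01x, which is consistent with the estimate above.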
What should have happened?
We expect that, for every member in a two-segment forecast, seg 0 runs from 00 to 48 and seg 1 runs from 48 to 120.
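The expected behavior can be written down as a small helper that turns a list of segment breakpoints into per-segment start and end hours. This is an illustrative sketch only: `segment_ranges` and its `breakpoints` argument are hypothetical names and do not refer to the workflow's actual configuration or code.

```python
def segment_ranges(breakpoints):
    """Return (start_hour, end_hour) for each forecast segment.

    `breakpoints` is a hypothetical list of hours, e.g. [0, 48, 120],
    where each segment starts where the previous one ended.
    """
    if len(breakpoints) < 2:
        raise ValueError("need at least a start and an end hour")
    if any(later <= earlier for earlier, later in zip(breakpoints, breakpoints[1:])):
        raise ValueError("breakpoints must be strictly increasing")
    return list(zip(breakpoints[:-1], breakpoints[1:]))


print(segment_ranges([0, 48, 120]))  # [(0, 48), (48, 120)]
```

With breakpoints of 0, 48, and 120, seg 1 would start at hour 48 rather than at the beginning of the forecast.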
What machines are impacted?
All or N/A
What global-workflow hash are you using?
The test uses EPIC's fork of global-workflow, which points to the current develop branch.
Steps to reproduce
To reproduce on Hera:
Additional information
COMROOT and EXPDIR on Hera at:
[Wei.Huang@hfe03 GEFSTESTS]$ pwd
/scratch1/NCEPDEV/stmp2/Wei.Huang/GEFSTESTS
[Wei.Huang@hfe03 GEFSTESTS]$ ls -l
total 8
drwxr-sr-x 3 Wei.Huang stmp 4096 Oct 10 22:56 COMROOT
drwxr-sr-x 3 Wei.Huang stmp 4096 Oct 10 22:56 EXPDIR
[Wei.Huang@hfe03 GEFSTESTS]$ ls -l *
COMROOT:
total 4
drwxr-sr-x 4 Wei.Huang stmp 4096 Oct 10 22:57 c48gefs

EXPDIR:
total 4
drwxr-sr-x 3 Wei.Huang stmp 4096 Oct 11 14:10 c48gefs
Do you have a proposed solution?
No