NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Port global-workflow develop branch to WCOSS2 #419

Closed KateFriedman-NOAA closed 1 year ago

KateFriedman-NOAA commented 3 years ago

Port the global-workflow develop branch (~early GFSv17) to the new WCOSS2 production machines (Cactus/Dogwood).

Overarching epic issue #398.

Companion issue #399 for operations branch.

KateFriedman-NOAA commented 3 years ago

Branch for this work: feature/dev-wcoss2

KateFriedman-NOAA commented 3 years ago

Sync merged with develop branch @ 7233d0c4

KateFriedman-NOAA commented 3 years ago

From @DusanJovic-NOAA:

I have this branch based on latest develop:

https://github.com/DusanJovic-NOAA/ufs-weather-model/tree/acorn_rt

but it only works on Acorn, due to missing libraries on Cactus/Dogwood.

Will wait for missing libraries to be resolved and then try building/testing from workflow.

KateFriedman-NOAA commented 2 years ago

@JiayiPeng-NOAA Is the global_extrkr.sh script being used anywhere for the tracker? I'm not seeing this ush script being used in operations (I checked logs from yesterday) nor in my own testing. I'm also not seeing it being invoked in any scripts in global-workflow. I'm wondering if I can remove this script from global-workflow. Thanks!

JiayiPeng-NOAA commented 2 years ago

Hi Kate, No one is using "global_extrkr.sh" for TC tracking now. You can remove it. Thanks, Jiayi

KateFriedman-NOAA commented 2 years ago

Great, thanks for confirming @JiayiPeng-NOAA !

KateFriedman-NOAA commented 2 years ago

@junwang-noaa @GeorgeVandenberghe-NOAA I have been testing global-workflow develop on WCOSS2 and want to let you know my observations regarding resources for different resolutions (both deterministic and ensemble forecast jobs). I will go over issues revolving around C384 first.

C384

  1. Current resources in global-workflow develop fail for gdasfcst and gdasefcs jobs:
    export layout_x=6
    export layout_y=8
    export nth_fv3=2
    export WRITE_GROUP=1
    export WRTTASK_PER_GROUP=64

    (Note: WRTTASK_PER_GROUP was the max # of cores/node on the R&Ds (40), but since that would be 128 on WCOSS2 I changed it to 64 on that machine, which is what we use in GFSv16 ops. I left the other develop branch C384 resource values as-is.)

    <nodes>6:ppn=64:tpp=2</nodes>
    mpiexec -l -n 352 -ppn 64 --cpu-bind depth --depth 2

log (Cactus): /lfs/h2/emc/ptmp/kate.friedman/comrot/devcyc384a/logs/2022010400/gdasfcst.log.2

  2. If I run with the C384 values from GFSv16 ops, it works:
    export layout_x=8
    export layout_y=8
    export nth_fv3=1
    export WRITE_GROUP=2
    export WRTTASK_PER_GROUP=64

    ...although (as reported in an email) I get occasional hangs and/or slowness in the ensemble forecast jobs in a C768C384L127 real-time run I have going on. Reruns of those failed jobs always succeed and do not hang. Seems like a system issue?

    <nodes>4:ppn=128:tpp=1</nodes>
    mpiexec -l -n 512 -ppn 128 --cpu-bind depth --depth 1
  3. If I adjust resources one at a time in an attempt to keep the current develop branch resources, the C384 forecast jobs continue to fail (hang). Here is a listing of the output when I try <nodes>4:ppn=128:tpp=1</nodes> (same as ops, but layout_x=6 instead of layout_x=8 like in ops) and it hangs:
    -rw-r--r--   1 kate.friedman emc      705380992 Oct  3 15:24 gdas.t00z.master.grb2f000
    -rw-r--r--   1 kate.friedman emc       36148790 Oct  3 15:24 gdas.t00z.sfluxgrbf000.grib2
    -rw-r--r--   1 kate.friedman emc     1863550427 Oct  3 15:24 gdas.t00z.atmf000.nc
    -rw-r--r--   1 kate.friedman emc      330393546 Oct  3 15:24 gdas.t00z.sfcf000.nc
    -rw-r--r--   1 kate.friedman emc             71 Oct  3 15:24 gdas.t00z.logf000.txt
    -rw-r--r--   1 kate.friedman emc      738503203 Oct  3 15:26 gdas.t00z.master.grb2f001
    -rw-r--r--   1 kate.friedman emc       80103749 Oct  3 15:26 gdas.t00z.sfluxgrbf001.grib2
    -rw-r--r--   1 kate.friedman emc       53204471 Oct  3 15:26 gdas.t00z.atmf001.nc
    -rw-r--r--   1 kate.friedman emc      738334513 Oct  3 15:28 gdas.t00z.master.grb2f002
    -rw-r--r--   1 kate.friedman emc       81136189 Oct  3 15:28 gdas.t00z.sfluxgrbf002.grib2
    -rw-r--r--   1 kate.friedman emc     1852755646 Oct  3 15:28 gdas.t00z.atmf002.nc
    -rw-r--r--   1 kate.friedman emc      326373199 Oct  3 15:29 gdas.t00z.sfcf002.nc
    -rw-r--r--   1 kate.friedman emc             71 Oct  3 15:29 gdas.t00z.logf002.txt
    -rw-r--r--   1 kate.friedman emc      738192427 Oct  3 15:33 gdas.t00z.master.grb2f004
    -rw-r--r--   1 kate.friedman emc       83098082 Oct  3 15:33 gdas.t00z.sfluxgrbf004.grib2
    -rw-r--r--   1 kate.friedman emc     1825697152 Oct  3 15:33 gdas.t00z.atmf004.nc
    -rw-r--r--   1 kate.friedman emc      332614803 Oct  3 15:33 gdas.t00z.sfcf004.nc
    -rw-r--r--   1 kate.friedman emc             71 Oct  3 15:33 gdas.t00z.logf004.txt

    ^ Note: some files (e.g. sfcf001 and all f003 files) are missing!

It seems that using the current GFSv16 C384 resources for the GFSv17 jobs gives both the deterministic gdasfcst and enkf gdasefcs jobs enough resources on WCOSS2, even though it's fewer nodes (4) than the default values (6) provide. It's more tasks (512 vs. 352) using the GFSv16 ops values.
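
As a sanity check (simple arithmetic, assuming the usual FV3 layout of six cubed-sphere tiles plus the write-component tasks), the rank counts in the two mpiexec lines above line up with the resource variables:

    # compute ranks = layout_x * layout_y * 6 tiles; write ranks = WRITE_GROUP * WRTTASK_PER_GROUP
    echo $(( 6 * 8 * 6 + 1 * 64 ))   # 352 -> develop defaults, matches "mpiexec -n 352"
    echo $(( 8 * 8 * 6 + 2 * 64 ))   # 512 -> GFSv16 ops values, matches "mpiexec -n 512"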

Other resolutions

The C96 (only tested for the ensemble), C192, and C768 resources, as well as the C384 gfsfcst resources, in global-workflow develop all work fine on WCOSS2:

C768 gdasfcst:

<nodes>22:ppn=32:tpp=4</nodes>
mpiexec -l -n 704 -ppn 32 --cpu-bind depth --depth 4

C768 gfsfcst:

<nodes>44:ppn=32:tpp=4</nodes>
mpiexec -l -n 1408 -ppn 32 --cpu-bind depth --depth 4

C384 gfsfcst:

<nodes>11:ppn=64:tpp=2</nodes>
mpiexec -l -n 704 -ppn 64 --cpu-bind depth --depth 2

C192 gdasfcst:

<nodes>4:ppn=64:tpp=2</nodes>
mpiexec -l -n 208 -ppn 64 --cpu-bind depth --depth 2

C192 gfsfcst:

<nodes>5:ppn=64:tpp=2</nodes>
mpiexec -l -n 272 -ppn 64 --cpu-bind depth --depth 2

C96 gdasefcs:

<nodes>2:ppn=128:tpp=1</nodes>
mpiexec -l -n 208 -ppn 128 --cpu-bind depth --depth 1

Note: the above work but are not yet optimized for GFSv17.
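
A quick cross-check of the pairs above (simple arithmetic, not taken from the workflow itself): nodes x ppn gives the available ranks for each job and tpp matches the --depth value, so each select string accommodates the corresponding mpiexec rank count:

    # nodes * ppn >= mpiexec -n; tpp == --depth
    echo $(( 22 * 32 ))   # 704  >= 704  (C768 gdasfcst)
    echo $(( 44 * 32 ))   # 1408 >= 1408 (C768 gfsfcst)
    echo $(( 11 * 64 ))   # 704  >= 704  (C384 gfsfcst)
    echo $((  4 * 64 ))   # 256  >= 208  (C192 gdasfcst)
    echo $((  5 * 64 ))   # 320  >= 272  (C192 gfsfcst)
    echo $((  2 * 128 ))  # 256  >= 208  (C96 gdasefcs)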

GeorgeVandenberghe-NOAA commented 2 years ago

I have run C768L128 as Kate described but with production ATM resources. From this

C768 gfsfcst:

44:ppn=32:tpp=4 mpiexec -l -n 1408 -ppn 32 --cpu-bind depth --depth 4

I changed to

150:ppn=24:tpp=5 mpiexec -l -n 3456 -ppn 5 --cpu-bind depth --depth 5

This so far runs robustly in about 8100 wallclock seconds.

Runs using Kate's original resources were also robust, except that I changed PPN to 24 and tpp to 5 (inherited from my prod script).

IT SHOULD BE NOTED RUNS USING A SINGLE THREAD AND DEPTH 1 are NOT ROBUST, failing numerous days into the run with thread memory allocation errors.

KateFriedman-NOAA commented 2 years ago

150:ppn=24:tpp=5

mpiexec -l -n 3456 -ppn 5 --cpu-bind depth --depth 5

Question for both @junwang-noaa and @GeorgeVandenberghe-NOAA: For developer runs, do we want to use the larger number of nodes for C768 on WCOSS2 now or stick with the current settings for now? The current settings are what are set for the R&Ds for that resolution right now. Many developers will likely run shorter forecasts, so fewer resources would be ok for now. We definitely need to use more nodes and optimize ahead of the hand-off of v17.

This so far runs robustly in about 8100 wallclock seconds.

@GeorgeVandenberghe-NOAA What forecast length is that timing? 384hrs?

The 384hr gfsfcst jobs in my real-time run (<nodes>44:ppn=32:tpp=4</nodes>, mpiexec -l -n 1408 -ppn 32 --cpu-bind depth --depth 4) take ~19,000s. See column B in the "gfs" sheet in this document for my timings:

https://docs.google.com/spreadsheets/d/1bc0pLToSFGmiFTfPIS-w3_-jxOrBvKlonEJUDvchShE/edit#gid=2127680143
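
These wallclock numbers translate roughly as follows (simple arithmetic, assuming both timings cover a 384-hour, i.e. 16-day, forecast):

    # seconds / 16 days / 60 -> minutes per forecast day
    echo "scale=1; 19000 / 16 / 60" | bc   # ~19.7 min/day for the 44-node configuration
    echo "scale=1;  8100 / 16 / 60" | bc   # ~8.4 min/day for the 150-node configuration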

KateFriedman-NOAA commented 2 years ago

@junwang-noaa @GeorgeVandenberghe-NOAA Another thing I didn't mention above...I've found that the C192 enkf forecast jobs fail on WCOSS2 with 2 threads but work with 1 thread. The deterministic C192 gdas[gfs]fcst jobs run fine with 2 threads.

I get the "longjmp causes uninitialized stack frame" error when I try to run C192 enkf forecast (gdasefcs) jobs on WCOSS2 with 2 threads and 64 ppn (current develop settings):

fails: mpiexec -l -n 208 -ppn 64 --cpu-bind depth --depth 2, <nodes>4:ppn=64:tpp=2</nodes>
works: mpiexec -l -n 208 -ppn 128 --cpu-bind depth --depth 1, <nodes>2:ppn=128:tpp=1</nodes>

log (2 threads - fails): /lfs/h2/emc/global/noscrub/kate.friedman/expdir/devcyc384a/SAVE_LOGS/2022010118/gdasefcs01.log.0
log (1 thread - works): /lfs/h2/emc/global/noscrub/kate.friedman/expdir/devcyc384a/SAVE_LOGS/2022010118/gdasefcs01.log
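
For what it's worth, neither layout oversubscribes its nodes (back-of-the-envelope accounting, assuming 128 cores per WCOSS2 node as noted above), which suggests the failure is specific to running the efcs executable with OpenMP threads rather than a core shortfall:

    # ranks * threads vs. nodes * 128 cores
    echo $(( 208 * 2 )) $(( 4 * 128 ))   # 416 <= 512 (2 threads, fails)
    echo $(( 208 * 1 )) $(( 2 * 128 ))   # 208 <= 256 (1 thread, works)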

junwang-noaa commented 2 years ago

@KateFriedman-NOAA I'd suggest staying with current number of tasks (not using the large number of nodes).

SMoorthi-emc commented 2 years ago

Please do not use more resources than needed to run the model at about 8 min/day. Making C768 run too fast will be a problem as we increase resolution; we do not want to shrink the operational window.

KateFriedman-NOAA commented 2 years ago

I'd suggest staying with current number of tasks (not using the large number of nodes).

Noted, will stay with current # of tasks. Thanks Jun!

Please do not use more resources than needed to run the model at about 8min/day.

Definitely not looking to speed up past the 8min/day, so no worries @SMoorthi-emc ! Staying with fewer nodes until we optimize for ops. :)

GeorgeVandenberghe-NOAA commented 2 years ago

I only did this to replicate the prod environment.

KateFriedman-NOAA commented 2 years ago

@junwang-noaa @GeorgeVandenberghe-NOAA Another C384 issue on WCOSS2...this time revolving around parallel netcdf. C384 fails with parallel netcdf on WCOSS2. Should we run the deterministic C384 with serial netcdf like we run the ensemble C384 forecast jobs in GFSv16 ops?

This was observed when porting GFSv16 ops to WCOSS2 earlier this year. The C384 enkf forecast jobs in GFSv16 would sometimes fail at parallel netcdf with the HDF5 error or succeed but produce corrupted output and fail in the epos jobs. The workaround for ops was to run the C384 enkf forecast jobs with serial netcdf (adjusting the job resources to speed it up to fit in the ops timing window). Another solution was to run with parallel netcdf and zero chunking...but downstream models reading GFS output generated with zero chunking were slowed down significantly so this solution was backed out in favor of running the enkf C384 forecast jobs with serial netcdf.

For running GFSv17 on WCOSS2 I set the C384 enkf forecast jobs (in a C768C384L127 test) to use serial netcdf from the start.

When testing the C384C192L127 resolution combo, the C384 deterministic fcst jobs also failed with parallel netcdf but succeeded upon a rerun with serial netcdf (no other changes). C768 deterministic jobs have not encountered issues running parallel netcdf (in either GFSv16 or GFSv17). All lower resolutions (C192, C96, etc.) run successfully with serial netcdf (as is set in develop config.fv3 currently).

The error in the log:

nid001981.cactus.wcoss2.ncep.noaa.gov 384: ADIOI_CRAY_WRITECONTIG(243): filename='atmf000.nc'  error='Bad address'  errno=14  PE=00384  W_rec=00552  off=0032105946  len=0000020612  See MPICH_MPIIO_ABORT_ON_RW_ERROR.
nid001981.cactus.wcoss2.ncep.noaa.gov 384:  file: module_write_netcdf.F90 line:          450 NetCDF: HDF error
nid001981.cactus.wcoss2.ncep.noaa.gov 384: MPICH Notice [Rank 384] [job id 0f16ae36-bba6-4b7a-9c71-283375164aef] [Wed Sep 28 13:51:45 2022] [nid001981] - nid001981.cactus.wcoss2.ncep.noaa.gov 384: Abort(1) (rank 384 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 384

log (failed): /lfs/h2/emc/global/noscrub/kate.friedman/expdir/devcyc384a/SAVE_LOGS/2022010118/gdasfcst.log.0
log (succeeded): /lfs/h2/emc/global/noscrub/kate.friedman/expdir/devcyc384a/SAVE_LOGS/2022010118/gdasfcst.log

kate.friedman@clogin05> grep OUTPUT_FILETYPE SAVE_LOGS/logs/2022010118/gdasfcst.log.0
+++ config.fv3[179]: export OUTPUT_FILETYPE_ATM=netcdf_parallel
+++ config.fv3[179]: OUTPUT_FILETYPE_ATM=netcdf_parallel
+++ config.fv3[180]: export OUTPUT_FILETYPE_SFC=netcdf_parallel
+++ config.fv3[180]: OUTPUT_FILETYPE_SFC=netcdf_parallel
kate.friedman@clogin05> grep OUTPUT_FILETYPE SAVE_LOGS/logs/2022010118/gdasfcst.log
+++ config.fv3[178]: export OUTPUT_FILETYPE_ATM=netcdf
+++ config.fv3[178]: OUTPUT_FILETYPE_ATM=netcdf
+++ config.fv3[179]: export OUTPUT_FILETYPE_SFC=netcdf
+++ config.fv3[179]: OUTPUT_FILETYPE_SFC=netcdf

I reattempted to run the C384 deterministic forecast jobs with parallel netcdf in my test's next cycle (first full cycle). The gfsfcst job (using parallel netcdf) failed twice. Same error:

nid001681.cactus.wcoss2.ncep.noaa.gov 576:  actual    inline post Time is    7.56241 at Fcst   00:00
 ichunk2d,jchunk2d        1536         768
 ichunk3d,jchunk3d,kchunk3d        1536         768           1
 in wrt run,filename=            1 atmf000.nc
nid001681.cactus.wcoss2.ncep.noaa.gov 576: ADIOI_CRAY_WRITECONTIG(243): filename='atmf000.nc'  error='Bad address'  errno=14  PE=00576  W_rec=00552  off=0032081514  len=0000020612  See MPICH_MPIIO_ABORT_ON_RW_ERROR.
nid001681.cactus.wcoss2.ncep.noaa.gov 576:  file: module_write_netcdf.F90 line:          450 NetCDF: HDF error
nid001681.cactus.wcoss2.ncep.noaa.gov 576: MPICH Notice [Rank 576] [job id 903fbdee-e6f4-4ad5-b326-b3d4081ca701] [Wed Sep 28 16:21:24 2022] [nid001681] - Abort(1) (rank 576 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 576

nid001564.cactus.wcoss2.ncep.noaa.gov: rank 266 exited with code 1
nid001672.cactus.wcoss2.ncep.noaa.gov 512: forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
ufs_model.x        000000000524F73B  Unknown               Unknown  Unknown

Logs: /lfs/h2/emc/global/noscrub/kate.friedman/expdir/devcyc384a/SAVE_LOGS/2022010200/gfsfcst.log.2 /lfs/h2/emc/global/noscrub/kate.friedman/expdir/devcyc384a/SAVE_LOGS/2022010200/gfsfcst.log.1

I resubmitted the gfsfcst job again, still with parallel netcdf but now also setting the following in the fcst block of env/WCOSS2.env:

    export MPICH_MPIIO_HINTS="*:romio_cb_write=disable"
    export FI_OFI_RXM_SAR_LIMIT=3145728

The gfsfcst job made it past the prior points of failure and completed successfully using parallel netcdf. I have let the deterministic C384 forecast jobs in my C384C192L127 test continue using these settings via WCOSS2.env to see if it continued working in other instances (it has so far):

elif [ $step = "fcst" ]; then

    export OMP_PLACES=cores
    export OMP_STACKSIZE=2048M
    if [ $CASE = "C384" ]; then
      export MPICH_MPIIO_HINTS="*:romio_cb_write=disable"
      export FI_OFI_RXM_SAR_LIMIT=3145728
    fi
    export FI_OFI_RXM_RX_SIZE=40000
    export FI_OFI_RXM_TX_SIZE=40000

Those settings do not work with C768 forecast jobs on WCOSS2, however, so I have not committed them to the workflow branch yet.

The MPICH_MPIIO_HINTS and FI_OFI_RXM_SAR_LIMIT values were copied from the efcs block of env/WCOSS2.env and were found previously to be useful with the C384 enkf forecast (efcs) jobs during the GFSv16 port to WCOSS2. The following runtime settings are already set and used for both the fcst and efcs jobs:

    export FI_OFI_RXM_RX_SIZE=40000
    export FI_OFI_RXM_TX_SIZE=40000

The quick and easy solution is to run deterministic C384 forecast jobs with serial netcdf on WCOSS2, similar to what we do in GFSv16 ops enkf forecast jobs currently. I'm interested to see if the newer HDF5 library version helps with this.
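
If we adopt that workaround, a minimal sketch of what the override might look like (hypothetical placement and guard; OUTPUT_FILETYPE_ATM/SFC and CASE are the variables shown in the logs above, but $machine and the exact location in config.fv3 or the WCOSS2-specific settings are assumptions):

    # Hypothetical sketch only: force serial netcdf for C384 forecasts on WCOSS2
    if [ $machine = "WCOSS2" ] && [ $CASE = "C384" ]; then
      export OUTPUT_FILETYPE_ATM=netcdf   # instead of netcdf_parallel
      export OUTPUT_FILETYPE_SFC=netcdf
    fi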

GeorgeVandenberghe-NOAA commented 2 years ago

We should wait until the admins build the new HDF5/1.12.2 library on WCOSS2. Pending that, we have a known issue with HDF5/1.10.6 and C384 parallel netcdf and should just stick with serial netcdf without compression. I haven't checked, but I suspect compression would slow it down enough that we'd need too many I/O groups if run serially. The problem with uncompressed netcdf I/O is that it makes a MUCH bigger disk image, but since we can't upgrade to HDF5/1.12.2 for now, that's our only option. Just tell management we'll be using a lot more disk.

GeorgeVandenberghe-NOAA commented 2 years ago

When running this workflow, what is the definition of "deterministic forecast"? There is a set of forecast jobs for the enkf members run out to 9 hours and another set for the ensemble forecasts run out to 390 hours. Which ones are you referring to, and what are the failure patterns of the two sets? Are we seeing the HDF failures in the long ensemble forecasts on read, going out to 390 hours, or are these also writing bad netcdf files?

Also, are the 9-hour ones hanging occasionally, the 390-hour ones hanging occasionally, or both?

KateFriedman-NOAA commented 2 years ago

When running this workflow, what is the definition of "deterministic forecast"?

When I say "deterministic forecast" I am referring to the gdas and gfs forecast jobs and separating them from the ensemble/enkf forecast jobs.

Let me summarize current issues/status based on resolution:

KateFriedman-NOAA commented 1 year ago

Deferring the introduction of version files in develop to the future. See issue #671. Closing this issue now.