NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

zeros in ocean post grib2 files on hera #2615

Closed JessicaMeixner-NOAA closed 1 week ago

JessicaMeixner-NOAA commented 1 month ago

What is wrong?

When running with the sea-ice PR that was just merged, so essentially develop as of today, it was noticed by @SulagnaRay-NOAA that all of the ocean grib2 files are constant values (mostly zeros). The native model output is not zeros and the ice gribs also appear to be okay.

Investigation as to what is going on and why is ongoing.

What should have happened?

We should have grib2 output files that match the native model output (and have non-zero, non-constant values).

What machines are impacted?

Hera

Steps to reproduce

This was discovered running a C384 test case of C384mx025_3DVarAOWCDA. However, I suspect other test cases would expose this issue as well.

Some example output can be found here: /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/cold03/COMROOT/cold03/gfs.20210703/06/products/ocean/grib2/0p25

Log files can be found here: /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/cold03/COMROOT/cold03/logs/2021070306

Additional information

@GwenChen-NOAA @jiandewang @SulagnaRay-NOAA @LydiaStefanova-NOAA @guillaumevernieres @CatherineThomas-NOAA FYI - any additional information or help is appreciated!

Do you have a proposed solution?

Not yet...

jiandewang commented 1 month ago

I compared tripole.mx025.Ct.to.rect.1p00.conserve.nc between the two; it looks like there is a 360 offset between them:

xc_a = -299.718339695101, -299.47037035674, -299.22239891217 <-- HERA
xc_a = 60.2816603048989, 60.5296296432605, 60.7776010878256 <-- wcoss2
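A quick way to check whether two longitude arrays such as these describe the same meridians is to compare them modulo 360. The sketch below uses only the three xc_a values quoted in this comment; the helper names are illustrative and are not part of ocnicepost or the fix-file tooling.

```python
# Longitudes (xc_a) quoted above from the two copies of the fix file.
hera_lon = [-299.718339695101, -299.47037035674, -299.22239891217]
wcoss2_lon = [60.2816603048989, 60.5296296432605, 60.7776010878256]


def same_meridian(lon_a, lon_b, tol=1e-9):
    """True when two longitudes differ by a whole number of turns (within tol degrees)."""
    # Fold the difference into [-180, 180) so that +/-360 collapses to ~0.
    diff = (lon_a - lon_b + 180.0) % 360.0 - 180.0
    return abs(diff) < tol


# Raw offsets between the two files: each one is very close to 360 degrees.
raw_offsets = [w - h for h, w in zip(hera_lon, wcoss2_lon)]

# After folding, every pair refers to the same meridian.
all_same = all(same_meridian(h, w) for h, w in zip(hera_lon, wcoss2_lon))

print(raw_offsets)
print(all_same)
```

If the check passes for the whole array, the two files differ only in longitude convention, which on its own would not be expected to zero out the interpolated fields.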

jiandewang commented 1 month ago

@EricSinsky-NOAA can we re-run the executable here offline? /scratch1/NCEPDEV/climate/Jiande.Wang/working/scratch/ocean-zero-value/oceanice_products.3448181 What modules do we need to load? I tried but got this error: ./ocnicepost.x: symbol lookup error: ./ocnicepost.x: undefined symbol: netcdf_mp_nf90open

But I do have the netcdf4 and hdf5 modules loaded.

EricSinsky-NOAA commented 1 month ago

@jiandewang Good find. It looks like there is a 360 offset between the 20231219 version and the 20240416 version of these fix files. These can be both found on Hera:

Version used in HR3 (20231219): /scratch1/NCEPDEV/global/glopara/fix/mom6/20231219/post/mx025/tripole.mx025.Ct.to.rect.1p00.conserve.nc

Newer version (20240416): /scratch1/NCEPDEV/global/glopara/fix/mom6/20240416/post/mx025/tripole.mx025.Ct.to.rect.1p00.conserve.nc

EricSinsky-NOAA commented 1 month ago

> @EricSinsky-NOAA can we re-run the executable here offline? /scratch1/NCEPDEV/climate/Jiande.Wang/working/scratch/ocean-zero-value/oceanice_products.3448181 What modules do we need to load? I tried but got this error: ./ocnicepost.x: symbol lookup error: ./ocnicepost.x: undefined symbol: netcdf_mp_nf90open
>
> But I do have the netcdf4 and hdf5 modules loaded.

I have run ocnicepost.x offline before, but it has been a couple of months.

EricSinsky-NOAA commented 1 month ago

@jiandewang I would start by executing source ush/load_fv3gfs_modules.sh before running ocnicepost.x offline.

jiandewang commented 1 month ago

@EricSinsky-NOAA what's wrong with what I did below? Why did it add an extra "/" before "ush"?

cd /scratch1/NCEPDEV/climate/Jiande.Wang/working/scratch/ocean-zero-value/global-workflow
source ush/load_fv3gfs_modules.sh
Loading modules quietly...
-bash: /ush/detect_machine.sh: No such file or directory
-bash: /ush/module-setup.sh: No such file or directory
-bash: /versions/run.ver: No such file or directory
WARNING: UNKNOWN PLATFORM
No modules loaded

EricSinsky-NOAA commented 1 month ago

@jiandewang I am getting the same error too when I try to load modules using load_fv3gfs_modules.sh. However, I did a quick test in /lfs/h2/emc/stmp/eric.sinsky/RUNDIRS/gw_ocnbugfix2/oceanice_products.242828 and was able to execute ocnicepost.x offline. These are the modules I have loaded: [image]

WalterKolczynski-NOAA commented 1 month ago

> @EricSinsky-NOAA what's wrong with what I did below? Why did it add an extra "/" before "ush"?
>
> cd /scratch1/NCEPDEV/climate/Jiande.Wang/working/scratch/ocean-zero-value/global-workflow
> source ush/load_fv3gfs_modules.sh
> Loading modules quietly...
> -bash: /ush/detect_machine.sh: No such file or directory
> -bash: /ush/module-setup.sh: No such file or directory
> -bash: /versions/run.ver: No such file or directory
> WARNING: UNKNOWN PLATFORM
> No modules loaded

Do this first:

export HOMEgfs="/scratch1/NCEPDEV/climate/Jiande.Wang/working/scratch/ocean-zero-value/global-workflow"

jiandewang commented 1 month ago

> @jiandewang I am getting the same error too when I try to load modules using load_fv3gfs_modules.sh. However, I did a quick test in /lfs/h2/emc/stmp/eric.sinsky/RUNDIRS/gw_ocnbugfix2/oceanice_products.242828 and was able to execute ocnicepost.x offline. These are the modules I have loaded: [image]

@EricSinsky-NOAA can you paste your module list here as text so that I can copy it?

EricSinsky-NOAA commented 1 month ago

craype-x86-rome
libfabric/1.11.0.0.
craype-network-ofi
envvar/1.0
intel/19.1.3.304
PrgEnv-intel/8.1.0
imagemagick/7.0.8-7
subversion/1.14.0
libjpeg/9c
grib_util/1.2.2
wgrib2/2.0.8_wmo
GrADS/2.2.2
ecflow/5.6.0.11
cdo/1.9.8
udunits/2.2.28
ncview/2.1.7
python/3.8.6
proj/7.1.0
geos/3.8.1
prod_util/2.0.14
w3nco/2.4.1
core/rocoto/1.3.5
hdf5/1.10.6
netcdf/4.7.4

jiandewang commented 1 month ago

@EricSinsky-NOAA I see you are testing on wcoss2. Can you repeat your testing on HERA but use the following as a template ? /scratch1/NCEPDEV/climate/Jiande.Wang/working/scratch/ocean-zero-value/oceanice_products.3448181

EricSinsky-NOAA commented 1 month ago

@jiandewang I just ran ocnicepost.x offline on Hera using your template. The interpolated output can be found here: /scratch2/NCEPDEV/ensemble/noscrub/Eric.Sinsky/ocnpost_bugfix/oceanice_products.3448181/ocean.0p25.nc

jiandewang commented 1 month ago

> @jiandewang I just ran ocnicepost.x offline on Hera using your template. The interpolated output can be found here: /scratch2/NCEPDEV/ensemble/noscrub/Eric.Sinsky/ocnpost_bugfix/oceanice_products.3448181/ocean.0p25.nc

Can you share your module list on HERA?

Also, can you replace the fix file with /scratch2/NCEPDEV/ensemble/noscrub/Eric.Sinsky/ocnpost_bugfix/oceanice_products.3448181/fixed-file-wcoss2 and re-run it?

WalterKolczynski-NOAA commented 1 month ago

> > @jiandewang I just ran ocnicepost.x offline on Hera using your template. The interpolated output can be found here: /scratch2/NCEPDEV/ensemble/noscrub/Eric.Sinsky/ocnpost_bugfix/oceanice_products.3448181/ocean.0p25.nc
>
> Can you share your module list on HERA?
>
> Also, can you replace the fix file with /scratch2/NCEPDEV/ensemble/noscrub/Eric.Sinsky/ocnpost_bugfix/oceanice_products.3448181/fixed-file-wcoss2 and re-run it?

@jiandewang If you export HOMEgfs first (see above), load_fv3gfs_modules.sh should work
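The stray leading "/" in the earlier error messages is consistent with paths being built from HOMEgfs while that variable is unset, so the prefix expands to an empty string. The sketch below mimics that expansion in Python; it assumes the module script composes paths as "${HOMEgfs}/ush/...", which the thread implies but does not show, and the function names and placeholder path are hypothetical.

```python
import os


def script_path(relative, env=None):
    """Mimic the shell expansion "${HOMEgfs}/<relative>"; an unset variable expands to ""."""
    env = os.environ if env is None else env
    return env.get("HOMEgfs", "") + "/" + relative


def require_homegfs(env=None):
    """Fail early with a clear message instead of sourcing a path rooted at "/"."""
    env = os.environ if env is None else env
    home = env.get("HOMEgfs", "")
    if not home:
        raise RuntimeError("HOMEgfs is not set; export it before loading modules")
    return home


unset_env = {}  # HOMEgfs missing, as in the failing session
set_env = {"HOMEgfs": "/path/to/global-workflow"}  # placeholder path

print(script_path("ush/detect_machine.sh", unset_env))  # /ush/detect_machine.sh
print(script_path("ush/detect_machine.sh", set_env))    # /path/to/global-workflow/ush/detect_machine.sh
```

A guard like require_homegfs would turn the three "No such file or directory" messages into a single, more direct error.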

jiandewang commented 1 month ago

> > > @jiandewang I just ran ocnicepost.x offline on Hera using your template. The interpolated output can be found here: /scratch2/NCEPDEV/ensemble/noscrub/Eric.Sinsky/ocnpost_bugfix/oceanice_products.3448181/ocean.0p25.nc
> >
> > Can you share your module list on HERA? Also, can you replace the fix file with /scratch2/NCEPDEV/ensemble/noscrub/Eric.Sinsky/ocnpost_bugfix/oceanice_products.3448181/fixed-file-wcoss2 and re-run it?
>
> @jiandewang If you export HOMEgfs first (see above), load_fv3gfs_modules.sh should work

@WalterKolczynski-NOAA no more module loading error after I did export HOMEgfs=.... Thanks

EricSinsky-NOAA commented 1 month ago

Thanks @WalterKolczynski-NOAA. Adding HOMEgfs to my environment allowed me to successfully execute load_fv3gfs_modules.sh.

EricSinsky-NOAA commented 1 month ago

@jiandewang After replacing the fix files with /scratch2/NCEPDEV/ensemble/noscrub/Eric.Sinsky/ocnpost_bugfix/oceanice_products.3448181/fixed-file-wcoss2 and rerunning, I am still getting all zeroes.

JessicaMeixner-NOAA commented 1 month ago

My test run of C48 on wcoss2 did not do well: /lfs/h2/emc/couple/noscrub/jessica.meixner/testoceanpost/hr3/test01/COMROOT/c48t01/gfs.20210323/12/products/ocean/grib2/5p00

EricSinsky-NOAA commented 1 month ago

Thank you, @JessicaMeixner-NOAA. It sounds like this might be an issue with the build of ocnicepost.x on WCOSS2 and Hera. @jiandewang When you ran your HR3 test and you got reasonable interpolated ocean output, did you rebuild ocnicepost.x (as well as the other executables related to HR3) during your test?

jiandewang commented 1 month ago

> Thank you, @JessicaMeixner-NOAA. It sounds like this might be an issue with the build of ocnicepost.x on WCOSS2 and Hera. @jiandewang When you ran your HR3 test and you got reasonable interpolated ocean output, did you rebuild ocnicepost.x (as well as the other executables related to HR3) during your test?

No, I just used my original *.x executables from several months ago.

JessicaMeixner-NOAA commented 1 month ago

I did a new build, but I did have an old build too... I'll try the 0.25 case w/the new build and I'll also try using my old build on a C48 case and see what happens.

JessicaMeixner-NOAA commented 1 month ago

Update:

Therefore, I think there are likely issues with all of the 5 deg cases and so we should not be using that to see if things are working or not.

EricSinsky-NOAA commented 1 month ago

@JessicaMeixner-NOAA Glad to see you are getting non-zeroes for C768mx025. Were the C768mx025 test cases also based on the HR3 tag (not just the C48mx500 test case)? Also did you run the C768mx025 test case using both your old build and new build too?

Also, I ran an old version of ocnicepost offline. I got non-zeroes in the interpolated NetCDF output. In this test, however, the resolution of the NetCDF input (MOM6) data was mx025.

JessicaMeixner-NOAA commented 1 month ago

@EricSinsky-NOAA It is nice to see some non-zero values, for sure!!

The tests I ran with the HR3 tag, I ran both the old build and the new build and both had non-zeros.

EricSinsky-NOAA commented 1 month ago

This is my understanding on what we know so far:

JessicaMeixner-NOAA commented 1 month ago

@EricSinsky-NOAA I'd say that we get zeros with the newest hashes. Where the mx025 issues come in between the HR3 tag and now is an open question, I think. Since most of our previous testing was based on mx500, I'm not sure we have a lot of information about the in-between commits. I'm going to run a few tests on WCOSS2 to see if we can narrow down issues there.

aerorahul commented 1 month ago

Thank you @EricSinsky-NOAA for the summary and @JessicaMeixner-NOAA for the additional information.

A few questions:

I'd say we need to find a baseline that works first; I think we have that for the C768mx025 case with the HR3 tag. Unfortunately, C48mx500 with the HR3 tag resulted in zeros.

JessicaMeixner-NOAA commented 1 month ago

For the HR3 tag on WCOSS2 the mom6 fix files are: mom6 -> /lfs/h2/emc/global/noscrub/emc.global/FIX/fix/mom6/20231219

I'm currently trying to test the commit before the fix file change on wcoss2 with mx025 to see if that works. I did find an experiment on Hera in which a case using the old fix files and mx025 still gave me zeros...

JessicaMeixner-NOAA commented 1 month ago

I ran with mx025 on WCOSS2 for commit hashes https://github.com/NOAA-EMC/global-workflow/commit/6ca106e6c0466d7165fc37b147e0e2735a1d6a0b and https://github.com/NOAA-EMC/global-workflow/commit/d5366c66bd67f89d118b18956fe230207cbf0aea (the one that changed the mom6 fix) and they both give me non-zero output for the grib2 files....

I can share paths if that's helpful. Has anyone tried anything mx025 on orion?

JessicaMeixner-NOAA commented 1 month ago

So some random thoughts before the weekend:

EricSinsky-NOAA commented 1 month ago

> Did we ever confirm that the reason for the diffs between wcoss2 and hera that @jiandewang saw were because of version numbers or were there actually differences?

@JessicaMeixner-NOAA The diffs between WCOSS2 and Hera are because the comparisons were between two different versions of the fix files. The fix files being compared from WCOSS2 are the 20231219 version, while the fix files being compared from Hera are the 20240416 version. Both fix file versions exist on both WCOSS2 and Hera. When the fix files of the same version are compared between WCOSS2 and Hera, the file sizes are identical.
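Identical file sizes are a good sign but do not prove identical content. A stricter check is to compare checksums of the two copies. The sketch below is one way to do that with the Python standard library; the function names are illustrative, and nothing here was run as part of this issue.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True only when both files have byte-for-byte identical content."""
    return sha256_of(path_a) == sha256_of(path_b)
```

Calling same_content on the same-version fix file from each machine (after copying one of them across) would confirm the files match exactly, not just in size.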

JessicaMeixner-NOAA commented 1 month ago

@EricSinsky-NOAA thanks for confirming that!

jiandewang commented 1 month ago

Some further testing results:

(1) The fix files, 20231219 version vs. 20240416 version: there is a 360 degree offset in longitude between them. The results generated with them are not identical, but the differences are at roundoff level (~E-8). So this is not the reason for the zero values in the regular grid file.

(2) In the HR3 run on wcoss2, which gave us correct results, the ocean master files are on 40 levels. However, in Jessica's HERA run (/scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.3448181) and Eric's run, the ocean.nc files are on 75 levels because the runs are set up as DA; see https://github.com/NOAA-EMC/global-workflow/blob/develop/parm/config/gfs/config.ufs#L454-L459

I used Jessica's run dir as a template but replaced ocean.nc with the one from the HR3 run (40L), and it then generated a correct regular grid file.

jiandewang commented 1 month ago

More testing results: it is the missing value that messed up the results. In the HR3 run it is -1e34, while in DA it is set to 0. After I reset the missing value to -1e34 in ocean.nc from Jessica's run dir, the interpolated results are correct. I think this missing value is embedded in the fix files, since they were generated from the output of one of the previous HRx runs, where it is -1e34. I did my test on wcoss2. I had trouble running it on HERA because of module loading.

@EricSinsky-NOAA: you may repeat your run but use my modified input file at /scratch1/NCEPDEV/climate/Jiande.Wang/working/scratch/ocean-zero-value/ceanice_products.3448181-JM/NCO2/ocean.nc-JM-75L-E34, or you can simply repeat your C48mx500 run but set https://github.com/NOAA-EMC/global-workflow/blob/develop/parm/config/gfs/config.ufs#L456C9-L456C31 to -1e34.

EricSinsky-NOAA commented 1 month ago

@jiandewang Thank you very much for finding the issue! I just ran the C48_S2SWA_gefs CI test case (MOM6 is set to mx500) using the most recent hash. I have set MOM6_DIAG_MISVAL to -1e34 in parm/config/gefs/config.ufs and this fixed the issue (non-zeroes in the interpolated ocean output).

EDIT: My test was on WCOSS2.

JessicaMeixner-NOAA commented 1 month ago

The exception value will need to be resolved with @guillaumevernieres and others, as DA might need the missing value to be set as 0.

@jiandewang what module issues did you have on hera? I was curious on Friday if we had module mis-match issues as a possible issue.

jiandewang commented 1 month ago

@JessicaMeixner-NOAA I followed Walter's method (the g-w I used is the cycled one you asked me to run). No errors popped up after I ran source ush/........., but when I ran ocnicepost.x it crashed while writing the 3D mask file.

jiandewang commented 1 month ago

A quick and dirty solution: apply this command in the script after the DA ocean files are generated: ncatted -a missing_value,,m,f,-1E34. That will make oceanpost happy.
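If this workaround were scripted from Python rather than typed by hand, the command could be assembled as an argument list. The sketch below only builds and prints the command. The input file name ocean.nc is taken from the run directories mentioned earlier in the thread, and applying the command to that file is an assumption, since the comment above does not name the file.

```python
import shlex


def ncatted_missing_value_cmd(nc_path, value="-1E34"):
    """Build the ncatted call that modifies missing_value on all variables."""
    # -a att_name,var_name,mode,att_type,att_value: an empty var_name means
    # every variable, "m" is modify mode, "f" is a float attribute.
    return ["ncatted", "-a", f"missing_value,,m,f,{value}", nc_path]


cmd = ncatted_missing_value_cmd("ocean.nc")  # assumed target file
print(shlex.join(cmd))  # ncatted -a missing_value,,m,f,-1E34 ocean.nc

# To actually run it (ncatted edits the file in place when no output file is given):
# import subprocess; subprocess.run(cmd, check=True)
```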

DeniseWorthen commented 1 month ago

Apologies for being late to the party. Am I understanding that the missing value is defined as 0.0 in the history file? A missing value of 0.0 makes no sense to me, since it is also a valid value. How do you distinguish where Temp=0 because it really is 0.0C and where it is 0 because it is a land point?
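The ambiguity raised here can be shown with a toy example. The sketch below derives a land/ocean mask by comparing values against the missing value, once with 0.0 and once with -1e34. It is only an illustration of the concern and does not claim to reproduce how ocnicepost builds its mask.

```python
MISSING_DA = 0.0     # missing value used in the DA configuration
MISSING_HR3 = -1e34  # missing value used in the HR3 run


def mask_from_fill(values, missing):
    """Mark a point as ocean (True) when its value differs from the missing value."""
    return [v != missing for v in values]


# Three points: warm ocean, ocean at exactly 0.0 C, and land.
temps_da = [12.5, 0.0, MISSING_DA]
temps_hr3 = [12.5, 0.0, MISSING_HR3]

print(mask_from_fill(temps_da, MISSING_DA))    # [True, False, False]
print(mask_from_fill(temps_hr3, MISSING_HR3))  # [True, True, False]
```

With 0.0 as the missing value, the valid 0.0 C ocean point is indistinguishable from land; with -1e34 it is kept.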

jiandewang commented 1 month ago

> Apologies for being late to the party. Am I understanding that the missing value is defined as 0.0 in the history file? A missing value of 0.0 makes no sense to me, since it is also a valid value. How do you distinguish where Temp=0 because it really is 0.0C and where it is 0 because it is a land point?

@DeniseWorthen see https://github.com/NOAA-EMC/global-workflow/blob/develop/parm/config/gfs/config.ufs#L456C9-L456C31

DeniseWorthen commented 1 month ago

@jiandewang Thanks, but that doesn't answer my question really. How is a missing value of 0.0 being distinguished from a physical value of 0.0?

guillaumevernieres commented 1 month ago

> @jiandewang Thanks, but that doesn't answer my question really. How is a missing value of 0.0 being distinguished from a physical value of 0.0?

@DeniseWorthen, you just don't construct your mask based on the fill value.

DeniseWorthen commented 1 month ago

@guillaumevernieres Thanks. So where does your mask come from?

edit: I mean, which file? Are you retrieving it from the model output or are you using something else?

guillaumevernieres commented 1 month ago

> @guillaumevernieres Thanks. So where does your mask come from?
>
> edit: I mean, which file? Are you retrieving it from the model output or are you using something else?

We use the mom6 grid generation functionality but this is overkill for this issue. The mask could simply be constructed using the layer thicknesses.
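A mask built from layer thicknesses could look roughly like the sketch below: a cell counts as ocean when its layer thickness exceeds a small threshold, regardless of what the tracer fields contain. The threshold value, the function name, and the nested-list layout thickness[k][j][i] are assumptions made for illustration and are not taken from MOM6 or ocnicepost.

```python
H_MIN = 1.0e-3  # hypothetical minimum thickness (m) for a cell to count as ocean


def wet_mask(thickness, h_min=H_MIN):
    """Return mask[k][j][i], True where the layer thickness exceeds h_min."""
    return [[[h > h_min for h in row] for row in level] for level in thickness]


# Two levels, one row, three columns: deep ocean, shallow ocean, land.
thickness = [
    [[10.0, 10.0, 0.0]],  # level 0
    [[10.0, 0.0, 0.0]],   # level 1: only the deep column is still wet
]
temperature = [
    [[4.0, 0.0, 0.0]],    # 0.0 at the shallow point is a real value
    [[2.0, 0.0, 0.0]],
]

mask = wet_mask(thickness)
print(mask)  # [[[True, True, False]], [[True, False, False]]]
```

In this toy case the shallow column keeps its valid 0.0 temperature at level 0, while the land column is masked at every level even though both hold 0.0.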

JessicaMeixner-NOAA commented 3 weeks ago

A PR has been created so that for GFS or GEFS versus GDAS/ENKF we have different exception values and number of layers for MOM6. This should be able to resolve this problem, although in the future, it might be good to still explore updating how the mask is defined in the ocean post.
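The idea of run-dependent settings can be sketched as a simple lookup. The values below are the ones reported earlier in this thread (40 levels with -1e34 for the HR3 forecast run, 75 levels with 0 for DA); the actual PR may use different values, and the real configuration lives in the shell config files rather than in Python. The run-name matching and the function name are hypothetical.

```python
# Values as reported in this thread; the PR itself may differ.
FORECAST_SETTINGS = {"levels": 40, "MOM6_DIAG_MISVAL": "-1e34"}
DA_SETTINGS = {"levels": 75, "MOM6_DIAG_MISVAL": "0.0"}


def mom6_settings_for(run):
    """Pick MOM6 output settings by run type (GFS/GEFS vs GDAS/ENKF)."""
    name = run.lower()
    if name.startswith(("gdas", "enkf")):
        return DA_SETTINGS
    if name.startswith(("gfs", "gefs")):
        return FORECAST_SETTINGS
    raise ValueError(f"unknown run type: {run}")


print(mom6_settings_for("gfs"))
print(mom6_settings_for("gdas"))
```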