NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

zeros in ocean post grib2 files on hera #2615

Closed JessicaMeixner-NOAA closed 1 week ago

JessicaMeixner-NOAA commented 1 month ago

What is wrong?

When running with the sea-ice PR that was just merged, so essentially develop as of today, it was noticed by @SulagnaRay-NOAA that all of the ocean grib2 files are constant values (mostly zeros). The native model output is not zeros and the ice gribs also appear to be okay.

Investigation as to what is going on and why is ongoing.

What should have happened?

We should have grib2 output files that match the native model output (i.e., with non-zero, non-constant values).

What machines are impacted?

Hera

Steps to reproduce

This was discovered running a C384 test case of C384mx025_3DVarAOWCDA. However, I suspect other test cases would expose this issue as well.

Some example output can be found here: /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/cold03/COMROOT/cold03/gfs.20210703/06/products/ocean/grib2/0p25

Log files can be found here: /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/cold03/COMROOT/cold03/logs/2021070306

Additional information

@GwenChen-NOAA @jiandewang @SulagnaRay-NOAA @LydiaStefanova-NOAA @guillaumevernieres @CatherineThomas-NOAA FYI - any additional information or help is appreciated!

Do you have a proposed solution?

Not yet...

jiandewang commented 1 month ago

@JessicaMeixner-NOAA we need to check the regular grid ocean nc files (which are used as input for converting to grib2), but they were erased in the g-w runs. For example, the following doesn't exist anymore: /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.2629439

JessicaMeixner-NOAA commented 1 month ago

@jiandewang I'll rewind and re-run one of them and save the rundir. I'll post back here when I have that.

JessicaMeixner-NOAA commented 1 month ago

Here's the saved output @jiandewang :

TMP: /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.4064953
LOG: /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/cold03/COMROOT/cold03/logs/2021070306/gfsocean_prod_f234-f240.log
COM: /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/cold03/COMROOT/cold03/gfs.20210703/06/products/ocean

jiandewang commented 1 month ago

@JessicaMeixner-NOAA quick check for these three files:
ocean.nc: ocean native grid master file, looks good
ocean.0p25.nc: regular grid, all zero
ocean.1p00.nc: regular grid, all zero

So the problem happened in the tripolar-to-regular-grid step. Let me go through the log file to see if there is any clue.
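The quick check above (native file fine, regridded files all zero) amounts to asking whether each field is constant. A minimal sketch of that test in plain Python follows. `field_summary` is a hypothetical helper, the numbers are illustrative rather than taken from the actual files, and reading the values out of the netCDF variables is left to whatever reader is available.

```python
import math


def field_summary(values, fill_value=None):
    """Return (min, max, is_constant) for a flat sequence of field values.

    Hypothetical helper: NaNs and entries equal to fill_value are skipped,
    so masked points do not hide a constant field.
    """
    valid = [
        v for v in values
        if not (isinstance(v, float) and math.isnan(v))
        and (fill_value is None or v != fill_value)
    ]
    if not valid:
        return None, None, True  # nothing but missing data
    lo, hi = min(valid), max(valid)
    return lo, hi, lo == hi


# Illustrative numbers only, not taken from the actual files.
native = [271.3, 285.0, 301.2, 1.0e20]
regridded = [0.0, 0.0, 0.0, 0.0]

print(field_summary(native, fill_value=1.0e20))
print(field_summary(regridded))
```

A field whose minimum equals its maximum is flagged as constant, which covers both the all-zero case and any other constant value.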

jiandewang commented 1 month ago

@JessicaMeixner-NOAA can you re-run it but set debug to true ? see last line /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.4064953/ocnicepost.nml
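For anyone repeating this step, one way to flip the flag without hand-editing is sketched below. It assumes the namelist contains a line of the form `debug = .false.` (the thread only says the entry is on the last line of ocnicepost.nml); the group name and the other entry in the sample are hypothetical placeholders.

```python
import re


def set_namelist_debug(text, enabled=True):
    """Return namelist text with an existing `debug` entry set to .true. or .false.

    Only an existing `debug = .true./.false.` assignment is rewritten;
    nothing is added if the entry is missing.
    """
    value = ".true." if enabled else ".false."
    pattern = re.compile(
        r"^(\s*debug\s*=\s*)\.(?:true|false)\.",
        re.IGNORECASE | re.MULTILINE,
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + value, text)
    if count == 0:
        raise ValueError("no debug entry found in namelist text")
    return new_text


# Hypothetical namelist fragment; the real ocnicepost.nml has other entries.
sample = "&example_nml\n  other_option = 1\n  debug = .false.\n/\n"
print(set_namelist_debug(sample))
```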

JessicaMeixner-NOAA commented 1 month ago

@jiandewang here's the output with debug=true: /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.3448181

aerorahul commented 1 month ago

The output with debug = .true. is tracing the code execution. I did a ncview and ncdump on intermediate files e.g. ocean.0p25.rdbilin3d.nc, etc., but I am unable to get any clues from them. I wondered if there has been a change in the interpolation weights. So, I looked at /scratch1/NCEPDEV/global/glopara/fix/mom6/20240416/post/mx025/ and the timestamp on these files is 20240403 which seems reasonable.

If needed, I can dig deeper into the interpolation code.

JessicaMeixner-NOAA commented 1 month ago

@GwenChen-NOAA do you have an idea as to what is going on? We'd appreciate your help to determine issues here.

jiandewang commented 1 month ago

I am trying to understand the run sequence for this post job: the fcst step generates ocean_native_nc, which is then copied as ocean.nc, and key variables are further cut out and saved as ocean_subset.nc. Which one is being used as input for post? ocean.nc or ocean_subset.nc?

ls -l /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.3448181/ocean*nc

-rw-r--r-- 1 Jessica.Meixner climate 1328960900 May 22 10:46 ocean.0p25.nc
-rw-r--r-- 1 Jessica.Meixner climate 83412020 May 22 10:45 ocean.1p00.nc
-rw-r--r-- 1 Jessica.Meixner climate 2090477767 May 21 13:06 ocean.nc
-rw-r--r-- 1 Jessica.Meixner climate 1959785283 May 22 10:46 ocean_subset.nc

ocean.1p00.nc is generated 1 minute before ocean_subset.nc

I looked at line 74 of /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.3448181/ocean.post.log. It shows the min/max before and after the interpolation, and the numbers there are totally fine. But somehow, when we look at the final products, they are all zero. Really puzzled here.

GwenChen-NOAA commented 1 month ago

@JessicaMeixner-NOAA, can you provide the sea-ice PR number that just merged? It will be helpful to look at the code changes.

GwenChen-NOAA commented 1 month ago

> I am trying to understand the run sequence for this post job: the fcst step generates ocean_native_nc, which is then copied as ocean.nc, and key variables are further cut out and saved as ocean_subset.nc. Which one is being used as input for post? ocean.nc or ocean_subset.nc?

@jiandewang, the ocean.nc files are used to generate grib2 files. The ocean_subset.nc files are moved to the /products directory as the netcdf products to be distributed through NOMADS.

JessicaMeixner-NOAA commented 1 month ago

@jiandewang I think ocean.nc is used to create ocean_subset.nc - I could be wrong... let me look into that more.

@GwenChen-NOAA - The PR is https://github.com/NOAA-EMC/global-workflow/pull/2584. I did just confirm that output from hera from before this PR was merged also had the issue where the grib files were all zeros, so the sea-ice analysis PR is not the cause of this problem. I'm not sure how long this issue has been in the develop branch, or whether it's just a hera issue or something else.

GwenChen-NOAA commented 1 month ago

> @GwenChen-NOAA - The PR is #2584. I did just confirm that output from hera from before this PR was merged also had the issue where the grib files were all zeros, so the sea-ice analysis PR is not the cause of this problem. I'm not sure how long this issue has been in the develop branch, or whether it's just a hera issue or something else.

@JessicaMeixner-NOAA, can you run it on WCOSS2? I know the downstream package can only run on WCOSS2.

JessicaMeixner-NOAA commented 1 month ago

@GwenChen-NOAA The ocean post products should be able to be generated on RDHPCS, not just WCOSS2. I don't have a workflow set up there right now, so it would be great if you could try that out to see if it works.

I did find an old run, from when I was trying to update the ufs-weather-model to a more recent version, and it has non-zero fields: /scratch1/NCEPDEV/climate/Jessica.Meixner/testgw2505/test02/COMROOT/test02/gfs.20191203/00/products/ocean/grib2/1p00/gfs.ocean.t00z.1p00.f072.grib2 (for example). The g-w commit for that run was updated from an April 17th commit. We could also look into whether the modules for hera were updated within the ufs-weather-model between the updates, as I do think that this job is using the ufs-weather-model.

JessicaMeixner-NOAA commented 1 month ago

Okay I did confirm that the ufs-weather-model modules have not changed on hera, so it's not just that.

JessicaMeixner-NOAA commented 1 month ago

@EricSinsky-NOAA I see that you've been running some ocean/ice post recently. Thought I'd ping you in this to see if you've noticed that grib files of the ocean were zeros or constant in any of your testing.

EricSinsky-NOAA commented 1 month ago

@JessicaMeixner-NOAA I just ran the C48_S2SWA_gefs CI test case today using the most recent hash (7d2c539). I also see all zeroes in the gridded (5 degree) ocean data. The data is all zeroes in the gridded NetCDF data as well (not just the gridded grib2 data).

JessicaMeixner-NOAA commented 1 month ago

> @JessicaMeixner-NOAA I just ran the C48_S2SWA_gefs CI test case today using the most recent hash (7d2c539). I also see all zeroes in the gridded (5 degree) ocean data. The data is all zeroes in the gridded NetCDF data as well (not just the gridded grib2 data).

@EricSinsky-NOAA thanks for the info! what machine was that on?

EricSinsky-NOAA commented 1 month ago

> @EricSinsky-NOAA thanks for the info! what machine was that on?

@JessicaMeixner-NOAA This test was on Cactus.

JessicaMeixner-NOAA commented 1 month ago

Thanks @EricSinsky-NOAA, seems like this is not just a hera issue then.

I'm re-running my case on hera where I went back and found that I had the output I expected. I'm then going to merge in develop and see how that goes as well. Hopefully I will have an update on that this afternoon.

JessicaMeixner-NOAA commented 1 month ago

Okay, my re-run of a case where I thought I had previously gotten non-zero grib2 output did not give me non-zeros this time... I believe that should rule out the model version, but I'm not sure what to look at now...

JessicaMeixner-NOAA commented 1 month ago

@GwenChen-NOAA when you tested this: https://github.com/NOAA-EMC/global-workflow/pull/2611 did you get non-zero grib2 output files?

GwenChen-NOAA commented 1 month ago

> @GwenChen-NOAA when you tested this: #2611 did you get non-zero grib2 output files?

@JessicaMeixner-NOAA, my test used an old version of the ocean.0p25.nc file (i.e., latlon netcdf file output from ocnicepost) and worked fine. I saw the ocean.0p25.nc file under /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.3448181 also contains all zero. I found a recent closed issue (#2483) that updated fix files for CICE and MOM6/post. Perhaps @DeniseWorthen can provide some clues here.

aerorahul commented 1 month ago

> > @GwenChen-NOAA when you tested this: #2611 did you get non-zero grib2 output files?
>
> @JessicaMeixner-NOAA, my test used an old version of the ocean.0p25.nc file (i.e., latlon netcdf file output from ocnicepost) and worked fine. I saw the ocean.0p25.nc file under /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.3448181 also contains all zero. I found a recent closed issue (#2483) that updated fix files for CICE and MOM6/post. Perhaps @DeniseWorthen can provide some clues here.

The issue #2483 only added/corrected the 5-degree fix file. It did not alter the 0.25-degree or 1.0-degree fix files.

JessicaMeixner-NOAA commented 1 month ago

Thanks @aerorahul for that information!

EricSinsky-NOAA commented 1 month ago

I just ran the C48_S2SW CI test case on Cactus using the 5/13/2024 commit hash (6ca106e). The gridded ocean data still consists of all zeroes as of the 5/13/2024 gw version. Will keep trying to go back to earlier commit hashes to get a better idea when and why this issue started.
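Walking back through commits one at a time can be shortened with a binary search, which is the idea `git bisect` automates. The sketch below assumes an ordered list of commits whose first entry is known good and whose last entry is known bad, and that the problem persists once it appears. The commit labels are placeholders, not real global-workflow hashes, and `is_bad` stands in for "build, run the ocean post, and check for zeros".

```python
def first_bad(commits, is_bad):
    """Binary-search an ordered commit list for the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    once the problem appears it stays (the usual bisect assumption).
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Placeholder labels, not real global-workflow hashes.
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6"]
bad = {"c4", "c5", "c6"}
print(first_bad(history, lambda c: c in bad))
```

With seven candidate commits this needs only two test runs instead of up to five, which matters when each test is a full workflow run.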

JessicaMeixner-NOAA commented 1 month ago

I updated to the latest version of ufs-weather-model on hera and ran another test and got all zeros in the gribs still. @EricSinsky-NOAA we know at least the HR3 tag 6f9afff from Feb 21st has non-zero gribs on wcoss2. On hera, the furthest back of g-w would be the rocky8 transition commit.

EricSinsky-NOAA commented 1 month ago

Thank you @JessicaMeixner-NOAA for confirming that we still had non-zero gribs as of Feb 21st. @jiandewang When you checked PR #2484 on April 17th (this PR added a more strict dependency to the ocean_prod rocoto task), do you remember if the gridded netcdf/grib2 data was non-zero? I just completed a test (C48_S2SW on Cactus) from an April 16th hash and the netcdf gridded data consists of all zeroes.

jiandewang commented 1 month ago

@EricSinsky-NOAA I recall Rahul asked me to test it based on HR3 but manually modified xml file. That worked fine and I checked the ocean regular and grib2 files at that time. They were fine.

jiandewang commented 1 month ago

@EricSinsky-NOAA can you repeat the C48 CI or whatever test case you have, but add something like a 5-minute sleep before the post job is triggered?

JessicaMeixner-NOAA commented 1 month ago

> @EricSinsky-NOAA can you repeat the C48 CI or whatever test case you have, but add something like a 5-minute sleep before the post job is triggered?

I don't think that should be the cause as my re-runs yesterday were after the full forecast was completed, so there would be no issue of files not being there completely, I would think?

EricSinsky-NOAA commented 1 month ago

Thank you @jiandewang! Sure, I am increasing the sleep time to 5 minutes and am rerunning the C48 CI. However, based on what @JessicaMeixner-NOAA said about her re-runs, this might not be the reason for the zeroes in the gridded data.

jiandewang commented 1 month ago

> Thank you @jiandewang! Sure, I am increasing the sleep time to 5 minutes and am rerunning the C48 CI. However, based on what @JessicaMeixner-NOAA said about her re-runs, this might not be the reason for the zeroes in the gridded data.

agree, 99% chance this is not the reason

aerorahul commented 1 month ago

FWIW, I cloned the hash d6be3b5c corresponding to the PR #2421. I setup and ran a C48_S2S test on Hera.

The model output contains reasonable values. [Image: Screenshot 2024-05-23 at 11 25 12 AM]

The interpolation output, however, contains zeros. [Image: Screenshot 2024-05-23 at 11 25 45 AM]

If there is someone willing to re-do this exact test on Orion/WCOSS2, we could narrow the issue down between the software stack on Hera and the interpolation code.

edit: One does not need to re-run the entire experiment, just clone and build this hash and re-run the ocean post code with the model output from Hera. Everything needed is in: /scratch1/NCEPDEV/stmp2/Rahul.Mahajan/RUNDIRS/zeros/oceanice_products.3127422

EricSinsky-NOAA commented 1 month ago

@aerorahul I just ran C48_S2SW on WCOSS2 using hash fa855ba from March 18th (prior to the Rocky 8 hash that you tested). The raw model ocean output contains reasonable values, but the interpolated ocean output is all zeroes.

aerorahul commented 1 month ago

I ran the ocnicepost.x on Orion with the output from Hera and the interpolated output has zeros!

JessicaMeixner-NOAA commented 1 month ago

Okay, I'm going to run the C48 S2SW CI test with the HR3 tag on wcoss2, hopefully that works as we expect....

jiandewang commented 1 month ago

> Okay, I'm going to run the C48 S2SW CI test with the HR3 tag on wcoss2, hopefully that works as we expect....

I am repeating one of the HR3 runs on wcoss now

EricSinsky-NOAA commented 1 month ago

Thank you for testing the HR3 tag, @JessicaMeixner-NOAA. I just tested the gw hash (https://github.com/NOAA-EMC/global-workflow/commit/9608852784871ebf03d92b53bde891b6dcab8684) from 2/26/2024 on WCOSS2 (C48 S2SW CI test). I am still getting all zeroes in the interpolated ocean output. I also wonder if this same issue would occur for a case initialized at 00Z.

jiandewang commented 1 month ago

just had one ocean post done (HR3 tag on wcoss2); the gridded ocean file looks fine. See on cactus: /lfs/h2/emc/ptmp/jiande.wang/HR3-work/RUNDIRS/HR3-20191203/ocean.1p00.nc

EricSinsky-NOAA commented 1 month ago

Thank you @jiandewang. It looks like the case you tested is a 00Z run (2019120300). It will be interesting to see if @JessicaMeixner-NOAA also gets reasonable gridded ocean output for the C48 CI test case (12Z run).

jiandewang commented 1 month ago

one more clue: in my just finished wcoss2 HR3 run, /lfs/h2/emc/ptmp/jiande.wang/HR3-work/RUNDIRS/HR3-20191203/oceanice_products.73074

jiande.wang@clogin02:/lfs/h2/emc/ptmp/jiande.wang/HR3-work/RUNDIRS/HR3-20191203/oceanice_products.73074> ls -l tr*nc
-rw-r--r-- 1 jiande.wang emc 443660244 Oct 25 2023 tripole.mx025.Bu.to.Ct.bilinear.nc
-rw-r--r-- 1 jiande.wang emc 322230100 Oct 25 2023 tripole.mx025.Ct.to.rect.0p25.bilinear.nc
-rw-r--r-- 1 jiande.wang emc 344591848 Oct 25 2023 tripole.mx025.Ct.to.rect.0p25.conserve.nc
-rw-r--r-- 1 jiande.wang emc 165958772 Oct 25 2023 tripole.mx025.Ct.to.rect.1p00.bilinear.nc
-rw-r--r-- 1 jiande.wang emc 193551336 Oct 25 2023 tripole.mx025.Ct.to.rect.1p00.conserve.nc
-rw-r--r-- 1 jiande.wang emc 410574804 Oct 25 2023 tripole.mx025.Cu.to.Ct.bilinear.nc
-rw-r--r-- 1 jiande.wang emc 443660244 Oct 25 2023 tripole.mx025.Cv.to.Ct.bilinear.nc

but in Jessica's run on HERA yesterday, it was /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.344818, but I think Jessica deleted it. Luckily I made a copy of it yesterday. So see on HERA: /scratch1/NCEPDEV/climate/Jiande.Wang/working/scratch/ocean-zero-value/oceanice_products.3448181

/scratch1/NCEPDEV/climate/Jiande.Wang/working/scratch/ocean-zero-value/oceanice_products.3448181[119]ll tr*nc
-r--r--r-- 1 Jiande.Wang climate 443660268 Apr 3 14:09 tripole.mx025.Bu.to.Ct.bilinear.nc
-r--r--r-- 1 Jiande.Wang climate 322230132 Apr 3 14:09 tripole.mx025.Ct.to.rect.0p25.bilinear.nc
-r--r--r-- 1 Jiande.Wang climate 344591884 Apr 3 14:09 tripole.mx025.Ct.to.rect.0p25.conserve.nc
-r--r--r-- 1 Jiande.Wang climate 165958804 Apr 3 14:09 tripole.mx025.Ct.to.rect.1p00.bilinear.nc
-r--r--r-- 1 Jiande.Wang climate 193551372 Apr 3 14:09 tripole.mx025.Ct.to.rect.1p00.conserve.nc
-r--r--r-- 1 Jiande.Wang climate 410574828 Apr 3 14:09 tripole.mx025.Cu.to.Ct.bilinear.nc
-r--r--r-- 1 Jiande.Wang climate 443660268 Apr 3 14:10 tripole.mx025.Cv.to.Ct.bilinear.nc

are they supposed to be the same size?
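To make the size question concrete, the two listings above can be compared file by file. The sketch below parses the `ls -l` lines exactly as posted (size is the fifth whitespace-separated field, the name is the last) and prints the per-file difference. `sizes_from_ls` is a helper written for this illustration.

```python
def sizes_from_ls(listing):
    """Map file name -> size in bytes from `ls -l` output lines."""
    sizes = {}
    for line in listing.strip().splitlines():
        fields = line.split()
        sizes[fields[-1]] = int(fields[4])
    return sizes


wcoss2 = """
-rw-r--r-- 1 jiande.wang emc 443660244 Oct 25 2023 tripole.mx025.Bu.to.Ct.bilinear.nc
-rw-r--r-- 1 jiande.wang emc 322230100 Oct 25 2023 tripole.mx025.Ct.to.rect.0p25.bilinear.nc
-rw-r--r-- 1 jiande.wang emc 344591848 Oct 25 2023 tripole.mx025.Ct.to.rect.0p25.conserve.nc
-rw-r--r-- 1 jiande.wang emc 165958772 Oct 25 2023 tripole.mx025.Ct.to.rect.1p00.bilinear.nc
-rw-r--r-- 1 jiande.wang emc 193551336 Oct 25 2023 tripole.mx025.Ct.to.rect.1p00.conserve.nc
-rw-r--r-- 1 jiande.wang emc 410574804 Oct 25 2023 tripole.mx025.Cu.to.Ct.bilinear.nc
-rw-r--r-- 1 jiande.wang emc 443660244 Oct 25 2023 tripole.mx025.Cv.to.Ct.bilinear.nc
"""

hera = """
-r--r--r-- 1 Jiande.Wang climate 443660268 Apr 3 14:09 tripole.mx025.Bu.to.Ct.bilinear.nc
-r--r--r-- 1 Jiande.Wang climate 322230132 Apr 3 14:09 tripole.mx025.Ct.to.rect.0p25.bilinear.nc
-r--r--r-- 1 Jiande.Wang climate 344591884 Apr 3 14:09 tripole.mx025.Ct.to.rect.0p25.conserve.nc
-r--r--r-- 1 Jiande.Wang climate 165958804 Apr 3 14:09 tripole.mx025.Ct.to.rect.1p00.bilinear.nc
-r--r--r-- 1 Jiande.Wang climate 193551372 Apr 3 14:09 tripole.mx025.Ct.to.rect.1p00.conserve.nc
-r--r--r-- 1 Jiande.Wang climate 410574828 Apr 3 14:09 tripole.mx025.Cu.to.Ct.bilinear.nc
-r--r--r-- 1 Jiande.Wang climate 443660268 Apr 3 14:10 tripole.mx025.Cv.to.Ct.bilinear.nc
"""

a, b = sizes_from_ls(wcoss2), sizes_from_ls(hera)
diffs = {name: b[name] - a[name] for name in a}
for name in sorted(diffs):
    print(f"{name}: {diffs[name]:+d} bytes")
```

Every Hera file is 24 to 36 bytes larger than its WCOSS2 counterpart, so the two sets are not byte-identical copies. Size alone does not say what differs inside the files.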

JessicaMeixner-NOAA commented 1 month ago

@jiandewang /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.3448181 is still there? Not sure what was going on.

Sometimes different machines will calculate size a little differently.

jiandewang commented 1 month ago

@JessicaMeixner-NOAA I scp-ed the fix files from wcoss2 to HERA; they are not the same. You can do cmp between /scratch1/NCEPDEV/climate/Jiande.Wang/working/scratch/ocean-zero-value/oceanice_products.3448181/fixed-file-wcoss2 and /scratch1/NCEPDEV/climate/Jiande.Wang/working/scratch/ocean-zero-value/oceanice_products.3448181
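The same byte-for-byte comparison that `cmp` does can be scripted across every same-named file in the two directories. This is a minimal sketch using the standard-library `filecmp` module; `differing_files` is a hypothetical helper, and the demo uses throwaway files rather than the real fix files.

```python
import filecmp
import os
import tempfile


def differing_files(dir_a, dir_b, suffix=".nc"):
    """Return names of same-named files whose bytes differ between two dirs."""
    names = sorted(
        n for n in os.listdir(dir_a)
        if n.endswith(suffix) and os.path.isfile(os.path.join(dir_b, n))
    )
    return [
        n for n in names
        # shallow=False forces a content comparison, not just a stat() check
        if not filecmp.cmp(os.path.join(dir_a, n), os.path.join(dir_b, n), shallow=False)
    ]


# Self-contained demo with throwaway files instead of the real fix files.
with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    for d, payload in ((a, b"weights-v1"), (b, b"weights-v2")):
        with open(os.path.join(d, "changed.nc"), "wb") as f:
            f.write(payload)
        with open(os.path.join(d, "same.nc"), "wb") as f:
            f.write(b"identical")
    result = differing_files(a, b)
    print(result)
```

Pointing `dir_a` and `dir_b` at the two directories named above would list which weight files actually differ.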

JessicaMeixner-NOAA commented 1 month ago

> Thank you @jiandewang. It looks like the case you tested is a 00Z run (2019120300). It will be interesting to see if @JessicaMeixner-NOAA also gets reasonable gridded ocean output for the C48 CI test case (12Z run).

@EricSinsky-NOAA - I did look at the output from the cycled test that sparked this issue and 00z is 0s as well for that run.

jiandewang commented 1 month ago

@EricSinsky-NOAA will you be able to repeat your run on HERA but use the fix files I just copied from wcoss2? They are at /scratch1/NCEPDEV/climate/Jiande.Wang/working/scratch/ocean-zero-value/oceanice_products.3448181/fixed-file-wcoss2

EricSinsky-NOAA commented 1 month ago

@jiandewang Sure, I'll rerun using the fixed files from /scratch1/NCEPDEV/climate/Jiande.Wang/working/scratch/ocean-zero-value/oceanice_products.3448181/fixed-file-wcoss2.

EricSinsky-NOAA commented 1 month ago

@jiandewang Do you have equivalent fix files for mx500 resolution? The CI test case I have been running has MOM6 at mx500.

jiandewang commented 1 month ago

@EricSinsky-NOAA no I don't have. HR3 is only for 025 ocean

EricSinsky-NOAA commented 1 month ago

@jiandewang After looking more closely at the fix files you are using for your HR3 runs, it looks like you are using an older version of the fix files from 20231219. Your fix files are identical in size to those found here in glopara: /scratch1/NCEPDEV/global/glopara/fix/mom6/20231219/post/mx025/ (/lfs/h2/emc/global/noscrub/emc.global/FIX/fix/mom6/20231219/post/mx025/ on WCOSS2)

In my test runs (and I believe @JessicaMeixner-NOAA's test runs too), I have been using a newer version of the fix files from 20240416, which are found here in glopara: /scratch1/NCEPDEV/global/glopara/fix/mom6/20240416/post/mx025/