NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

gdasatmanlprod fails in C48mx500_3DVarAOWCDA CI on Hera #2669

Open RussTreadon-NOAA opened 3 weeks ago

RussTreadon-NOAA commented 3 weeks ago

What is wrong?

The 20210324 18Z gdasatmanlprod fails when running C48mx500_3DVarAOWCDA CI on Hera. Investigation of the failure (see PR #2641) points to nonphysical values in the precipitation field.

What should have happened?

gdasatmanlprod should run to completion without error

What machines are impacted?

Hera

Steps to reproduce

  1. install develop at 9caa51de
  2. set up and run C48mx500_3DVarAOWCDA CI
  3. 20210324 18Z gdasatmanlprod fails

Additional information

The specific log file in the run directory containing the error is mpmd.${jobid}.21.out

21: + interp_atmos_master.sh[44](tmpfile_anl_22): wgrib2 tmpfile_anl_22 -set_grib_type same -set_bitmap 1 -set_grib_max_bits 16 -new_grid_winds earth -new_grid_interpolation bilinear -if ':(CSNOW|CRAIN|CFRZR|CICEP|ICSEV):' -new_grid_interpolation neighbor -fi -if ':(APCP|ACPCP|PRATE|CPRAT|DZDT):' -new_grid_interpolation budget -fi -if ':(APCP|ACPCP|PRATE|CPRAT):' -set_grib_max_bits 25 -fi -new_grid latlon 0:1440:0.25 90:721:-0.25 pgb2file_anl_22_0p25 -new_grid latlon 0:720:0.5 90:361:-0.5 pgb2file_anl_22_0p50 -new_grid latlon 0:360:1.0 90:181:-1.0 pgb2file_anl_22_1p00
21:
21: *** FATAL ERROR: add_many_bitstream: n_bits = (26) ***
21:
21: 1:0:d=2021032418:MSLET:mean sea level:anl:
21: 2:27834:d=2021032418:HGT:1000 mb:anl:
21: 3:54708:d=2021032418:PRES:surface:anl:
21: 4:81277:d=2021032418:HGT:surface:anl:
21: + interp_atmos_master.sh[1](tmpfile_anl_22): postamble interp_atmos_master.sh 1717869240 8

The failure occurs while processing record 5, PRATE. The non-physical values apparently need more bits to encode than the 25-bit limit set for precipitation fields via -set_grib_max_bits 25, which is consistent with the "add_many_bitstream: n_bits = (26)" error above. Below is a wgrib2 -V listing for PRATE.

5:97347:vt=2021032418:surface:anl:PRATE Precipitation Rate [kg/m^2/s]:
    ndata=18432:undef=4148:mean=4.25403e+16:min=-6.08487e+19:max=5.02802e+20
    grid_template=40:winds(N/S):
        Gaussian grid: (192 x 96) units 1e-06 input WE:NS output WE:SN
        number of latitudes between pole-equator=48 #points=18432
        lat 88.572166 to -88.572166
        lon 0.000000 to 358.125000 by 1.875000

The PRATE values are non-physical.
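For reference, the failing MPMD member and the record listing above can be reproduced from the run directory with something like the following sketch (wgrib2's -d option selects a record and -V prints the verbose inventory; the mpmd log naming follows the note above):

# find which MPMD member logged the wgrib2 failure
grep -l "FATAL ERROR" mpmd.*.out

# verbose inventory (min/mean/max) of record 5 (PRATE) in the intermediate
# master-grid file named in the failing wgrib2 command
wgrib2 tmpfile_anl_22 -d 5 -V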

Do you have a proposed solution?

No response

aerorahul commented 3 weeks ago

@WenMeng-NOAA Would you be able to advise on resolving this issue?

CatherineThomas-NOAA commented 3 weeks ago

We have output from a C384 3DVar WCDA run on Hera. Looking at one of the master grbanl files:

/scratch1/NCEPDEV/climate/Jiande.Wang/working/g-w-cycle/cycle/C02/COMROOT/C02/gdas.20210711/00/model_data/atmos/master/gdas.t00z.master.grb2anl

I can see reasonable values:

776:590889415:vt=2021071100:surface:anl:PRATE Precipitation Rate [kg/m^2/s]:
    ndata=1179648:undef=0:mean=9.92491e-05:min=0:max=0.06556

This experiment uses the g-w develop @ 7d2c539. It's unclear to me at the moment whether this is due to a more recent commit or whether it's resolution-based.

RussTreadon-NOAA commented 3 weeks ago

Thank you, @CatherineThomas-NOAA. This is a useful data point.

CatherineThomas-NOAA commented 3 weeks ago

An additional data point: @JessicaMeixner-NOAA is running a test for the PR to update the UFSWM. It's also run at C384. The PRATE values are reasonable.

I also noticed in the failed case that PRATE isn't the only bad value. Vegetation seems to have issues as well:

777:14387010:vt=2021032418:surface:anl:VEG Vegetation [%]:
    ndata=18432:undef=13213:mean=-6.89787e+13:min=-1.14e+18:max=9.3e+17

WenMeng-NOAA commented 3 weeks ago

@CatherineThomas-NOAA Can you provide me the data location for the bad case?

RussTreadon-NOAA commented 3 weeks ago

@WenMeng-NOAA , the job log file for the failed gdasatmanlprod is Hera /scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prwcda_dev/logs/2021032418/gdasatmanlprod.log.

You should be able to trace what you need from this and other log files in /scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prwcda_dev/logs/2021032418

CatherineThomas-NOAA commented 3 weeks ago

@WenMeng-NOAA Does the gdasatmanlprod job use the gdas.tHHz.sfcanl.nc gaussian files? If so, the problem likely originates before the post runs. The file:

/scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prwcda_dev/gdas.20210324/18/analysis/atmos/gdas.t18z.sfcanl.nc

contains values of Infinity in the tprcp and veg variables. If we look at the sfcanl tile files

/scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prwcda_dev/gdas.20210324/18/model_data/atmos/restart/20210324.180000.sfcanl_data.tile1.nc

the values appear reasonable for tprcp and vfrac. Now looking at the log for creating the gaussian sfcanl

/scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prwcda_dev/logs/2021032418/gdasanalcalc.log

the read values are also reasonable:

0:  - TPRCP:   2.807180728647230E-003  0.000000000000000E+000
0:  - VFRAC:   0.958607382950236       0.000000000000000E+000

Do we have another case of mismatched fix files and masks? We spent some time digging into global_cycle, but not into gaussian_sfcanl. If this is the problem though, why would these be the only two variables affected? Vegetation makes sense for a land mask issue, but precip does not.
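For reference, a quick spot check of both files (a sketch, assuming the netCDF ncdump utility is available; grep simply counts lines containing Infinity):

# gaussian surface analysis: tprcp and veg contain Infinity values
ncdump -v tprcp,veg /scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prwcda_dev/gdas.20210324/18/analysis/atmos/gdas.t18z.sfcanl.nc | grep -c -i infinity

# corresponding tile restart: tprcp and vfrac look reasonable, so expect 0 here
ncdump -v tprcp,vfrac /scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prwcda_dev/gdas.20210324/18/model_data/atmos/restart/20210324.180000.sfcanl_data.tile1.nc | grep -c -i infinity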

@GeorgeGayno-NOAA Any ideas?

WenMeng-NOAA commented 3 weeks ago

@CatherineThomas-NOAA Yes, the post job reads gdas.tHHz.sfcanl.nc on the gaussian grid. The UPP reads 'tprcp' and 'veg' directly from the model output and writes them out in grib2.

GeorgeGayno-NOAA commented 3 weeks ago

(replying to @CatherineThomas-NOAA's comment above about the sfcanl files)

No. Let me take a look at these files. Don't delete them.

GeorgeGayno-NOAA commented 3 weeks ago

Regarding the gaussian file referenced above,

/scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prwcda_dev/gdas.20210324/18/analysis/atmos/gdas.t18z.sfcanl.nc

I am seeing bad values in multiple records. Clearly something is wrong.

CatherineThomas-NOAA commented 1 week ago

I set up a C48 test case for debugging the gaussian sfcanl files. The task that creates these files is the analcalc step, which runs the gaussian_sfcanl utility from gfs_utils.

I was able to reproduce the issue where Infinity is seen in veg and tprcp. Looking at the gaussian_sfcanl.f90 source code, I see where these variables are allocated:

 allocate(gaussian_data%vfrac(igaus*jgaus))

Values are then accumulated into gaussian_data from tile_data, with the previous value of gaussian_data appearing on the right-hand side:

   gaussian_data%vfrac(row(i))  = gaussian_data%vfrac(row(i)) + s(i)*tile_data%vfrac(col(i))

However, the gaussian_data variable is not initialized before being used in this statement. I added some print statements after allocation and found that occasionally the initial value of gaussian_data%vfrac was unreasonable:

0:  i, s(i), row(i),col(i), gaussian before     10923   1.00000000000000
0:      10923       681  4.221828786676957E+208

I added a simple initialization for vfrac only and found that the resulting veg variable in the sfcanl.nc file no longer had the infinity values seen previously. You can compare the veg values in these two gaussian sfcanl files:

/scratch1/NCEPDEV/da/Catherine.Thomas/git/global-workflow/c48test/c48test/COMROOT/c48test/gdas.20210324/18/analysis/atmos/
gdas.t18z.sfcanl.nc.noinit
gdas.t18z.sfcanl.nc.initveg
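A minimal way to compare them (a sketch, again assuming ncdump is available) is to count the lines containing Infinity in the veg field of each file:

cd /scratch1/NCEPDEV/da/Catherine.Thomas/git/global-workflow/c48test/c48test/COMROOT/c48test/gdas.20210324/18/analysis/atmos
ncdump -v veg gdas.t18z.sfcanl.nc.noinit  | grep -c -i infinity   # nonzero without initialization
ncdump -v veg gdas.t18z.sfcanl.nc.initveg | grep -c -i infinity   # expected to be 0 with vfrac initialized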

RussTreadon-NOAA commented 1 week ago

I forgot about this issue. I ran the C48mx500_3DVarAOWCDA CI case twice this weekend on Hera. Neither run encountered a failure in the 20210324 18Z gdasatmanlprod. These runs were made while testing PR #2700.

A check of veg and tprcp in /scratch1/NCEPDEV/stmp2/role.jedipara/COMROOT/pr2700_wcda/gdas.20210324/18/analysis/atmos/gdas.t18z.sfcanl.nc finds odd tprcp values such as

    -32.89752, 1.168709e-13, 0, 7.658835e-09, Infinityf, -0, -Infinityf,
    Infinityf, -0, -Infinityf, 6.067901e+20, -0, Infinityf, -0, 0, -0,
    -Infinityf, -Infinityf, 0, 3.976473e+25, 0, -0, Infinityf, -Infinityf,
    Infinityf, 0, -1.783102e-18, 2155648, -Infinityf, 9.258458e+15,

and veg values of

  0, 0, 0, 0, 0, 46.79983, 57.70362, 57.70362, Infinityf, 64.81725, 70.34476,
    61.73597, 52.91533, 81.23689, 77.26335, 2.604275e+32, 73.47117, 75.03622,
    79.27757, 81.61945, 82.1152, 91.14878, 80.44361, 0, 0, 0, Infinityf, 0,

A wgrib2 check of /scratch1/NCEPDEV/stmp2/role.jedipara/COMROOT/pr2700_wcda/gdas.20210324/18/model_data/atmos/master/gdas.t18z.master.grb2anl finds extreme min and max values for PRATE

776:14382912:vt=2021032418:surface:anl:PRATE Precipitation Rate [kg/m^2/s]:
    ndata=18432:undef=4713:mean=-3.91019e+13:min=-1.02901e+17:max=1.02911e+17
    grid_template=40:winds(N/S):
        Gaussian grid: (192 x 96) units 1e-06 input WE:NS output WE:SN
        number of latitudes between pole-equator=48 #points=18432
        lat 88.572166 to -88.572166
        lon 0.000000 to 358.125000 by 1.875000

Despite this, the 18Z gdasatmanlprod ran to completion as shown below by the end of /scratch1/NCEPDEV/stmp2/role.jedipara/COMROOT/pr2700_wcda/logs/2021032418/gdasatmanlprod.log

+ JGLOBAL_ATMOS_PRODUCTS[44]: rm -rf /scratch1/NCEPDEV/stmp2/role.jedipara/RUNDIRS/pr2700_wcda/atmos_products.2844626
+ JGLOBAL_ATMOS_PRODUCTS[47]: exit 0
+ JGLOBAL_ATMOS_PRODUCTS[1]: postamble JGLOBAL_ATMOS_PRODUCTS 1719148988 0
+ preamble.sh[70]: set +x
End JGLOBAL_ATMOS_PRODUCTS at 13:24:24 with error code 0 (time elapsed: 00:01:16)
+ atmos_products.sh[27]: exit 0
+ atmos_products.sh[1]: postamble atmos_products.sh 1719148984 0
+ preamble.sh[70]: set +x
End atmos_products.sh at 13:24:24 with error code 0 (time elapsed: 00:01:20)
_______________________________________________________________
Start Epilog on node h10c12 for job 62300616 :: Sun Jun 23 13:24:24 UTC 2024
Job 62300616 finished for user role.jedipara in partition hera with exit code 0:0
_______________________________________________________________
End Epilogue Sun Jun 23 13:24:24 UTC 2024

CatherineThomas-NOAA commented 1 week ago

@RussTreadon-NOAA - Interesting that the atmanlprod job does not necessarily fail when the values in the sfcanl.nc file are Infinity. I'm glad it did at least once so we could find this issue and get it fixed. I've opened an issue in gfs-utils.

RussTreadon-NOAA commented 1 week ago

@CatherineThomas-NOAA , the pass behavior in the Hera role.jedipara test puzzles me given what previously happened.

DavidHuber-NOAA commented 1 week ago

@RussTreadon-NOAA @CatherineThomas-NOAA Just adding a note that I ran into this issue on Friday while testing #2672 directly on Hera, but the CI tests passed successfully when launched from GitHub. Perhaps it is an intermittent issue?

RussTreadon-NOAA commented 1 week ago

This is useful information, @DavidHuber-NOAA. CI triggered in different ways or run by different users yields different results. We could, indeed, be dealing with an intermittent problem. Has anyone observed gdasatmanlprod failures on other machines?

DavidHuber-NOAA commented 1 week ago

@RussTreadon-NOAA Not that I am aware of, but I don't know if there are initial conditions available on other machines to run the AOWCDA case.

RussTreadon-NOAA commented 1 week ago

C48mx500_3DVarAOWCDA set up on Hercules. Unfortunately, the 20210324 12Z half-cycle gdasfcst fails with the following traceback:

 0:  in fv3cap init, time wrtcrt/regrdst  0.452040272299200
 0:  in fv3 cap init, output_startfh=   6.000000      iau_offset=           6
 0:  output_fh=   6.000000       9.000000       12.00000       15.00000
 0:  lflname_fulltime= F
 0:  fcst_advertise, cpl_grid_id=           1
12:  mesh file for mom6 domain is mesh.mx500.nc
 0:  fcst_realize, cpl_grid_id=           1
 0:  zeroing coupling accumulated fields at kdt=           10
 0:  zeroing coupling accumulated fields at kdt=           10
 0: -------->(med_phases_restart_read) mediating for: 2021  3 24  9  0  0   0
 1: Abort(1) on node 1 (rank 1 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 1
 5: Abort(1) on node 5 (rank 5 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 5
 3: Abort(1) on node 3 (rank 3 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 3
 4: Abort(1) on node 4 (rank 4 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 4
 2: Abort(1) on node 2 (rank 2 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 2
 0: Abort(1) on node 0 (rank 0 in comm 496): application called MPI_Abort(comm=0x84000003, 1) - process 0
srun: error: hercules-01-32: tasks 0-79: Exited with exit code 1
srun: Terminating StepId=1590474.0

C48mx500_3DVarAOWCDA set up on Dogwood. The 20210324 12Z half-cycle gdasfcst runs to completion on Dogwood. This issue will be updated once the Dogwood run reaches gdasatmanlprod.

Why does 20210324 12Z gdasfcst fail on Hercules? Has anyone else observed this?

RussTreadon-NOAA commented 1 week ago

Attempts to run C48mx500_3DVarAOWCDA on Dogwood fail with

  File "/lfs/h2/emc/da/noscrub/russ.treadon/git/global-workflow/rename_atm/ush/python/wxflow/fsutils.py", line 87, in cp
    raise OSError(f"unable to copy {source} to {target}")
OSError: unable to copy /scratch2/NCEPDEV/ocean/Guillaume.Vernieres/data/static/72x35x25/soca/rossrad.nc to /lfs/h2/emc/stmp/russ.treadon/RUNDIRS/pr2700_wcda/gdasocnanal_18/rossrad.nc
+ JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP[47]: status=1

in 20210324 18Z gdasocnanalprep. File parm/config/gfs/yaml/defaults.yaml uses a Hera-specific path for SOCA_INPUT_FIX_DIR.
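The hardwired path can be confirmed from a workflow clone with, for example (a sketch):

grep -n SOCA_INPUT_FIX_DIR parm/config/gfs/yaml/defaults.yaml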

I forgot that this problem has already been reported. See g-w issue #2683.