@WenMeng-NOAA Would you be able to advise on this issue resolution?
We have output from a C384 3DVar WCDA run on Hera. Looking at one of the master grbanl files:
/scratch1/NCEPDEV/climate/Jiande.Wang/working/g-w-cycle/cycle/C02/COMROOT/C02/gdas.20210711/00/model_data/atmos/master/gdas.t00z.master.grb2anl
I can see reasonable values:
776:590889415:vt=2021071100:surface:anl:PRATE Precipitation Rate [kg/m^2/s]:
ndata=1179648:undef=0:mean=9.92491e-05:min=0:max=0.06556
This experiment uses the g-w develop @ 7d2c539. It's unclear to me at the moment whether this is due to a more recent commit or whether it's resolution-based.
Thank you, @CatherineThomas-NOAA. This is a useful data point.
An additional data point: @JessicaMeixner-NOAA is running a test for the PR to update the UFSWM. It's also run at C384. The PRATE values are reasonable.
I also noticed in the failed case that PRATE isn't the only bad value. Vegetation seems to have issues as well:
777:14387010:vt=2021032418:surface:anl:VEG Vegetation [%]:
ndata=18432:undef=13213:mean=-6.89787e+13:min=-1.14e+18:max=9.3e+17
@CatherineThomas-NOAA Can you provide me the data location for the bad case?
@WenMeng-NOAA, the job log file for the failed gdasatmanlprod on Hera is
/scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prwcda_dev/logs/2021032418/gdasatmanlprod.log
You should be able to trace what you need from this and the other log files in /scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prwcda_dev/logs/2021032418
@WenMeng-NOAA Does the gdasatmanlprod job use the gdas.tHHz.sfcanl.nc gaussian files? If so, the problem likely originates before the post runs. The file:
/scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prwcda_dev/gdas.20210324/18/analysis/atmos/gdas.t18z.sfcanl.nc
contains values of Infinity
in the tprcp and veg variables. If we look at the sfcanl tile files
/scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prwcda_dev/gdas.20210324/18/model_data/atmos/restart/20210324.180000.sfcanl_data.tile1.nc
the values appear reasonable for tprcp and vfrac. Now looking at the log for creating the gaussian sfcanl
/scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prwcda_dev/logs/2021032418/gdasanalcalc.log
the read values are also reasonable:
0: - TPRCP: 2.807180728647230E-003 0.000000000000000E+000
0: - VFRAC: 0.958607382950236 0.000000000000000E+000
Do we have another case of mismatched fix files and masks? We spent some time digging into global_cycle, but not into gaussian_sfcanl. If this is the problem though, why would these be the only two variables affected? Vegetation makes sense for a land mask issue, but precip does not.
@GeorgeGayno-NOAA Any ideas?
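A minimal sketch of this kind of non-finite check on the gaussian sfcanl file is shown below. It assumes the variable name tprcp and that the field's first two dimensions are the horizontal grid; neither was verified against the file, and the same check can be repeated with veg for the vegetation field.
program scan_nonfinite
  ! Minimal sketch: count Inf/NaN values in one field of the gaussian sfcanl
  ! file. The field name and 2D horizontal layout are assumptions (see above).
  use netcdf
  use ieee_arithmetic, only: ieee_is_finite
  implicit none
  integer :: ncid, varid, nx, ny, nbad
  integer :: dimids(nf90_max_var_dims)
  real, allocatable :: field(:,:)

  call check( nf90_open("gdas.t18z.sfcanl.nc", nf90_nowrite, ncid) )
  call check( nf90_inq_varid(ncid, "tprcp", varid) )            ! or "veg"
  call check( nf90_inquire_variable(ncid, varid, dimids=dimids) )
  call check( nf90_inquire_dimension(ncid, dimids(1), len=nx) )
  call check( nf90_inquire_dimension(ncid, dimids(2), len=ny) )
  allocate(field(nx,ny))
  call check( nf90_get_var(ncid, varid, field) )                ! first (or only) time level
  nbad = count(.not. ieee_is_finite(field))
  print *, 'non-finite points:', nbad, 'of', nx*ny
  call check( nf90_close(ncid) )
contains
  subroutine check(status)
    integer, intent(in) :: status
    if (status /= nf90_noerr) then
      print *, trim(nf90_strerror(status))
      stop 1
    end if
  end subroutine check
end program scan_nonfinite
Pointed at the sfcanl.nc file above, a check like this should report a nonzero count whenever the Infinity values described in this thread are present.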
@CatherineThomas-NOAA Yes, the post job reads gdas.tHHz.sfcanl.nc on the gaussian grid. The UPP reads 'tprcp' and 'veg' directly from the model output and writes them out in grib2.
No. Let me take a look at these files. Don't delete them.
In the file quoted above,
/scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prwcda_dev/gdas.20210324/18/analysis/atmos/gdas.t18z.sfcanl.nc
I am seeing bad values in multiple records. Clearly something is wrong.
I set up a C48 test case for debugging the gaussian sfcanl files. The task that creates these files is the analcalc step, which runs the gaussian_sfcanl utility in gfs_utils.
I was able to reproduce the issue where Infinity is seen in veg and tprcp. Looking at the gaussian_sfcanl.f90 source code, I see where these variables are allocated:
allocate(gaussian_data%vfrac(igaus*jgaus))
Then values are assigned to gaussian_data from tile_data, including the previous value of gaussian_data:
gaussian_data%vfrac(row(i)) = gaussian_data%vfrac(row(i)) + s(i)*tile_data%vfrac(col(i))
However, the gaussian_data variable is not initialized before being used in this statement. I added some print statements after allocation and found that occasionally the initial value of gaussian_data%vfrac was unreasonable:
0: i, s(i), row(i),col(i), gaussian before 10923 1.00000000000000
0: 10923 681 4.221828786676957E+208
I added a simple initialization for vfrac only and found that the resulting veg variable in the sfcanl.nc file no longer had the Infinity values seen previously. You can compare the veg values in these two gaussian sfcanl files:
/scratch1/NCEPDEV/da/Catherine.Thomas/git/global-workflow/c48test/c48test/COMROOT/c48test/gdas.20210324/18/analysis/atmos/
gdas.t18z.sfcanl.nc.noinit
gdas.t18z.sfcanl.nc.initveg
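Below is a minimal sketch of the kind of initialization described, built around the allocation and accumulation lines quoted above. The tprcp accumulation array name (gaussian_data%tprcp) is assumed by analogy with vfrac and does not appear in the excerpt; this is an illustration, not the exact gfs-utils patch.
allocate(gaussian_data%vfrac(igaus*jgaus))
allocate(gaussian_data%tprcp(igaus*jgaus))   ! name assumed by analogy; not shown in the excerpt above
! Zero the accumulation targets so the interpolation sums below start from a
! defined value instead of whatever happens to be in freshly allocated memory.
gaussian_data%vfrac = 0.0
gaussian_data%tprcp = 0.0
! ... later, inside the interpolation loop ...
gaussian_data%vfrac(row(i)) = gaussian_data%vfrac(row(i)) + s(i)*tile_data%vfrac(col(i))
With the accumulators zeroed, a point that receives no contribution stays at zero rather than carrying leftover memory contents, which appears to be what showed up downstream as Infinity in veg and tprcp.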
I forgot about this issue. I ran the C48mx500_3DVarAOWCDA CI case twice this weekend on Hera. Neither run encountered a failure in the 20210324 18Z gdasatmanlprod. These runs were made while testing PR #2700.
A check of veg and tprcp in /scratch1/NCEPDEV/stmp2/role.jedipara/COMROOT/pr2700_wcda/gdas.20210324/18/analysis/atmos/gdas.t18z.sfcanl.nc finds odd tprcp values such as
-32.89752, 1.168709e-13, 0, 7.658835e-09, Infinityf, -0, -Infinityf,
Infinityf, -0, -Infinityf, 6.067901e+20, -0, Infinityf, -0, 0, -0,
-Infinityf, -Infinityf, 0, 3.976473e+25, 0, -0, Infinityf, -Infinityf,
Infinityf, 0, -1.783102e-18, 2155648, -Infinityf, 9.258458e+15,
and veg values of
0, 0, 0, 0, 0, 46.79983, 57.70362, 57.70362, Infinityf, 64.81725, 70.34476,
61.73597, 52.91533, 81.23689, 77.26335, 2.604275e+32, 73.47117, 75.03622,
79.27757, 81.61945, 82.1152, 91.14878, 80.44361, 0, 0, 0, Infinityf, 0,
A wgrib2 check of /scratch1/NCEPDEV/stmp2/role.jedipara/COMROOT/pr2700_wcda/gdas.20210324/18/model_data/atmos/master/gdas.t18z.master.grb2anl finds extreme min and max values for PRATE
776:14382912:vt=2021032418:surface:anl:PRATE Precipitation Rate [kg/m^2/s]:
ndata=18432:undef=4713:mean=-3.91019e+13:min=-1.02901e+17:max=1.02911e+17
grid_template=40:winds(N/S):
Gaussian grid: (192 x 96) units 1e-06 input WE:NS output WE:SN
number of latitudes between pole-equator=48 #points=18432
lat 88.572166 to -88.572166
lon 0.000000 to 358.125000 by 1.875000
Despite this, the 18Z gdasatmanlprod ran to completion as shown below by the end of /scratch1/NCEPDEV/stmp2/role.jedipara/COMROOT/pr2700_wcda/logs/2021032418/gdasatmanlprod.log
+ JGLOBAL_ATMOS_PRODUCTS[44]: rm -rf /scratch1/NCEPDEV/stmp2/role.jedipara/RUNDIRS/pr2700_wcda/atmos_products.2844626
+ JGLOBAL_ATMOS_PRODUCTS[47]: exit 0
+ JGLOBAL_ATMOS_PRODUCTS[1]: postamble JGLOBAL_ATMOS_PRODUCTS 1719148988 0
+ preamble.sh[70]: set +x
End JGLOBAL_ATMOS_PRODUCTS at 13:24:24 with error code 0 (time elapsed: 00:01:16)
+ atmos_products.sh[27]: exit 0
+ atmos_products.sh[1]: postamble atmos_products.sh 1719148984 0
+ preamble.sh[70]: set +x
End atmos_products.sh at 13:24:24 with error code 0 (time elapsed: 00:01:20)
_______________________________________________________________
Start Epilog on node h10c12 for job 62300616 :: Sun Jun 23 13:24:24 UTC 2024
Job 62300616 finished for user role.jedipara in partition hera with exit code 0:0
_______________________________________________________________
End Epilogue Sun Jun 23 13:24:24 UTC 2024
@RussTreadon-NOAA - Interesting that the atmanlprod job does not necessarily fail when the values in the sfcanl.nc file are Infinity. I'm glad it did at least once so we could find this issue and get it fixed. I've opened an issue in gfs-utils.
@CatherineThomas-NOAA, the passing behavior in the Hera role.jedipara test puzzles me given what happened previously.
@RussTreadon-NOAA @CatherineThomas-NOAA Just adding a note that I ran into this issue on Friday while testing #2672 directly on Hera, but the CI tests passed successfully when launched from GitHub. Perhaps it is an intermittent issue?
This is useful information @DavidHuber-NOAA. CI triggered in different ways or run by different users yields different results. We could, indeed, be dealing with an intermittent problem. Has anyone observed gdasatmanlprod failures on other machines?
@RussTreadon-NOAA Not that I am aware of, but I don't know if there are initial conditions available on other machines to run the AOWCDA case.
C48mx500_3DVarAOWCDA set up on Hercules. Unfortunately the 20210324 12Z half-cycle gdasfcst fails with the following traceback
0: in fv3cap init, time wrtcrt/regrdst 0.452040272299200
0: in fv3 cap init, output_startfh= 6.000000 iau_offset= 6
0: output_fh= 6.000000 9.000000 12.00000 15.00000
0: lflname_fulltime= F
0: fcst_advertise, cpl_grid_id= 1
12: mesh file for mom6 domain is mesh.mx500.nc
0: fcst_realize, cpl_grid_id= 1
0: zeroing coupling accumulated fields at kdt= 10
0: zeroing coupling accumulated fields at kdt= 10
0: -------->(med_phases_restart_read) mediating for: 2021 3 24 9 0 0 0
1: Abort(1) on node 1 (rank 1 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 1
5: Abort(1) on node 5 (rank 5 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 5
3: Abort(1) on node 3 (rank 3 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 3
4: Abort(1) on node 4 (rank 4 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 4
2: Abort(1) on node 2 (rank 2 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 2
0: Abort(1) on node 0 (rank 0 in comm 496): application called MPI_Abort(comm=0x84000003, 1) - process 0
srun: error: hercules-01-32: tasks 0-79: Exited with exit code 1
srun: Terminating StepId=1590474.0
C48mx500_3DVarAOWCDA set up on Dogwood. The 20210324 12Z half-cycle gdasfcst runs to completion on Dogwood. This issue will be updated once the Dogwood run reaches gdasatmanlprod.
Why does 20210324 12Z gdasfcst fail on Hercules? Has anyone else observed this?
Attempts to run C48mx500_3DVarAOWCDA on Dogwood fail with
File "/lfs/h2/emc/da/noscrub/russ.treadon/git/global-workflow/rename_atm/ush/python/wxflow/fsutils.py", line 87, in cp
raise OSError(f"unable to copy {source} to {target}")
OSError: unable to copy /scratch2/NCEPDEV/ocean/Guillaume.Vernieres/data/static/72x35x25/soca/rossrad.nc to /lfs/h2/emc/stmp/russ.treadon/RUNDIRS/pr2700_wcda/gdasocnanal_18/rossrad.nc
+ JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP[47]: status=1
in 20210324 18Z gdasocnanalprep. File parm/config/gfs/yaml/defaults.yaml uses a Hera-specific path for SOCA_INPUT_FIX_DIR.
I forgot that this problem has already been reported. See g-w issue #2683.
Ran three tests on Hera; all showed invalid PRATE values at the same locations in the gdas.t18z.master.grb2anl GRIB2 file. Traced this back to the sfcanl netCDF input file to the gdasatmanlupp job and found multiple instances of Infinityf and -Infinityf in the tprcp field (total precipitation). Will continue to trace this back to the gdassfcanl job.
No need @DavidHuber-NOAA. It's being addressed in https://github.com/NOAA-EMC/gfs-utils/issues/71
Great, thank you @CatherineThomas-NOAA!
Will issue a PR soon.
What is wrong?
The 20210324 18Z gdasatmanlprod fails when running C48mx500_3DVarAOWCDA CI on Hera. Investigation of the failure (see PR #2641) points to nonphysical values in the precipitation field.
What should have happened?
gdasatmanlprod should run to completion without error
What machines are impacted?
Hera
Steps to reproduce
Run the C48mx500_3DVarAOWCDA CI case on Hera with g-w develop at 9caa51de.
Additional information
The specific log file in the run directory containing the error is mpmd.${jobid}.21.out
The failure occurs while processing record 5, PRATE. Below is a wgrib2 -V listing for PRATE. The PRATE values are non-physical.
Do you have a proposed solution?
No response