E3SM-Project / scream

Fork of E3SM used to develop an exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

NaN in T_2m for ne1024 on frontier #2391

Open ndkeen opened 1 year ago

ndkeen commented 1 year ago

After 35 days of simulation, a ne1024 case running on frontier crashed.

model date =   20190905

Atmosphere step = 30520
  model time = 2019-09-05 07:46:40
 8723: terminate called after throwing an instance of 'std::logic_error'
 8723:   what():  /lustre/orion/cli115/proj-shared/noel/wacmy/machines_frontier/components/eamxx/src/share/atm_process/atmosphere_process.cpp:432: FAIL:
 8723: false
 8723: Error! Failed post-condition property check (cannot be repaired).
 8723:   - Atmosphere process name: SurfaceCouplingImporter
 8723:   - Property check name: NaN check for field T_2m
 8723:   - Atmosphere process MPI Rank: 8723
 8723:   - Message: FieldNaNCheck failed.
 8723:   - field id: T_2m[Physics PG2] <double:ncol>(1536) [K]
 8723:   - entry (16066370)
 8723:   - lat/lon: (29.664181, 265.847168)

/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/mf00/t.machines_frontier.F2010-SCREAMv1.ne1024pg2_ne1024pg2.frontier-scream-gpu.n2048t8x6.vth200.C.O1
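
(For context on what the failure above means: the post-condition that trips is a NaN scan on the exported T_2m field, reporting the first offending column. A minimal sketch of that kind of check, using hypothetical names rather than the actual eamxx FieldNaNCheck code, is below.)

```cpp
// Illustrative sketch only -- not the eamxx implementation.
// Scan a 1d field for NaN and report the first bad column index,
// mirroring the "NaN check for field T_2m" post-condition above.
#include <cmath>
#include <cstdio>
#include <vector>

// Returns the index of the first NaN entry, or -1 if the field is clean.
int first_nan(const std::vector<double>& field) {
  for (std::size_t i = 0; i < field.size(); ++i) {
    if (std::isnan(field[i])) return static_cast<int>(i);
  }
  return -1;
}

int main() {
  std::vector<double> T_2m(1536, 290.0);  // ncol = 1536, as in the log
  T_2m[225] = std::nan("");               // column 225 goes bad, as in the dump
  if (const int bad = first_nan(T_2m); bad >= 0) {
    std::printf("FieldNaNCheck failed: T_2m(%d) is NaN\n", bad);
    return 1;  // the real property check throws and aborts the run instead
  }
  return 0;
}
```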
oksanaguba commented 1 year ago

This run seems to be using the IC files from Chris. These runs have had a lot of warnings about cold T at the bottom level from the very beginning:

2937: WARNING:CAAR: k=128,theta(k)=94.938653<100.000000=th_thresh, applying limiter 
2937: WARNING:CAAR: k=128,theta(k)=99.903874<100.000000=th_thresh, applying limiter 
2937: WARNING:CAAR: k=128,theta(k)=99.903874<100.000000=th_thresh, applying limiter 
2937: WARNING:CAAR: k=128,theta(k)=98.570180<100.000000=th_thresh, applying limiter 

and the number of warnings only grows from log file to log file. Is the issue with the IC, or should we try lowering the dynamics dt, tuning HV, or other diffusive mechanisms?
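
(The WARNING:CAAR lines above come from a lower-bound limiter on potential temperature at the bottom level. A minimal sketch of such a clamp, assuming a simple reset-to-threshold rather than the actual Homme/theta-l limiter, is below.)

```cpp
// Illustrative sketch only -- not the actual Homme/theta-l limiter.
// Clamp potential temperature to a lower threshold and warn, which is
// what the WARNING:CAAR messages above indicate is happening.
#include <cstdio>

int main() {
  const double th_thresh = 100.0;  // threshold from the log messages [K]
  double theta[] = {94.938653, 99.903874, 310.0, 98.570180};
  const int nlev = sizeof(theta) / sizeof(theta[0]);
  for (int k = 0; k < nlev; ++k) {
    if (theta[k] < th_thresh) {
      std::printf("WARNING:CAAR: k=%d,theta(k)=%f<%f=th_thresh, applying limiter\n",
                  k, theta[k], th_thresh);
      theta[k] = th_thresh;  // reset to the threshold value
    }
  }
  return 0;
}
```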

oksanaguba commented 1 year ago

In a run without the cess/dyamond IC changes, I only have these warnings over 9 days:

/ccs/home/onguba/eff/fgpu-build-june13b-t1-r16384/run/e3sm.log.1354477.230617-104655.gz: 6509: WARNING: Tl1_1 has 1 values <= allowable value.  Resetting to minimum value.
/ccs/home/onguba/eff/fgpu-build-june13b-t1-r16384/run/e3sm.log.1355254.230617-170829.gz:14171: WARNING: Tl1_1 has 1 values <= allowable value.  Resetting to minimum value.
/ccs/home/onguba/eff/fgpu-build-june13b-t1-r16384/run/e3sm.log.1355254.230617-170829.gz: 2269: WARNING: Tl1_1 has 1 values <= allowable value.  Resetting to minimum value.
/ccs/home/onguba/eff/fgpu-build-june13b-t1-r16384/run/e3sm.log.1355254.230617-170829.gz:14407:  WARNING: BalanceCheck: soil balance error (W/m2)
ndkeen commented 1 year ago

I was running a separate case (to test restarts and general stability) that differed only in the optimization level used in two files. That case failed in the same way, and I'd guess the two runs are likely BFB.

/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/mf00/t.machines_frontier.F2010-SCREAMv1.ne1024pg2_ne1024pg2.frontier-scream-gpu.n2048t8x6.vth200.C

They fail at the same step, with the same check in the error message, and on the same MPI rank. Atmosphere step = 30520

8723: Error! Failed post-condition property check (cannot be repaired).
 8723:   - Atmosphere process name: SurfaceCouplingImporter
 8723:   - Property check name: NaN check for field T_2m
 8723:   - Atmosphere process MPI Rank: 8723
 8723:   - Message: FieldNaNCheck failed.
 8723:   - field id: T_2m[Physics PG2] <double:ncol>(1536) [K]
 8723:   - entry (16066370)
 8723:   - lat/lon: (29.664181, 265.847168)
 8723: 
 8723:  *************************** INPUT FIELDS ******************************
 8723: 
 8723:   ------- INPUT FIELDS -------
 8723: 
 8723:  ************************** OUTPUT FIELDS ******************************
 8723:      T_2m<ncol>(1536)
 8723: 
 8723:   T_2m(225)
 8723:     nan, 
 8723:  -----------------------------------------------------------------------
 8723:      landfrac<ncol>(1536)
 8723: 
 8723:   landfrac(225)
 8723:     0.998098, 
 8723:  -----------------------------------------------------------------------
 8723:      ocnfrac<ncol>(1536)
 8723: 
 8723:   ocnfrac(225)
 8723:     0.00190223, 
 8723:  -----------------------------------------------------------------------
crterai commented 1 year ago

Something is definitely off with this simulation. Here's T_2m (note the color axis):

[image: global map of T_2m from the ne1024 run]
crterai commented 1 year ago

I think it's worth running 2-3 day tests with frequent output and checking what's going on... maybe we can try it at ne256.

crterai commented 1 year ago

Noel shared with me the ne256 simulation and it looks good: /lustre/orion/cli115/proj-shared/noel/e3sm_scratch/mf00/t.machines_frontier.F2010-SCREAMv1.ne256pg2_ne256pg2.frontier-scream-gpu.n0384t8x6.vth200.Blong/run/output.scream.monthly.AVERAGE.nmonths_x1.2019-08-01-00000.nc

[image: T_2m from the ne256 monthly output]

Thinking about the differences: the ne1024 run used monthly output with frequent restarts in between. It could be an issue with the output being written correctly rather than with the model state itself... will need to run tests.

PeterCaldwell commented 1 year ago

It's interesting that the spatial distribution between ne256 and ne1024 is ~identical; it's just the colorbar that is different. This makes me think that (as Chris surmises) the issue is in the step that divides the accumulated sum by the number of samples before writing output...
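
(A toy illustration of that hypothesis, with hypothetical names and not the eamxx output code: if the accumulated sum is written without being divided by the number of samples, the spatial pattern is preserved but every value is scaled up by the sample count, which would explain an identical map with a blown-up colorbar.)

```cpp
// Illustrative sketch only -- not the eamxx averaging/output code.
// Shows how writing the raw accumulated sum instead of the average
// keeps the spatial pattern but inflates the magnitude.
#include <cstdio>

int main() {
  const int nsamples = 720;     // e.g. samples in a monthly averaging window
  const double T_inst = 295.0;  // a "true" instantaneous 2m temperature [K]

  double accum = 0.0;
  for (int n = 0; n < nsamples; ++n) accum += T_inst;  // running sum

  const double correct_avg = accum / nsamples;  // ~295 K
  const double raw_sum     = accum;             // ~2.1e5, the suspect magnitude

  std::printf("correct avg = %g K, un-divided sum = %g\n", correct_avg, raw_sum);
  return 0;
}
```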

bartgol commented 1 year ago

@crterai are you running off of a recent master? There was an issue a while ago regarding restart of accumulated quantities, but it got fixed a few weeks ago. So unless your repo is quite old, I would not expect this.

If it is, in fact, a matter of restarts, a simple ne4 case should reproduce the same problem. Could you paste here the details of the 1024 run that had frequent restarts? I can try to dig a bit to see if there are bugs in the restart logic.

ndkeen commented 1 year ago

From what @oksanaguba said, the branch we are using here (machines/frontier) was based off a scream repo from May 19th.

bartgol commented 1 year ago

> From what @oksanaguba said, the branch we are using here (machines/frontier) was based off a scream repo from May 19th.

Ah, yes! The PR that fixed the accumulation bug went in on May 25th, so this makes sense. If merging master is not doable, the workaround is to use a restart frequency that coincides with the averaging window size: if the restart happens on a model output step, the May 19th repo should still give the correct average.
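
(A toy illustration of why the window-aligned restart avoids the bug, using hypothetical names rather than the actual eamxx restart logic: if a restart restores the accumulated sum but loses the sample count, any averaging window that straddles the restart is inflated; restarting only on an output step means both quantities are back to zero and nothing can be lost.)

```cpp
// Illustrative sketch only -- not the actual eamxx restart logic.
// A restart that carries the accumulated sum but drops the sample count
// corrupts any averaging window that straddles the restart.
#include <cstdio>

struct AvgState {
  double sum   = 0.0;  // running sum of the field over the window
  int    count = 0;    // samples accumulated so far in the window
};

void accumulate(AvgState& s, double value) { s.sum += value; ++s.count; }

int main() {
  const int window = 100;      // steps per averaging window
  const double value = 295.0;  // constant field, so the true average is 295 K

  AvgState s;
  for (int n = 0; n < 60; ++n) accumulate(s, value);  // first 60 steps

  // "Restart" mid-window: sum is restored, but the count is lost (the bug).
  AvgState restarted;
  restarted.sum = s.sum;
  restarted.count = 0;

  for (int n = 60; n < window; ++n) accumulate(restarted, value);

  std::printf("avg across a mid-window restart = %g K (expected 295)\n",
              restarted.sum / restarted.count);  // inflated: 29500/40 = 737.5

  // Workaround from the thread: restart only on an output step, where the
  // window has just been written and sum/count start from zero again.
  return 0;
}
```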

crterai commented 1 year ago

Okay, the daily output from day 2 for this case (with the new high-resolution SST file) looks reasonable for T_2m: /lustre/orion/cli115/proj-shared/noel/e3sm_scratch/maf-jun19/t.maf-jun19.F2010-SCREAMv1.ne1024pg2_ne1024pg2.frontier-scream-gpu.n2048t8x6.vth200.SSTocean.od/run/output.scream.daily.AVERAGE.ndays_x1.2019-08-02-00000.nc

[image: T_2m from the day-2 daily output]

This doesn't close the case on the NaN issue that Noel ran into, but it at least confirms that our runs look reasonable.

ndkeen commented 1 year ago

With the updated repo, which we know at least corrects the issue with monthly output, I have run ne1024 out to 71 days so far. I'm not sure that proves we no longer hit the NaN noted above, and I'm not sure there is interest in figuring out whether the longer run is actually due to the repo changes or not.