E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Test `SMS_P12x2.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods` seems to rely on (a bug in) gustiness for stability #5884

Open · quantheory opened this issue 1 year ago

quantheory commented 1 year ago

While running the developer tests for #5850, which I recently rebased onto master, I found that the test SMS_P12x2.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods is now crashing. After spending all of yesterday trying to figure this out, I realized that this test can easily be made to fail on the current master simply by turning atmospheric gustiness off. (I originally did this by setting vmag_gust = 0 in clubb_intr.F90. Setting use_sgv = .false. in the EAM namelist also causes the failure.)

The failure takes the form of an invalid operation in ELM, which immediately crashes DEBUG runs; without DEBUG, it poisons the state with a NaN that causes a crash later. This is because eflx_lwrad_out is negative, i.e. the upward longwave radiation has the wrong sign. (Possibly due to large temperature swings? Temperature drops by ~10 K in the grid cells that have negative eflx_lwrad_out.) The diagnosed surface temperature is proportional to the fourth root of this quantity, so a negative value results in NaN and a crash. eflx_lwrad_out actually goes negative in every case (even on master) within 3-4 time steps, or at least it appears to if I print it out in SoilFluxesMod. But that negative value doesn't always cause an immediate crash. (I have no idea why.) For some commits I tested, the runs crash very early, whereas if I just remove gustiness on master, it crashes near the end of the 5 day test. (This suggests that if this test were run for longer, it might actually crash on master as well!)
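To make the mechanism concrete, here is a minimal standalone Fortran sketch (not E3SM code; the names and values are made up for illustration), assuming the diagnostic is essentially the Stefan-Boltzmann inversion T = (eflx_lwrad_out/sigma)**0.25 evaluated as a nested sqrt, as in the ELM line linked later in this thread:

    program lwrad_nan_sketch
      ! Minimal sketch, not E3SM code: how a negative upward longwave flux
      ! poisons the diagnosed radiative temperature.
      implicit none
      integer, parameter :: r8 = selected_real_kind(12)
      real(r8), parameter :: sb = 5.67e-8_r8       ! Stefan-Boltzmann constant [W m-2 K-4]
      real(r8) :: eflx_lwrad_out, t_rad

      eflx_lwrad_out = 300.0_r8                    ! plausible upward LW flux [W m-2]
      t_rad = sqrt(sqrt(eflx_lwrad_out/sb))        ! ~269.6 K
      print *, 'flux =', eflx_lwrad_out, ' t_rad =', t_rad

      eflx_lwrad_out = -3.8_r8                     ! sign-flipped flux like the bad values seen here
      t_rad = sqrt(sqrt(eflx_lwrad_out/sb))        ! NaN; a DEBUG build traps this as an invalid operation
      print *, 'flux =', eflx_lwrad_out, ' t_rad =', t_rad
    end program lwrad_nan_sketch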

To me, that seems to imply that with the recent EAMv3 changes, this test case is very close to the edge of stability, close enough that removing surface gustiness can make the run unstable. I don't think it's really good to depend on the gustiness for stability in general (and as I mentioned, master may be crashing anyway if we run some of these tests longer). But I'm not sure what v3 changes have caused this problem, so I have no particular ideas about what to do.

This issue may technically not be causing crashes on master, but it is blocking #5850, so I'm filing it as a bug here.

Randomly tagging @wlin7 @mabrunke @beharrop @bishtgautam as people that may have some ideas here. I'm not sure what to do, or if this is really an EAM or an ELM problem ultimately.

Edit: I should have given this quick 3-line way to reproduce the issue, assuming that the current directory is the E3SM source root:

    mkdir components/eam/cime_config/testdefs/testmods_dirs/eam/no_sgv/
    echo "clubb_use_sgv = .false." >components/eam/cime_config/testdefs/testmods_dirs/eam/no_sgv/user_nl_eam
    cd cime/scripts; create_test SMS_P12x2.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.eam-no_sgv
rljacob commented 1 year ago

Only that case and only on that machine/compiler?

quantheory commented 1 year ago

@rljacob Only that case that I've found so far. Let me check on machine/compiler. I doubt it's machine-specific, but we'll see.

quantheory commented 1 year ago

@rljacob Reproduced this with SMS.ne4_oQU240.WCYCL1850NS.compy_intel, so neither the machine nor the PE layout matter.

mabrunke commented 1 year ago

@quantheory This is interesting. I was able to run the model for a month with my own similar set of bug fixes. Were you able to replicate the crash in a normal run of the model?

quantheory commented 1 year ago

@mabrunke I haven't tried a longer run yet, and like I said, this is the only test case I've had fail so far. Furthermore, this test was not failing two weeks ago. It's only when I rebased to include changes made since the beginning of August that the failures started, which suggests that maybe something in the v3 atm features is involved?

quantheory commented 1 year ago

@rljacob The test passes with the GNU compiler on Perlmutter, so maybe this is compiler-specific. I extended the length to a 30-day run and it still didn't manage to crash.

rljacob commented 1 year ago

Try adding the debugging flag. SMS_D_P12x2.....

quantheory commented 1 year ago

@rljacob

> Try adding the debugging flag. SMS_D_P12x2.....

That lets the GNU test run, but also lets the Intel test run on perlmutter. (But earlier Intel was still failing with DEBUG on perlmutter. The two things that have changed since then are that I'm running with #5876 merged, and turning off gustiness in a slightly different way.)

I think I need to be more systematic about this and make a table of different runs with more precisely controlled differences. But right now I can tell you this:

quantheory commented 1 year ago

Addendum: Since the GNU and Intel DEBUG tests both passed in 5 day runs on Perlmutter, I ran both tests again with _Ld10. The GNU test passed again, while the Intel test now failed on day 4. This undermines what I said before; maybe there's a stochastic error like a race condition involved here.

ndkeen commented 1 year ago

Just confirming I see an error with SMS_Ld10.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods using recent master (and pasting the error message in case it's searched on). Though SMS_Ld10_P12x2.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods did complete.

  1: SHR_REPROSUM_CALC: Input contains  0.24600E+03 NaNs and  0.00000E+00 INFs on MPI task       1
  1:  ERROR: shr_reprosum_calc ERROR: NaNs or INFs in input
  1: Image              PC                Routine            Line        Source             
  1: e3sm.exe           0000000003A6715D  shr_abort_mod_mp_         114  shr_abort_mod.F90
  1: e3sm.exe           0000000003B8A538  shr_reprosum_mod_         644  shr_reprosum_mod.F90
  1: e3sm.exe           0000000001B2002F  compose_repro_sum         436  compose_mod.F90
  1: e3sm.exe           00000000016440C8  operator()                370  compose_cedr.cpp
  1: e3sm.exe           0000000001646ECF  run_horiz_omp              21  compose_cedr_caas.cpp
  1: e3sm.exe           0000000001655E20  run_global<Kokkos          48  compose_cedr_sl_run_global.cpp
  1: e3sm.exe           00000000015FB14E  sl_advection_mp_p         258  sl_advection.F90
  1: e3sm.exe           00000000015D1AC5  prim_driver_base_        1428  prim_driver_base.F90
  1: e3sm.exe           0000000001B1046A  dyn_comp_mp_dyn_r         401  dyn_comp.F90
  1: e3sm.exe           000000000152307E  stepon_mp_stepon_         582  stepon.F90
  1: e3sm.exe           000000000053CD94  cam_comp_mp_cam_r         352  cam_comp.F90
  1: e3sm.exe           000000000052C582  atm_comp_mct_mp_a         583  atm_comp_mct.F90
  1: e3sm.exe           0000000000446D7E  component_mod_mp_         757  component_mod.F90
  1: e3sm.exe           0000000000426D34  cime_comp_mod_mp_        3112  cime_comp_mod.F90
  1: e3sm.exe           0000000000446A12  MAIN__                    153  cime_driver.F90

/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/me24-aug15/SMS_Ld10.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods.rgh5884

The above test was using 128x1. These tests also fail in what looks like the same way:

SMS_P12x1_Ld10.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods
SMS_P64x1_Ld10.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods

SMS_P128x1_Ld10.ne4_oQU240.WCYCL1850NS.pm-cpu_intel
SMS_P64x1_Ld10.ne4_oQU240.WCYCL1850NS.pm-cpu_intel

And these pass:

SMS_D_P64x2_Ld5.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods
SMS_P64x2_Ld10.ne4_oQU240.WCYCL1850NS.pm-cpu_intel
SMS_P64x2_Ld10.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods
quantheory commented 1 year ago

@ndkeen Is this just using master out of the box, or have you made any change to gustiness or other settings?

ndkeen commented 1 year ago

Master of Aug15, no changes

quantheory commented 1 year ago

@ndkeen Thanks! This is a valuable hint, since it looks like the same bug, and this is the first case where anyone has seen it on master without any physics changes at all. (@jli628 and @wlin7 are helping me to debug this.)

rljacob commented 1 year ago

Wait, why do you think it's the same bug? Reprosum complaining about a NaN only means a NaN was produced somewhere, not that it came from the same bug.

quantheory commented 1 year ago

@rljacob Mainly because all the DEBUG cases where I've encountered the issue fail at the same line in ELM (in the lnd2atm module). And I find it suspicious that it's this one particular test case that keeps failing. But you're right, it could be that some of the non-DEBUG runs are failing differently. I just find that less likely due to Occam's razor.

rljacob commented 1 year ago

But the error message Noel posted doesn't show this coming from ELM. And it's not threaded. The only things in common are the resolution and the case.

ndkeen commented 1 year ago

It's true that I don't actually know how Sean's job failed, but I thought I would just try a few things, assuming the failure I hit would be related; it could easily be something else. Note that I added a few more fails/passes in my comment above. It does not seem to matter whether .allactive-mach_mods is present. To narrow further, are there easy things to try instead of ne4_oQU240.WCYCL1850NS?

quantheory commented 1 year ago

I'm now having trouble getting any run to fail with DEBUG enabled. (I guess the optimization changes in DEBUG have enough of an effect?) So I looked here: https://github.com/E3SM-Project/E3SM/blob/8d81d0b1ace84190545428cb197a116d60356c7c/components/elm/src/main/lnd2atmMod.F90#L120-L122

Line 121 there is where some DEBUG runs have crashed with the gustiness changes, due to a negative value of eflx_lwrad_out_grc(g) inside a sqrt call.

So, I added these lines just before line 121:

       if (eflx_lwrad_out_grc(g) < 0._r8) then
          print *, "At g = ", g, ", eflx_lwrad_out_grc = ", eflx_lwrad_out_grc(g)
          call endrun("bad eflx_lwrad_out_grc value")
       end if

And sure enough, the test SMS_P128x1_Ld10.ne4_oQU240.WCYCL1850NS.pm-cpu_intel.allactive-mach_mods now crashes with:

43:  At g =          153 , eflx_lwrad_out_grc =   -3.79971951551529
 43:  ENDRUN:bad eflx_lwrad_out_grc value
 43:  ERROR: Unknown error submitted to shr_abort_abort.

So the error on master with no threading does seem to be the same as the error with the gustiness mods with threading. Or at least, it generates NaN in the same line of code.

wlin7 commented 1 year ago

Update of testing using the branch (https://github.com/quantheory/E3SM/tree/quantheory/gustiness-fixes-for-v3) for the gustiness PR #5850

  1. Threading is nbfb for the coupled test. The limited steps completed by SMS_P12x2.ne4_oQU240.WCYCL1850NS.pm-cpu_intel and SMS_D_P24x1.ne4_oQU240.WCYCL1850NS.pm-cpu_intel gave different results starting from step 2. Note that the SMS_D_P36x1 results are bfb with SMS_D_P24x1.
  2. PET_D_Ld1_P640x2.ne30pg2_EC30to60E2r2.WCYCL1850.pm-cpu_gnu also failed the threading test, though it can run stably.
  3. The threading issue does not exist with F cases (active atm and lnd, with mpassi in prescribed sea-ice mode), e.g. PET_Ld1_P12x2.ne4_oQU240.F2010.pm-cpu_intel. Also see the further notes below on DEBUG mode.
  4. SMS_D_P12x2.ne4_oQU240.WCYCL1850NS.pm-cpu_intel first reported a fatal floating invalid in the lnd component, while SMS_D_P24x1.ne4_oQU240.WCYCL1850NS.pm-cpu_gnu failed with NaN produced in physics_state by the cam_radheat package. Being more familiar with atm debugging, I focused on pm-cpu_gnu for further testing.
  5. One step before crashing, during step 3, chunk 117, state%t(ncol=12,72) first showed a cold T of 188.48 K at the bottom level. T at the level above was normal at 254 K. The grid cell is at (58.16N, 241E) with lndfrac=1.0. The cold temperature was produced within the macro-micro substepping, accumulated entirely from macrop (clubb) tendencies.
  6. Upon entering macrop (clubb) substepping, anomalous values were seen in the surface data obtained from the coupler: cam_in%ts=240.2 and cam_in%shf=-83.18. During the previous steps, cam_in%shf was ~ -22 and cam_in%ts was ~258, about 1 degree warmer than the bottom-level air temperature. (It is odd that shf to the atm is negative when the surface is warmer.) The negative shf brought the bottom air temperature down from 253.4 K (before entering macmic substepping) to 188.46 K after completing 6 steps of macrop (clubb) subcycling.
  7. One step later, cam_in%ts became NaN, which was fed to the clubb update, leading to NaN values in state%t at all levels of the column. The run would then proceed to report a fatal error due to NaN values produced by cam_radheat.
  8. The error apparently does not originate directly in cam_radheat; the anomalies were triggered at least one physics (atm/lnd coupling) step earlier. Why the land processes returned a sudden drop in ts may hold the clue.

Note: the same cause could be responsible for #5955, and particularly #5957. Those tests use the master branch without the new gustiness code.

Further notes: the threading non-bfb appears to exist only with DEBUG=TRUE. For example, the PET_Ld1_P640x2.ne30pg2_EC30to60E2r2.WCYCL1850.pm-cpu_intel threading comparison is a PASS, while PET_D_Ld1_P640x2.ne30pg2_EC30to60E2r2.WCYCL1850.pm-cpu_intel is a FAIL. The same holds for pm-cpu_gnu. PET_D_Ld1_P12x2.ne4_oQU240.F2010.pm-cpu_intel also failed the threading comparison, unlike the non-DEBUG PET F2010 test.

quantheory commented 1 year ago

@wlin7 This is very interesting. It would be interesting to know some of the cam_out values produced immediately before cam_in%shf starts to become very negative. In particular, ubot, vbot, tbot, qbot, and ugust. Is this something you can readily provide for the test case you mention above?

If the SHF seems inconsistent with the temperatures, this could mean that the energy balance iteration in the land code is failing to converge. I could try increasing the iteration count for all of those, or specifically the ones over land, and see if that avoids the crashes.
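To make that hypothesis concrete, here is a toy standalone sketch (emphatically not the ELM scheme; the bare-bones balance flwds = sigma*ts**4 + shf, the coupling coefficient, and all numbers are invented for illustration). A simple Picard iteration is stopped at a fixed cap; with a small cap, the returned ts/shf pair leaves a sizeable energy imbalance, and raising the cap shrinks it:

    program capped_energy_balance_sketch
      ! Toy sketch, not ELM code: a surface energy-balance iteration stopped at
      ! a fixed iteration cap returns fluxes that do not close the balance.
      implicit none
      integer, parameter :: r8 = selected_real_kind(12)
      real(r8), parameter :: sb    = 5.67e-8_r8  ! Stefan-Boltzmann constant [W m-2 K-4]
      real(r8), parameter :: flwds = 90.8_r8     ! downward LW [W m-2]; no SW, as in the debug output later in this thread
      real(r8), parameter :: tbot  = 255.0_r8    ! bottom-level air temperature [K]
      real(r8), parameter :: cond  = 6.0_r8      ! invented turbulent coupling rho*cp/r_a [W m-2 K-1]
      integer  :: itmax, it
      real(r8) :: ts, shf, resid

      do itmax = 3, 30, 9
         ts = tbot
         do it = 1, itmax                        ! capped Picard iteration for ts
            ts = tbot + (flwds - sb*ts**4)/cond
         end do
         shf   = cond*(ts - tbot)                ! sensible heat flux at exit
         resid = flwds - sb*ts**4 - shf          ! leftover energy imbalance
         print '(a,i3,3(a,f10.3))', 'itmax=', itmax, '  ts=', ts, '  shf=', shf, &
              '  imbalance=', resid
      end do
    end program capped_energy_balance_sketch

With these made-up numbers the imbalance left at itmax=3 is roughly 20 W m-2 and essentially vanishes by itmax=30; the real ELM solver is far more involved, so this only illustrates the failure mode being hypothesized, not what ELM actually does.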

wlin7 commented 12 months ago

It could just be an atm initial data problem. With a new atm IC remapped from ne30, the test can run without problems (file below, on NERSC): /global/cfs/cdirs/e3sm/inputdata/atm/cam/inic/homme/NGD_v3atm.ne30pg2_mapped_to_ne4np4.eam.i.0001-01-01-00000.c20230106.nc

The failures of other small-grid tests on cdash, such as ERS.ne11_oQU240.WCYCL1850NS.pm-cpu_intel, could be due to the same cause. I will test with a new IC for ne11 as well.

More to follow.

wlin7 commented 12 months ago

> It would be interesting to know some of the cam_out values produced immediately before cam_in%shf starts to become very negative. In particular, ubot, vbot, tbot, qbot, and ugust. Is this something you can readily provide for the test case you mention above?

Good point, @quantheory. I did print those cam_out values every step towards the end of tphysbc. This may become irrelevant now that a new IC can get the model to run, but for the record, the values do not look suspicious at step 2 (before the cam_in%ts drop and the large negative cam_in%shf, which were seen at step 3). The first number on each line is nstep.

*** DEBUG post-cam_export *** psl/zbot/tbot:           2   101840.02055219682        11.240886502646099        254.90734439146630

*** DEBUG  post-cam_export *** ubot/vbot/ugust:           2  -2.4379296890397854E-013  -7.4041353829012924E-012   2.2714749141729875
*** DEBUG post-cam_export *** thbot,qbot,pbot:           2   255.01695655969468        7.4604089547123688E-004   93844.214121831232
*** DEBUG post-cam_export *** netsw,flwds:           2   0.0000000000000000        90.808306799755101
*** DEBUG post-cam_export *** precc,precl:           2   0.0000000000000000        1.1684427779309500E-005
quantheory commented 5 months ago

I notice that this issue is still open. Is anyone still investigating this, or should we close this, since updating the IC file seems to have fixed the issue?

quantheory commented 5 months ago

Coincidentally, I just found out today that we can still trigger this issue on maint-2.0 by both messing with the CLUBB time step and implementing gustiness changes. (Actually, it was @kchong75 who discovered this.) And it can happen well after initialization, so it's not just an IC issue.

I'm inclined to believe that this issue is due to some part of the land model that is just very close to being numerically unstable, rather than a straightforward bug, but if we find a way to make these crashes stop or become less likely, it may be worth making a PR...