E3SM-Project / scream

Fork of E3SM used to develop an exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

NaN check for field surf_radiative_T with ne120 and ne256 monogrid #2183

Open ndkeen opened 1 year ago

ndkeen commented 1 year ago

Using ne120 monogrid and the recently created 128-vertical-level IC, I tried to run longer on pm-gpu, but it failed.

I have two cases that fail the same way, one with 43 nodes and one using 85 nodes. Both stop at model date = 00010630, step 17336.

First I see the "Tl1_1 has 1 values <= allowable value" warning, immediately followed by the CAAR warnings:

  0: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
119: WARNING: Tl1_1 has 1 values <= allowable value.  Resetting to minimum value.
106: WARNING:CAAR: k=128,theta(k)=99.951741<100.000000=th_thresh, applying limiter
106: WARNING:CAAR: k=128,theta(k)=99.524598<100.000000=th_thresh, applying limiter
106: WARNING:CAAR: k=128,theta(k)=98.575935<100.000000=th_thresh, applying limiter

And then first Error! message:

106:   what():  /global/cfs/cdirs/e3sm/ndk/repos/se53-feb13/components/eamxx/src/share/atm_process/atmosphere_process.cpp:307: FAIL:
106: false
106: Error! Failed post-condition property check (cannot be repaired).
106:   - Atmosphere process name: SurfaceCouplingImporter
106:   - Property check name: NaN check for field surf_radiative_T
106:   - Atmosphere process MPI Rank: 106
106:   - Message: FieldNaNCheck failed.
106:   - field id: surf_radiative_T[Physics PG2] <double:COL>(2008) [K]
106:   - entry (179399)
106:   - lat/lon: (-32.700373, 291.937718)
106:
106:  *************************** INPUT FIELDS ******************************
106:
106:   ------- INPUT FIELDS -------
106:
106:  ************************** OUTPUT FIELDS ******************************
106:      T_2m<COL>(2008)
106:
106:   T_2m(358)
106:     147.481,
106:  -----------------------------------------------------------------------
106:      qv_2m<COL>(2008)
106:
106:   qv_2m(358)
106:     1.99301e-07,
106:  -----------------------------------------------------------------------
106:      sfc_alb_dif_nir<COL>(2008)
106:
106:   sfc_alb_dif_nir(358)
106:     0.222123,
106:  -----------------------------------------------------------------------
106:      sfc_alb_dif_vis<COL>(2008)
106:
106:   sfc_alb_dif_vis(358)
106:     0.133637,
106:  -----------------------------------------------------------------------
106:      sfc_alb_dir_nir<COL>(2008)
106:
106:   sfc_alb_dir_nir(358)
106:     0.216921,
106:  -----------------------------------------------------------------------
106:      sfc_alb_dir_vis<COL>(2008)
106:
106:   sfc_alb_dir_vis(358)
106:     0.133422,
106:  -----------------------------------------------------------------------
106:      snow_depth_land<COL>(2008)
106:
106:   snow_depth_land(358)
106:     0.000135651,
106:  -----------------------------------------------------------------------
106:      surf_evap<COL>(2008)
106:
106:   surf_evap(358)
106:     -5.29259e-07,
106:  -----------------------------------------------------------------------
106:      surf_lw_flux_up<COL>(2008)
106:
106:   surf_lw_flux_up(358)
106:     -10.8045,
106:  -----------------------------------------------------------------------
106:      surf_mom_flux<COL,CMP>(2008,2)
106:
106:   surf_mom_flux(358,:)
106:     -8.29575, -2.37998,
106:  -----------------------------------------------------------------------
106:      surf_radiative_T<COL>(2008)
106:
106:   surf_radiative_T(358)
106:     -nan,
106:  -----------------------------------------------------------------------
106:      surf_sens_flux<COL>(2008)
106:
106:   surf_sens_flux(358)
106:     3027.75,
106:  -----------------------------------------------------------------------
106:      wind_speed_10m<COL>(2008)
106:
106:   wind_speed_10m(358)
106:     30.1183,
106:  -----------------------------------------------------------------------
/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se53-feb13/t.se53-feb13.F2010-SCREAMv1.ne120pg2_ne120pg2.pm-gpu.n043t4xX.L128
ndkeen commented 1 year ago

With ne256 monogrid, which we were already using, I also see a similar error. It happens at approximately a similar model date, model date = 00010525 (with ne120 failing about a month after this).

  0: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
311: WARNING: Tl1_1 has 1 values <= allowable value.  Resetting to minimum value.
285: WARNING:CAAR: k=128,theta(k)=99.975185<100.000000=th_thresh, applying limiter
285: WARNING:CAAR: k=128,theta(k)=99.975185<100.000000=th_thresh, applying limiter
285: WARNING:CAAR: k=128,theta(k)=99.723051<100.000000=th_thresh, applying limiter
...
236:   what():  /global/cfs/cdirs/e3sm/ndk/repos/se53-feb13/components/eamxx/src/share/atm_process/atmosphere_process.cpp:307: FAIL:
236: false
236: Error! Failed post-condition property check (cannot be repaired).
236:   - Atmosphere process name: SurfaceCouplingImporter
236:   - Property check name: NaN check for field surf_radiative_T
236:   - Atmosphere process MPI Rank: 236
236:   - Message: FieldNaNCheck failed.
236:   - field id: surf_radiative_T[Physics PG2] <double:COL>(4096) [K]
236:   - entry (815857)
236:   - lat/lon: (-33.207194, 291.181688)
236:
236:  *************************** INPUT FIELDS ******************************
236:
236:   ------- INPUT FIELDS -------
236:
236:  ************************** OUTPUT FIELDS ******************************
236:      T_2m<COL>(4096)
236:
236:   T_2m(3696)
236:     152.521,
236:  -----------------------------------------------------------------------
236:      qv_2m<COL>(4096)
236:
236:   qv_2m(3696)
236:     1.44605e-06,
236:  -----------------------------------------------------------------------
236:      sfc_alb_dif_nir<COL>(4096)
236:
236:   sfc_alb_dif_nir(3696)
236:     0.221189,
236:  -----------------------------------------------------------------------
236:      sfc_alb_dif_vis<COL>(4096)
236:
236:   sfc_alb_dif_vis(3696)
236:     0.120901,
236:  -----------------------------------------------------------------------
236:      sfc_alb_dir_nir<COL>(4096)
236:
236:   sfc_alb_dir_nir(3696)
236:     0.215391,
236:  -----------------------------------------------------------------------
236:      sfc_alb_dir_vis<COL>(4096)
236:
236:   sfc_alb_dir_vis(3696)
236:     0.12062,
236:  -----------------------------------------------------------------------
236:      snow_depth_land<COL>(4096)
236:
236:   snow_depth_land(3696)
236:     0.000192775,
236:  -----------------------------------------------------------------------
236:      surf_evap<COL>(4096)
236:
236:   surf_evap(3696)
236:     -6.77864e-06,
236:  -----------------------------------------------------------------------
236:      surf_lw_flux_up<COL>(4096)
236:
236:   surf_lw_flux_up(3696)
236:     -48.6872,
236:  -----------------------------------------------------------------------
236:      surf_mom_flux<COL,CMP>(4096,2)
236:
236:   surf_mom_flux(3696,:)
236:     -33.4984, -10.6215,
236:  -----------------------------------------------------------------------
236:      surf_radiative_T<COL>(4096)
236:
236:   surf_radiative_T(3696)
236:     -nan,
236:  -----------------------------------------------------------------------
236:      surf_sens_flux<COL>(4096)
236:
236:   surf_sens_flux(3696)
236:     5089.09,
236:  -----------------------------------------------------------------------
236:      wind_speed_10m<COL>(4096)
236:
236:   wind_speed_10m(3696)
236:     61.7287,
236:  -----------------------------------------------------------------------
236:
236:
236: Program received signal SIGABRT: Process abort signal.
236:
236: Backtrace for this error:

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se53-feb13/t.se53-feb13.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.long
ndkeen commented 1 year ago

Back to the ne120 case. I wanted to get a restart closer to the crash. First I restarted from the restart files in the case for 1 month to reach model date = 00010601. Then I switched to using ndays instead of nmonths as the unit and asked to run for 30 days with a restart every 10th day. It actually completed and ran further than the previous attempt, reaching model date = 00010701. Then I tried to run a few more days and it fails at model date = 00010706 with a different error:

106: WARNING:CAAR: k=127,theta(k)=75.206669<100.000000=th_thresh, applying limiter
106: Bad dphi, dp3d, or vtheta_dp; label: 'Vertical remap: Negative (or nan) layer thickness detected, aborting!'; see hommexx.errlog.172.106
106: Exiting...
106: MPICH ERROR [Rank 106] [job id 5688967.0] [Fri Feb 24 10:05:54 2023] [nid008469] - Abort(101) (rank 106 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 101) - process 106
106:
106: aborting job:
106: application called MPI_Abort(MPI_COMM_WORLD, 101) - process 106
106: Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
106: ERROR: For the pool allocator labeled "\360\244^^W^@^@^@^@AKL's primary memory pool":
106: ERROR: Trying to free an invalid pointer
106: This means you have either already freed the pointer, or its address has been corrupted somehow.
106: terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'
106:
106: Program received signal SIGABRT: Process abort signal.
106:
106: Backtrace for this error:
106: #0  0x14b77a4d4d6f in ???
106: #1  0x14b77a4d4cdb in ???
106: #2  0x14b77a4d6374 in ???
106: #3  0x14b78970b608 in _ZN9__gnu_cxx27__verbose_terminate_handlerEv
106:    at ../../../../cpe-gcc-12.1.0-202208101649.1dfb26392197c/libstdc++-v3/libsupc++/vterminate.cc:95
ndkeen commented 1 year ago

I tried another ne120 case using the Feb 24th repo that also fails in the same place, model date = 00010630, and is BFB with the cases noted above through the last nstep dump at nstep = 199980.

The error is:


210: [DIRK] WARNING! Newton reached max iteration count, with deltaerr = 35858.08878763050597627
210: Bad dphi, dp3d, or vtheta_dp; label: 'DIRK Newton loop np1'; see hommexx.errlog.340.210
210: Exiting...
210: MPICH ERROR [Rank 210] [job id 5707327.0] [Sat Feb 25 14:48:03 2023] [nid008449] - Abort(102) (rank 210 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 102) - process 210
210:
210: aborting job:
210: application called MPI_Abort(MPI_COMM_WORLD, 102) - process 210
210: Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
210: ERROR: For the pool allocator labeled "P\221\350#^@^@^@^@AKL's primary memory pool":
210: ERROR: Trying to free an invalid pointer
210: This means you have either already freed the pointer, or its address has been corrupted somehow.
210: terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'

label: DIRK Newton loop np1
time-level 0
lat -5.876747062738510e-01 lon  5.052728184523584e+00
ie 69 igll 3 jgll 3 lev 0: bad dphi
level                   dphi                   dp3d              vtheta_dp
    0                   -nan  8.125531673431398e+00  1.006722367006377e+04

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se55-feb24/f120.F2010-SCREAMv1.ne120pg2_ne120pg2.se55-feb24.gnugpu.n085a4xX.L128.coldTout
ndkeen commented 1 year ago

Regarding trying to make a restart file closer to the crash, I think I was tripped up again by https://github.com/E3SM-Project/scream/issues/2110 when I tried to change the units. It may be that we should disregard any CONTINUE run that was done after changing units. I can stick with nmonths and try again.

ndkeen commented 1 year ago

I'm also not sure I'm allowed to change the output for a restarted run. That is, run for 6 months, then try to run another month with output changed to:

Fields:
  Physics PG2:
    Field Names:
    - ps
    - omega
    - horiz_winds
    - T_mid
    - qv
    - qc
    - qr
    - qi
    - tke
    - surf_sens_flux
    - surf_evap
    - surf_radiative_T
    - precip_liq_surf_mass
Max Snapshots Per File: 1
output_control:
  Frequency: 1
  MPI Ranks in Filename: false
  Timestamp in Filename: true
  avg_type_in_filename: true
  frequency_in_filename: true
  frequency_units: nsteps

So, just to make sure we make some progress with this, I started a new case with all of this extra output from day 1. Yes, it is way too much (~25TB), but we can hopefully learn something from it and delete what we don't need, while I also try smaller cases to verify if I can change output streams on restart.

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se55-feb24/f120.F2010-SCREAMv1.ne120pg2_ne120pg2.se55-feb24.gnugpu.n043a4xX.L128.coldTout.extra

It fails at exact same location, with exact same error message (as expected).

With 43 pm-gpu nodes, adding this extra output runs at only 75% of the performance of the previous run with "normal" output, which is maybe impressive.

bartgol commented 1 year ago

@ndkeen when restarting output, the only change you are allowed to make is removing some variables from the output field list.

Any other change is likely not going to work, possibly without even generating an error.

However, you can add new streams. The AD will still try to restart them though, so you must set

Restart:
  Perform Restart: false

in the yaml file for that stream. You can also set this for old streams, in which case you can change as many parameters as you want since, for all practical purposes, it's like a completely new stream.

ndkeen commented 1 year ago

I was following what @crterai had suggested, but I don't think either of us realized these restrictions. And I'm not sure I follow.

bartgol commented 1 year ago

Yes, we don't have a user guide for model output (todo list!). I saw "I'm also not sure I'm allowed to change the output for a restarted run." and thought I'd add some comments.

Bottom line is that a restarted run cannot change how/what it writes to file (with the exception that you can remove some vars). If you need to change, you need to "not restart" the output by setting Perform Restart: false.

crterai commented 1 year ago

Started looking at the single timestep level output data from the ne120 simulation that @ndkeen ran, and this is what the global minimum temperature for T_mid@bot does leading up to the crash. Note that the x-axis should be 'Steps' rather than days (thanks Noel for catching this!).

image

Location of that global minimum T in the last timestep matches up with the lat/lon given by the error message, [-32.70037346, 291.93771758], which is just east of the Andes:

image

So this cold T also occurs adjacent to topography. Finally, a gif that shows the animation of bottom level variables (T_mid, qc, qr, and qi) in the last day leading up to the crash. https://portal.nersc.gov/cfs/e3sm/terai/SCREAM/v1_analysis/ne120_Tbot_AND_q_CrashPoint.gif
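For reference, a minimal xarray sketch of that lat/lon check (the output path, dimension names, and lat/lon variable names here are placeholders/assumptions, not taken from the actual run scripts):

import xarray as xr

ds = xr.open_mfdataset("output/*.nc")        # placeholder path to the per-step output files
t_bot = ds["T_mid"].isel(lev=-1)             # bottom model level, dims (time, ncol)

tmin = t_bot.min(dim="ncol")                 # global-min timeseries (the curve plotted above)
it = int(tmin.argmin())                      # timestep with the overall coldest value
icol = int(t_bot.isel(time=it).argmin())     # column index of that value

print("coldest T_mid@bot:", float(t_bot[it, icol]), "K")
print("lat/lon:", float(ds["lat"][icol]), float(ds["lon"][icol]))
# expected to land near the (-32.700373, 291.937718) reported by the NaN-check error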

crterai commented 1 year ago

And this is the dynamics field in the bottom level for the same time period. https://portal.nersc.gov/cfs/e3sm/terai/SCREAM/v1_analysis/ne120_Tbot_AND_DYN_CrashPoint.gif The noise that starts to show up in the U and V fields towards the end of the animation looks a bit unphysical. @mt5555 or @oksanaguba, any thoughts? Note that this is all on the pg2 grid - not the dynamics grid.

ndkeen commented 1 year ago

Here is an attempt to plot the min T_mid per vertical level -- this is just the last few steps before the crash. I mislabeled "Level1" -- it should be "Level126".

whannah1 commented 1 year ago

@ndkeen can you redo that plot with only the 10 or 20 lowest levels?

ndkeen commented 1 year ago

Sure. Note I mislabeled one curve above as Level1 -- all the cold T lines are happening near the ground (highest vertical level indices). This would be the ~15 curves nearest the surface.

Note: to parse the data, I'm using

import numpy as np
import xarray

fx = xarray.open_mfdataset(nc)                     # nc: list/glob of the per-step output files
T_mid_min = fx['T_mid'][:, :, 127]                 # level index 127 = bottom of the 128 levels
T_mid_min_timeseries = np.min(T_mid_min, axis=1)   # min over all columns at each step

i.e., that's what I am labelling Level127.

lastday-bottom

whannah1 commented 1 year ago

Since Level 126 is leading the charge it seems safe to say that we need to be looking away from the lowest level to figure this out.

PeterCaldwell commented 1 year ago

Agreed. I'm curious whether the U and V changes lead the T change (which could be investigated by plotting the timeseries for all 3 variables divided by their pre-coldT ave value for the cell that eventually crashed).

I also find it odd that there's a clear front in V that propagates from the bottom left of the figure and the model seems to crash when this front reaches the problem area. It also looks to me (though it's hard to tell) like omega doesn't get weird - just horizontal winds.

mt5555 commented 1 year ago

Latest results I think remain consistent with the old hypothesis that this is a lack of vertical mixing in stable boundary layer conditions:

  1. We see a potentially similar behavior in HS+topo, https://acme-climate.atlassian.net/wiki/spaces/NGDNA/pages/3599073307/Cold-T+like+issue+in+Held+Suarez (which has no physics - suggesting it's a dycore phenomenon not triggered by physics, but some boundary layer mixing is needed to control it)
  2. We see a very similar thing in EAM at NE256 (and occasionally in the RRM in the high res region - see: https://github.com/E3SM-Project/E3SM/issues/5089 )
  3. It was controlled in EAM by increasing the background vertical diffusion in CLUBB (by reducing lmin_coef from 0.1 to 0.025)
  4. It would be nice to try adjusting the similar SHOC parameter: decrease Scalar Ckh_s_max and Scalar Ckm_s_max from 0.1 to 0.025 -- ignore this! see @bogensch's comments below

Based on the RRM results in E3SM#5089 above, a workaround which doesn't fix the problem but prevents the run from crashing could be to increase the theta limiter min value from 100K to 200K

PeterCaldwell commented 1 year ago

consistent with the old hypothesis that this is a lack of vertical mixing in stable boundary layer conditions

We need to be a bit careful with wording here. I'm sure that with enough vertical mixing we won't have coldT problems, but that doesn't mean that the deficiency in our simulations is vertical mixing... just that it's a convenient patch. I'd really like to understand the actual source of trouble even if we end up using artificial diffusion to fix it.

bogensch commented 1 year ago

lmin_coef in CLUBB and Ckm_s_max in SHOC are two completely different parameters and NOT comparable. Reducing Ckh_s_max in SHOC will for certain lead to more frequent cold T.

whannah1 commented 1 year ago

@bogensch @PeterCaldwell is it really even possible to have this rate of cooling resulting from a stable layer? The temperature drop seems extreme, and at one point I think I did a rough estimate that suggested the cooling rate was much higher than the longwave cooling rate.

Or maybe I'm misunderstanding how the stable layer problem works... is longwave cooling the primary cooling mechanism in a situation where a stable layer completely shuts down turbulent mixing?

PeterCaldwell commented 1 year ago

Interesting point, Walter. I think there are 2 different things going on here. You're right to ask how the layer is cooling so rapidly - is it LW cooling, evaporation/sublimation, mass flux divergence (unlikely), negative surface fluxes, or ???. Looking at the energy budget for this layer would be very useful for answering this. The fact that we're in a stable BL regime is kind of independent of the reason for cooling... it only matters in the sense that a regime with more vertical mixing wouldn't allow the surface layers to get so cold.

mt5555 commented 1 year ago

Are you sure "mass flux divergence" is unlikely? I think that's what it would be in the case of Held-Suarez forcing (although we haven't actually confirmed the HS coldT issues are the same as this one).

From our earlier results, increasing horizontal diffusion does help quite a bit. I'm against that, probably for the same reason Pete is against increasing the vertical diffusion :-) It's so isolated and localized, it would be a shame to crank up the diffusion globally (vertical or horizontal).

whannah1 commented 1 year ago

I recall from earlier analysis that the surface heat fluxes were acting to oppose the cooling, which makes me think that rad and micro (and maybe dynamics) are the prime suspects.

crterai commented 1 year ago

I think I agree with Walter that all the cold T's we've analyzed so far have not been driven by the surface (due to surface heat fluxes) -- SHF generally acted to counteract the cold air temperature. And I suspect rad is not the culprit, unless something is horribly wrong, since longwave cooling typically acts to cool from a warmer place toward a cooler place, and nothing close to that area's surface seems to be getting colder than the air itself. So I think that microphysics or dynamics is the likely cause. In the v0 model, I believe we were seeing that there was a large amount of divergence that was causing the cooling.

I've created a height-time cross-section of the column that experiences the coldest T in the second-to-bottom layer. It looks like the strong cooling precedes the cloud water formation, and there's a distinct convergence/divergence signal that corresponds with the cooling (from the omega field), so I'm guessing that it's the dynamics again in this case.

image

And the associated convergence/divergence from domega/dP:

image
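For reference, a rough sketch of one way to compute such a d(omega)/dp proxy from the per-step output (the path, the p_mid field name, and the column index are placeholders/assumptions):

import xarray as xr

ds = xr.open_mfdataset("output/*.nc")   # placeholder path to the per-step output files
icol = 123456                           # placeholder: column index of the cold-T grid box

omega = ds["omega"].isel(ncol=icol)     # vertical pressure velocity for that column, (time, lev)
p_mid = ds["p_mid"].isel(ncol=icol)     # mid-level pressure, (time, lev); assumed to be in the output

# simple vertical finite difference; the sign distinguishes convergence from divergence
domega_dp = omega.diff("lev") / p_mid.diff("lev")
domega_dp.plot(x="time")                # quick-look height-time cross-section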

whannah1 commented 1 year ago

@crterai that's a nice result! Can you zoom in on time 40-60?

crterai commented 1 year ago

image

I've changed the T diff color bars to a wider color range.

PeterCaldwell commented 1 year ago

I've hidden my atm sci PhD so you can't take it away from me for asking such a dumb question... I understand from the ideal gas law how decreasing pressure decreases temperature, so it makes intuitive sense that mass flux divergence would cool a layer. I can't figure out how to convert Chris' new divergence plots into a cooling rate though. It seems like $p=\rho R_d T$ means that $dT$ should be proportional to $dp$ rather than $d\omega/dp$. Somewhat equivalently, I can't figure out whether divergence means that the mass of the column above is getting smaller, or whether a cell is literally being evacuated so it becomes cold like outer space. I think sigma-p coordinates imply the former.

whannah1 commented 1 year ago

I think at this point having a breakdown of the temperature tendencies from each process would be really valuable. How much work would it take to get that for dynamics, P3, and SHOC? @AaronDonahue @bartgol

bartgol commented 1 year ago

Mmm, it might take some time, but probably not too much. I don't have time to do it this week, but I can get to it on Monday. TBC, you mean adding something like

void MyAtmProc::run (double dt) {
   // copy the field before this process updates it
   T_start = get_field_out("T_mid");
   run_impl(dt);   // the process's actual run
   // accumulate this process's contribution to the tendency
   T_tend += (get_field_out("T_mid") - T_start) / dt;
}

right?

I think it can be done in half a day.

PeterCaldwell commented 1 year ago

@bartgol - yeah, I think your above pseudo-code is what we want to do. I was wondering whether we could add this to the AD layer rather than putting it in the interface layer for each proc (which I think is what you're doing above)... but ultimately it's fine if we add this sort of code for each process separately.

bartgol commented 1 year ago

Yeah, I can add it at the infrastructure level, triggered by some runtime flags (this requires 2 extra fields for each tracked field, so it must be explicitly requested). I will get to it on Monday (hopefully).

ndkeen commented 1 year ago

@golaz wanted me to try continuing the run to reach a year anyway. I tried a case with BFBFLAG=FALSE (normally TRUE) and it fails in exactly the same way -- it looks BFB for those 6 months, which likely means the BFBFLAG is not having an impact for these runs. Is there something else I might try just to get a year? Or is this cold T issue so severe that it wouldn't make sense anyway?

mt5555 commented 1 year ago

Noel: I think this would let us run longer: change vtheta_thresh from 100d0 to 200d0.

In the Fortran code, it's a parameter in src/share/control_mod.F90.

But I couldn't find it in the C++ code - it might be a namelist variable?

Edit: "vtheta_thresh" in EAMxx appears to come from the HOMME fortran namelist. Oksana suggested this might work: atmchange vtheta_thresh=200

oksanaguba commented 1 year ago

To clarify, vtheta_thresh is probably not in namelist_scream.xml if no one put it there (I did not). Noel could go and change it in control_mod:

./share/control_mod.F90: real (kind=real_kind), public :: vtheta_thresh = 100.d0 

and the EAMxx code should also print it to stdout.

ndkeen commented 1 year ago

Well this is what I'm trying:

In namelist_scream.xml

    <transport_alg type="integer">12</transport_alg>
    <vtheta_thresh type="real">200.0</vtheta_thresh> <-- added this line

And I see:

login36% zgrep vthet run/ho*
run/homme_atm.log.5859780.230302-104540.gz: readnl: vtheta_thresh     =    100.00000000000000     
run/homme_atm.log.5860545.230302-110906.gz: readnl: vtheta_thresh     =    100.00000000000000     
run/homme_atm.log.5861355.230302-113213.gz: readnl: vtheta_thresh     =    100.00000000000000     
run/homme_atm.log.5862777.230302-115339.gz: readnl: vtheta_thresh     =    100.00000000000000     
run/homme_atm.log.5868792.230302-153749: readnl: vtheta_thresh     =    200.00000000000000     
ndkeen commented 1 year ago

I ran the ne120 for 13 months after setting vtheta_thresh=200. This was done at the beginning of the run (i.e., as opposed to reading in a restart and changing the threshold).

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se55-feb24/t.se55-feb24.F2010-SCREAMv1.ne120pg2_ne120pg2.pm-gpu.n085t4xX.L128.vth200

This case was primarily for Chris G to look at output for general sanity checks -- that is, we are not trying to debug the crash here, we just wanted to get ~a year of ne120 data -- but I thought it might be informative to know that the case will run further with the increased threshold.

Meanwhile, Chris T asked me to run our ne120 case with different output up to the crash (back to normal vtheta threshold again). I made that run here: /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se55-feb24/f120.F2010-SCREAMv1.ne120pg2_ne120pg2.se55-feb24.gnugpu.n043a4xX.L128.CTout00b

The output used was:

Casename: ${CASE}.scream.hi
Averaging Type: Instant
# One output every 31 days if output frequency is set to once per hour
Max Snapshots Per File: 1
Fields:
  Physics ${PHYSICS_GRID_TYPE}:
    Field Names:
      - sgs_buoy_flux
      - LW_flux_dn
      - LW_flux_up
      - rad_heating_pdel
      - pseudo_density
      - surf_sens_flux
      - T_mid
output_control:
  Frequency: 1
  frequency_units: nsteps
  MPI Ranks in Filename: false
  Timestamp in Filename: true
  avg_type_in_filename: true
  frequency_in_filename: true

I'm also having trouble with restarting that impacted my work for both of these efforts. Documenting that here https://github.com/E3SM-Project/scream/issues/2110

ndkeen commented 1 year ago

Has anyone had a chance to look at the output for the last run (case noted in the above comment)? It is using 17TB of space, so I didn't want to let it get too old.

crterai commented 1 year ago

I did a quick check of the radiative heating from the ne120 case that crashed on June 30th. While it doesn't seem to be the case that the initial cooling is caused by the radiative heating (bottom row), there is a strong cooling that does occur in the lower atmosphere, flanked by warming layers.

image

If we compare with the actual timestep-to-timestep temperature differences (top row of the plot below), the radiative temperature tendencies (including the cooling) are larger in magnitude (bottom row of the plot below). However, it's not clear whether the radiative cooling is located in the same place as the actual cooling, so I'll need to look at this more carefully.

image
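A minimal sketch of this kind of comparison, assuming the per-step stream above and that rad_heating_pdel divided by pseudo_density gives a heating rate in K/s (that interpretation, the path, and the column index are assumptions to double-check):

import xarray as xr

ds = xr.open_mfdataset("output/*.nc")   # placeholder path to the per-step output files
icol = 123456                           # placeholder: column index of the cold-T grid box
dt = 900.0                              # atmosphere step in seconds (ATM_NCPL = 96 -> 15 min)

T = ds["T_mid"].isel(ncol=icol)                                   # (time, lev), K
dT_dt = T.diff("time") / dt                                       # actual step-to-step tendency, K/s
rad_dT_dt = (ds["rad_heating_pdel"] / ds["pseudo_density"]).isel(ncol=icol)

print("actual dT/dt range [K/s]:   ", float(dT_dt.min()), float(dT_dt.max()))
print("radiative dT/dt range [K/s]:", float(rad_dT_dt.min()), float(rad_dT_dt.max()))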

ndkeen commented 1 year ago

It was suggested that maybe I could run a case with se_ftype=1, but there were reports of this not being an option. I wanted to verify this:

login38% atmquery se_ftype
    namelist_defaults::ctl_nl::se_ftype: 2
login38% atmchange se_ftype=1
ERROR: Invalid value '1' for element 'se_ftype'. Value not in the valid list ('[0, 2]')
bartgol commented 1 year ago

I wasn't aware ftype=1 was an option. Our current XML, as you noticed, checks to ensure it's either 0 or 2. @oksanaguba might comment on whether we should allow atmchange to set it to 1.

ndkeen commented 1 year ago

Noting that I continued the ne120 run with vtheta_thresh=200 for another 2 years. It completed, for a total of about 3 years (last date = 0004-02-01). This is with the same repo as the first year -- i.e., from Feb 24th. It looks like it is reading the correct restart file dates.

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se55-feb24/t.se55-feb24.F2010-SCREAMv1.ne120pg2_ne120pg2.pm-gpu.n085t4xX.L128.vth200

ndkeen commented 1 year ago

I ran a ne120 case on pm-cpu. By accident, I did not set vtheta_thresh, which means it ran with the default value of 100. The case ran beyond the date at which it was stopping above with pm-gpu (model date = 00011205). I will continue this run further, as well as try to continue a run with vtheta_thresh=200. These are cases with output settings Chris G requested, so they may not be ideal for looking at cold T.

/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se59-mar13/t.se59-mar13.F2010-SCREAMv1.ne120pg2_ne120pg2.pm-cpu.n085t128x1cX.L128

The CPU case eventually does fail, but not with the same error as here. https://github.com/E3SM-Project/scream/issues/2252

ndkeen commented 1 year ago

Using scream from March 20th, and including the fix for restarts, I have a case on pm-gpu that does stop at the same date as above (0630) and writes restart files 2 days before, at 0628. Now we need to change the yaml output and try to restart from this point. What would be a good set of outputs to use? -- Update: we figured this out and I restarted with increased outputs (exact steps noted below).

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se60-mar20/f120.F2010-SCREAMv1.ne120pg2_ne120pg2.se60-mar20.gnugpu.n043a4xX.L128.modup.rest178days

login27% cat spiffy2.yaml 
filename_prefix: ${CASE}.scream.tendency
Averaging Type: Instant
Max Snapshots Per File: 1
Fields:
  Physics ${PHYSICS_GRID_TYPE}:
    Field Names:
      - T_mid
      - p3_T_mid_tend
      - shoc_T_mid_tend
      - rrtmgp_T_mid_tend
      - homme_T_mid_tend
output_control:
  Frequency: 1
  frequency_units: nsteps
  MPI Ranks in Filename: false
  Timestamp in Filename: true
  avg_type_in_filename: true
  frequency_in_filename: true

./atmchange Scorpio::output_yaml_files=spiffy2.yaml

login27% atmquery Scorpio::output_yaml_files
    Scorpio::output_yaml_files: spiffy2.yaml

./atmchange physics::mac_aero_mic::shoc::compute_tendencies=T_mid,qv
./atmchange physics::mac_aero_mic::p3::compute_tendencies=T_mid,qv
./atmchange physics::rrtmgp::compute_tendencies=T_mid
./atmchange homme::compute_tendencies=T_mid,qv

./xmlchange CONTINUE_RUN=TRUE

login27% ./xmlquery STOP_N,STOP_OPTION,REST_N,REST_OPTION,CONTINUE_RUN,ATM_NCPL,DEBUG,BFBFLAG,SCREAM_CMAKE_OPTIONS,SCREAM_HACK_XML

Results in group build_component_scream
        SCREAM_CMAKE_OPTIONS: SCREAM_NP 4 SCREAM_NUM_VERTICAL_LEV 128 SCREAM_NUM_TRACERS 10
        SCREAM_HACK_XML: FALSE

Results in group build_def
        DEBUG: FALSE

Results in group run_begin_stop_restart
        STOP_N: 182
        STOP_OPTION: ndays
        REST_N: 89
        REST_OPTION: ndays
        CONTINUE_RUN: TRUE

Results in group run_coupling
        ATM_NCPL: 96

Results in group run_flags
        BFBFLAG: TRUE

case.submit
crterai commented 1 year ago

Here's a plot from the output of the latest run that @ndkeen ran. First, a timeseries of the delta T_mid from timestep to timestep for the grid box that experiences the coldest T in L127 (2nd from bottom level) in the last available timestep output.

image

1) It's odd that the SUM of the processes does not add up to the delta T_mid, but 2) overall, even with the lack of closure, it looks quite convincing that the temperature tendency in homme is the largest contributor to the temperature change. Sanity check: we're not dribbling any temperature tendencies from physics into homme, are we (I always mix up what the different ftypes connote)?

And below, the same data, but looking at the whole-column temperature tendencies. Again, tendencies from homme dominate the temperature tendencies.

image
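For reference, a minimal sketch of this closure check against the tendency fields requested in the output stream (the path, the column/level indices, and the one-step alignment between the differenced T_mid and the tendency samples are assumptions):

import xarray as xr

ds = xr.open_mfdataset("output/*.nc")   # placeholder path to the ${CASE}.scream.tendency files
icol, ilev = 123456, 126                # placeholders: cold-T column and level index
dt = 900.0                              # atmosphere step in seconds (ATM_NCPL = 96)

sel = dict(ncol=icol, lev=ilev)
dT = ds["T_mid"].isel(**sel).diff("time")                          # actual change per step, K
tend_names = ["homme_T_mid_tend", "shoc_T_mid_tend", "p3_T_mid_tend", "rrtmgp_T_mid_tend"]
tend_sum = sum(ds[name].isel(**sel) for name in tend_names) * dt   # summed process tendencies, K per step

# compare the change over step i -> i+1 with the tendencies reported at step i+1
residual = dT.values - tend_sum.isel(time=slice(1, None)).values
print("max |residual| [K]:", abs(residual).max())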

PeterCaldwell commented 1 year ago

Great work Chris and Noel. It's not often I can say "it's the dycore's fault!" and actually have data to back myself up ;-). Though we can't rule out the possibility that the physics which is supposed to compensate the dycore tendency isn't working correctly. A couple of things:

  1. @ndkeen - did this run crash with the "NaN in the surface T" error? I want to make sure the run actually crashed rather than just completing the requested number of steps (i.e. is this a fatal or recoverable event)
  2. I'm surprised RRTMGP isn't doing anything at all. Do we not have clouds in this case? Relatedly, P3 seems to have a weird 3*dt linearly-increasing spike in condensational heating. I wonder what that's about.
  3. I agree that it's odd that the net tendencies occasionally don't add up to the total temperature change. I wonder if some clipping operator (post-condition check/repair?) is acting in that case. Do we know whether post-condition checks would be included in the process tendency?
  4. If I did my math right, 30K of cooling in a 15 min timestep is 2,880K per day... which is ridiculously big (arithmetic spelled out below).
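Spelling out the arithmetic in point 4, with a 15 min (900 s) timestep:

$30\ \mathrm{K} / 900\ \mathrm{s} \times 86400\ \mathrm{s\,day^{-1}} = 2880\ \mathrm{K\,day^{-1}}$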
ndkeen commented 1 year ago

Yes, this case failed the same way (same date, same error) as other pm-gpu ne120 cases.

whannah1 commented 1 year ago

@crterai Are we sure that the units for the radiation tendency are correct and not normalized by dp?

crterai commented 1 year ago

@whannah1 & @PeterCaldwell - here's a plot with both qc and qi and the radiation tendencies with a narrower colorbar for the same column.

image

So there are clouds, both qc and qi. And the units for the rrtmgp temp tendencies say s^-1 K, and I just applied the same operations to the rrtmgp data as I did to all the other data. When we stretch the colorbar, we can see some warming and cooling (I think consistent with how the clouds are), but the magnitudes are smaller in this case.

crterai commented 1 year ago

@PeterCaldwell - it looks like the dt*3 heating is from SHOC, rather than P3. So I think it's the surface fluxes. Not sure why there's this oscillation - I haven't checked surface wind speeds, etc.

ndkeen commented 1 year ago

It was suggested (by @whannah1) that I try adjusting rad_frequency to see if it changes how the ne120 problem fails. I tried 2 cases, one with radf=1 and another with radf=8 (the default is radf=4). Neither of these cases stops at the same date as noted above -- they continue further.

For radf1:

./atmchange rad_frequency=1

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se61-mar23/f120.F2010-SCREAMv1.ne120pg2_ne120pg2.se61-mar23.gnugpu.n043a4xX.L128.radf1

Stops with similar NaN as noted above at model date = 00020125

For radf8:

./atmchange rad_frequency=8

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se61-mar23/f120.F2010-SCREAMv1.ne120pg2_ne120pg2.se61-mar23.gnugpu.n043a4xX.L128.radf8

Completed 2 years model date = 00030101

bartgol commented 1 year ago

Re: units. All processes compute tendencies for field X as X_tend = (X_after - X_before) / dt, so units are [X]/s.

Re: sum of tendencies not matching the overall tendency. That's puzzling, and might deserve some digging.

I would be cautious about saying that "it's the dycore's fault". It might be interesting to see the behavior of other state vars before the crash happens. Perhaps T going bananas is just a downstream effect of another state var becoming unstable.