E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

"Cold T" error with ne256+SPA on pm-gpu #2029

Open ndkeen opened 1 year ago

ndkeen commented 1 year ago

The old issue regarding cold T is here: https://github.com/E3SM-Project/scream/issues/1950

Using the Nov 9th repo with SPA, I hit the following error with ne256 on pm-gpu after 11 days.

184: terminate called after throwing an instance of 'std::logic_error'
184:   what():  /global/cfs/cdirs/e3sm/ndk/repos/se41-nov9/components/scream/src/share/atm_process/atmosphere_process.cpp:256: FAIL:
184: false
184: Error! Failed post-condition property check (cannot be repaired).
184:   - Atmosphere process name: Dynamics
184:   - Property check name: T_mid within interval [130, 500]
184:   - Atmosphere process MPI Rank: 184
184:   - Message: Check failed.
184:   - check name: T_mid within interval [130, 500]
184:   - field id: T_mid[Physics PG2] <double:COL,LEV>(4096,128) [K]
184:   - minimum:
184:     - value: 125.436
184:     - entry: (1507626,127)
184:     - lat/lon: (61.7833, 219.166)
184:   - maximum:
184:     - value: 289.385
184:     - entry: (1507627,125)
184:     - lat/lon: (61.5632, 219.206)

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se41-nov9/f256.F2010-SCREAMv1.ne256pg2_r0125_oRRS18to6v3.se41-nov9.gnugpu.3m.n096a4xX.so1n18.modup.MCS

I do not have any changes in place to avoid cold T by widening guard rails.
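For context, this is a post-condition property check on T_mid failing. A minimal Python sketch of what such a within-interval check reports (illustrative only; not the actual C++ implementation in atmosphere_process.cpp):

import numpy as np

def check_within_interval(field, lb, ub, lat, lon, name="T_mid"):
    # Illustrative stand-in for the within-interval property check.
    # field is (ncol, nlev); lat/lon are (ncol,).
    fmin, fmax = field.min(), field.max()
    if lb <= fmin and fmax <= ub:
        return None  # check passes
    imin = np.unravel_index(np.argmin(field), field.shape)
    imax = np.unravel_index(np.argmax(field), field.shape)
    return ("Check failed: %s within interval [%g, %g]\n"
            "  - minimum: %g at entry %s, lat/lon (%g, %g)\n"
            "  - maximum: %g at entry %s, lat/lon (%g, %g)"
            % (name, lb, ub,
               fmin, imin, lat[imin[0]], lon[imin[0]],
               fmax, imax, lat[imax[0]], lon[imax[0]]))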

ndkeen commented 1 year ago

Using the Nov 16th repo, I am trying another case where I write a restart at day 10.

Also, Peter B suggested I try a new case with the following change:

components/scream/src/physics/shoc/shoc_compute_shr_prod_impl.hpp

  // Turbulent coefficient
  //ndk const Scalar Ck_sh = 0.1;
  const Scalar Ck_sh = 1.0;

which I just launched.

ndkeen commented 1 year ago

The case where I simply re-ran the above with the updated repo failed in exactly the same way (at Atmosphere step = 1586, with the same error message as above). I do have a restart at day 10.

The new case with Ck_sh=1.0 is running and is already beyond step 1586. /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se42-nov16/f256.F2010-SCREAMv1.ne256pg2_r0125_oRRS18to6v3.se42-nov16.gnugpu.3m.n096a4xX.so288n8.wr.ck10

The output settings I had for this run were just carried over from previous experiments and happen to be:

Averaging Type: Instant
Casename: ${CASE}.scream.hi
Fields:
  Physics PG2:
    Field Names:
    - ps
    - omega@bot
    - omega@lev_126
    - horiz_winds@bot
    - qv@bot
    - T_mid@bot
    - T_mid@lev_126
    - surf_sens_flux
    - surf_evap
    - surf_radiative_T
Max Snapshots Per File: 18
output_control:
  avg_type_in_filename: false
  Frequency: 1
  frequency_units: nsteps
  frequency_in_filename: false
  MPI Ranks in Filename: false
  Timestamp in Filename: true

This run failed after date 00010214 (44 days) with a generic bus error. I have restarts every 10th day and could try to see if it will fail again in the same way.

363: 
363: Program received signal SIGBUS: Access to an undefined portion of a memory object.

I restarted and it's already beyond where it failed.

It failed again with what looks to me like an MPICH error (I've seen them before), and I restarted again. The last job ran for only 1 month and stopped (I only asked for another month). The last model date is model date = 00010310. Perhaps we should make a new case that includes the output we want.

ndkeen commented 1 year ago

I've started another 3-month attempt using the following output settings. It may be way too much for what we need to check right now. Let me know what is desired.

Averaging Type: Instant
Casename: \${CASE}.scream.hi
Fields:
  Physics PG2:
    Field Names:
    - cldfrac_tot
    - IceCloudMask
    - qc
    - qi
    - qv
    - T_mid
    - horiz_winds@tom
    - SW_flux_up@tom
    - SW_flux_dn@tom
    - LW_flux_up@tom
    - SW_clrsky_flux_up@tom
    - LW_clrsky_flux_up@tom
    - LiqWaterPath
    - IceWaterPath
    - RainWaterPath
    - RimeWaterPath
    - VapWaterPath
    - ZonalVapFlux
    - MeridionalVapFlux
    - cldtot
    - cldlow
    - cldmed
    - cldhgh
    - horiz_winds
    - qr
    - ps
    - SeaLevelPressure
    - precip_liq_surf_mass
    - precip_ice_surf_mass
    - surf_evap
    - surf_sens_flux
    - surf_mom_flux
    - horiz_winds@bot
    - SW_flux_up@bot
    - SW_flux_dn@bot
    - LW_flux_up@bot
    - LW_flux_dn@bot
    - surf_radiative_T
    - T_2m
    - qv_2m
    - wind_speed_10m
    - tke
    - omega
Max Snapshots Per File: 8
output_control:
  avg_type_in_filename: false
  Frequency: 36
  frequency_units: nsteps
  frequency_in_filename: false
  MPI Ranks in Filename: false
  Timestamp in Filename: true
PeterCaldwell commented 1 year ago

I think you should use the output request that Aaron is using for the 40 day runs. See the runscript in https://acme-climate.atlassian.net/wiki/spaces/NGDNA/pages/3555328064/Run+3+40-Day+production+run+-+DYAMOND+1+Configuration for example. It includes:

## EAMxx settings
./atmchange Scorpio::output_yaml_files=\
${eamxx_out_files_root}/scream_output.Cldfrac.yaml,\
${eamxx_out_files_root}/scream_output.QcQi.yaml,\
${eamxx_out_files_root}/scream_output.QvT.yaml,\
${eamxx_out_files_root}/scream_output.TOMVars.yaml,\
${eamxx_out_files_root}/scream_output.VertIntegrals.yaml,\
${eamxx_out_files_root}/scream_output.HorizWinds.yaml,\
${eamxx_out_files_root}/scream_output.QrPsPsl.yaml,\
${eamxx_out_files_root}/scream_output.SurfVars_128lev.yaml,\
${eamxx_out_files_root}/scream_output.TkeOmega.yaml,\
${eamxx_out_files_root}/scream_output.Temp_2m_min.yaml,\
${eamxx_out_files_root}/scream_output.Temp_2m_max.yaml,\
${eamxx_out_files_root}/scream_output.Qv_2m.yaml
ndkeen commented 1 year ago

I actually don't like the method of pulling those files from an external repo. What I tested yesterday does the exact same thing within the script, and I can do that as well. But the above is with all of those variables included.

PeterCaldwell commented 1 year ago

But you've included all the variables in one output file. This is a problem because we generally save the 2d variables every 15 min and the 3d variables every 3 hrs. It's impossible to have one output file with some variables saved at one frequency and other variables saved at a different frequency. It looks like you're saving every 36 steps, which would be every 1 hr with ne1024 dt... I'm not sure what that is at ne256. Also, having everything in one file might be ok at ne256, but would result in monster file sizes at ne1024.
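To make the one-frequency-per-file constraint concrete, here is a hedged sketch (Python, using PyYAML) that derives nsteps frequencies from the ne1024 dt implied above (36 steps/hr, so 100 s) and emits one spec per frequency; the casenames and field lists are placeholders, not the production YAML:

import yaml

def stream(casename, fields, freq_steps, snaps):
    # One output_control Frequency applies to every field in a file,
    # so 2d (15 min) and 3d (3 hr) fields must go to separate files.
    return {"Averaging Type": "Instant",
            "Casename": casename,
            "Fields": {"Physics PG2": {"Field Names": fields}},
            "Max Snapshots Per File": snaps,
            "output_control": {"Frequency": freq_steps,
                               "frequency_units": "nsteps"}}

dt = 3600 // 36  # 100 s per step at ne1024, per the comment above
two_d = stream("output.scream.SurfVars", ["ps", "T_2m"],
               15 * 60 // dt, 96)    # every 9 steps = 15 min; 96/day
three_d = stream("output.scream.QvT", ["qv", "T_mid"],
                 3 * 3600 // dt, 8)  # every 108 steps = 3 hr; 8/day
print(yaml.safe_dump(two_d, sort_keys=False))
print(yaml.safe_dump(three_d, sort_keys=False))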

Also, regarding having the output request in an external file - I agree that it makes the output request more opaque in one sense, but the files Aaron is grabbing from are themselves from a git repo so there is provenance. You could copy/paste all of the text from all of the output yaml files into the run script... but that would be a lot of text.

ndkeen commented 1 year ago

Right, I thought that might have worked for this case. I've already cancelled the job and submitted a new one with the output the way it would be in Aaron's script.

But as you mentioned, those were for ne1024 and we may need changes for ne256, which is one reason I thought we could just get the right variables in place and then work on the frequency.

I had thought the goal right now was to: a) verify ne256 runs for a while with this new change from Peter B, b) write at least T_min to see how low T reaches, and c) anything else?

"but that would be a lot of text" Yes it is. We could probably reduce it if I knew more about how these YAML files are being parsed.

PeterCaldwell commented 1 year ago

Ok great. Yeah, I think the key thing is testing the Peter B change. Getting the minT will also be super helpful. Note that it comes solely from the scream_output.Temp_2m_min.yaml file, so you could just focus on that. I think switching to the standard output request we're using for everything else adds additional value, but isn't that important.

We probably do want to get the time frequency right for this in order for the minT output to make sense. Do you know what the timestep is for ne256? It should be printed somewhere in the ne256 runs you've already done. This is set by the ATM_NCPL xml variable. This stupid variable gives the # of timesteps per day, so we need to convert that to seconds.

ndkeen commented 1 year ago

ATM_NCPL: 144

The new case with more output is running now. /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se43-nov18/f256.F2010-SCREAMv1.ne256pg2_r0125_oRRS18to6v3.se43-nov18.gnugpu.3m.n096a4xX.ybo.s8

This run crashed as before with cold T. I forgot to change the Ck variable in the SHOC source.

PeterCaldwell commented 1 year ago

Ok, so 10 min dt. So maybe we should be writing 2d output every 20 min (Frequency: 2 and Max Snapshots Per File: 72) and the 3d output should still be every 3 hrs (Frequency: 3, frequency_units: nhours, Max Snapshots Per File: 8). It's probably ok to let the existing run continue even though it has weird frequency.
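A quick arithmetic check of those numbers, using ATM_NCPL = 144 from the previous comment (just a sketch of the conversion):

ATM_NCPL = 144                 # timesteps per day
dt = 86400 // ATM_NCPL         # -> 600 s, i.e. the 10 min dt above
print(2 * dt / 60)             # Frequency: 2 -> a 2d snapshot every 20 min
print(72 * 2 * dt / 3600.0)    # 72 snapshots -> 24 h of 2d output per file
print(8 * 3)                   # 3 nhours x 8 snapshots -> 24 h of 3d per file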

ndkeen commented 1 year ago


I can add some columns so that we can see frequency/snaps for ne1024/ne256 (for example) to make it easier to think about. Not sure if GitHub will let me upload an XLS file, but I can add it to a repo (like scream-docs).

Also, I see that we have T_2m as both a Min and a Max going to the same output file, output.scream.Temp_2m (but described in 2 different yaml files: scream_output.Temp_2m_min.yaml and scream_output.Temp_2m_max.yaml).

I will wait to resubmit until we sort out what output we want for this ~3-month ne256 test with Ck=1.0.

@mt5555 you were wanting to see output for min values of T. Perhaps I could just add output for that? Was there anything else we wanted?

PeterCaldwell commented 1 year ago

Yeah, I think the choice of max, min, instant, or ave needs to be made on a per-file basis, so you can't have minT and maxT in one file... If you did, you'd have 2 variables with the same name and different meanings!

ndkeen commented 1 year ago

Whoops, I see what happens. So even though we see:

scream-docs/v1_output:

login33% grep Case scream_output.Temp_2m_m*
scream_output.Temp_2m_max.yaml:Casename: output.scream.Temp_2m
scream_output.Temp_2m_min.yaml:Casename: output.scream.Temp_2m

we also tack on an additional string to each filename, so they end up looking like this:

-rw-rw-r--  1 ndk ndk     6303412 Nov 18 10:28 output.scream.Temp_2m.MIN.ndays_x1.0001-01-09-00000.nc
-rw-rw-r--  1 ndk ndk     6303412 Nov 18 10:28 output.scream.Temp_2m.MAX.ndays_x1.0001-01-09-00000.nc

i.e., the MIN and MAX strings make these 2 different files -- as we want.
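For reference, a minimal sketch of the filename pattern observed above (it just mimics the names on disk; it is not the actual EAMxx/scorpio naming code):

def output_filename(casename, avg_type, freq_tag, timestamp):
    # Mimics the observed pattern; MIN and MAX streams that share a
    # Casename therefore still land in distinct files.
    return "%s.%s.%s.%s.nc" % (casename, avg_type.upper(), freq_tag, timestamp)

assert output_filename("output.scream.Temp_2m", "min", "ndays_x1",
                       "0001-01-09-00000") \
    == "output.scream.Temp_2m.MIN.ndays_x1.0001-01-09-00000.nc"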

Note: I still don't have another ne256 job in the queue.

PeterCaldwell commented 1 year ago

Ok, phew. So we're in good shape then, right?

ndkeen commented 1 year ago

While trying to better organize the output settings, I noticed that potential issue, but alas, it was not actually an issue and should be fine in the current ne1024 runs. However, I still don't know how best to set the ne256 output settings for what we want to learn in this GH issue.

mt5555 commented 1 year ago

Regarding the min T output I was suggesting: the Temp_2m.MIN file above has everything we need to see if we get a cold T issue, but one has to parse the netcdf output file to determine this. For a quick check while the model is running, you can set statefreq to output about once every 3 hours and look at "TBOT" in the homme.log file. (It's instantaneous output, not a cumulative minimum.)

Assuming the default ne256 dycore timestep (200/6 seconds), statefreq=324. In the C++ code, statefreq is expensive (I think it costs about 100 timesteps), so it's probably better to skip this.
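Spelling out where 324 comes from (a quick arithmetic sketch):

dt_dyn = 200.0 / 6.0                 # default ne256 dycore step [s], per above
statefreq = round(3 * 3600 / dt_dyn)
print(statefreq)                     # -> 324, i.e. one print every ~3 hours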

ndkeen commented 1 year ago

Using the Nov 24th scream repo, I tried to run ne256 again on pm-gpu and see a "cold T" error. This was using Ck=1.0, which we were hoping might help -- and it did appear to help with the case documented above. But this case fails.

--- a/components/scream/src/physics/shoc/impl/shoc_compute_shr_prod_impl.hpp
+++ b/components/scream/src/physics/shoc/impl/shoc_compute_shr_prod_impl.hpp
@@ -24,7 +24,8 @@ void Functions<S,D>
   const uview_1d<Spack>&       sterm)
 {
   // Turbulent coefficient
-  const Scalar Ck_sh = 0.1;
+  //ndk const Scalar Ck_sh = 0.1;
+  const Scalar Ck_sh = 1.0;

The case ran 17 days before hitting:

 60: terminate called after throwing an instance of 'std::logic_error'
 60:   what():  /global/cfs/cdirs/e3sm/ndk/repos/se46-nov24-scorpiomaster/components/eamxx/src/share/atm_process/atmosphere_process.cpp:256: FAIL:
 60: false
 60: Error! Failed post-condition property check (cannot be repaired).
 60:   - Atmosphere process name: Dynamics
 60:   - Property check name: T_mid within interval [130, 500]
 60:   - Atmosphere process MPI Rank: 60
 60:   - Message: Check failed.
 60:   - check name: T_mid within interval [130, 500]
 60:   - field id: T_mid[Physics PG2] <double:COL,LEV>(8192,128) [K]
 60:   - minimum:
 60:     - value: 127.747
 60:     - entry: (478766,127)
 60:     - lat/lon: (29.2041, 94.1309)
 60:   - maximum:
 60:     - value: 310.784
 60:     - entry: (459433,127)
 60:     - lat/lon: (21.9056, 104.854)
 60:
/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se46-nov24/f256.F2010-SCREAMv1.ne256pg2_r0125_oRRS18to6v3.se46-nov24.gnugpu.1m.n048a4xX.wr.s8
ndkeen commented 1 year ago

I'm running ne256 cases again on pm-gpu. These are with the Dec 5th repo and no source changes (Ck=0.1, and the repo includes the change to allow lower T's). I'm working with @crterai to write output that helps find T issues. I have run up to model date = 00010126. These runs use 48 pm-gpu nodes and I see about 0.65 SYPD. When I tried 96 nodes, it was not much better, so I will look into why that's the case.

I did have a couple of jobs hang for no apparent reason (i.e., not writing output), and they were not at the same location. Subsequent jobs have been OK.

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se47-dec5/f256.F2010-SCREAMv1.ne256pg2_r0125_oRRS18to6v3.se47-dec5.gnugpu.14d.n048a4xX.minToutputb

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se47-dec5/f256.F2010-SCREAMv1.ne256pg2_r0125_oRRS18to6v3.se47-dec5.gnugpu.10d.n048a4xX.minToutput
ndkeen commented 1 year ago

Plot from @crterai of those last runs: [image]

whannah1 commented 1 year ago

@ndkeen @crterai Why is T_2m consistently hovering at -100C? Is that plot showing the global minimum for each variable?

oksanaguba commented 1 year ago

Not sure which plot Walter meant, but 100 is set as the min value in the limiter in the dycore.

mt5555 commented 1 year ago

the more or less stationary T_2m of 170K = -100C appears to be due to a stationary feature in TS, maybe coming from a mapping file issue: https://acme-climate.atlassian.net/wiki/spaces/NGDNA/pages/3597565953/2022-12-12+SCREAM+Eval+Meeting+notes

because of this glitch, min(T_2m) is not that useful; good thing we also have T_mid@bot.

whannah1 commented 1 year ago

ok, I kinda figured it was a bogus value.

ndkeen commented 1 year ago

With the Dec 26th scream repo, I tried the same case again, running for 1 month at a time with restarts. I had two cases: one normal, and one that does not use MPICH on the GPU (as in https://github.com/E3SM-Project/scream/pull/2102). Both cases ran for almost 4 months and failed at model date = 00010430 with the error below.

@PeterCaldwell noted that this is NOT the "cold T" error originally reported.

143: terminate called after throwing an instance of 'std::logic_error'
143:   what():  /global/cfs/cdirs/e3sm/ndk/repos/se48-dec26/components/eamxx/src/share/atm_process/atmosphere_process.cpp:313: FAIL:
143: false
143: Error! Failed post-condition property check (cannot be repaired).
143:   - Atmosphere process name: SurfaceCouplingImporter
143:   - Property check name: NaN check for field surf_radiative_T
143:   - Atmosphere process MPI Rank: 143
143:   - Message: FieldNaNCheck failed.
143:   - field id: surf_radiative_T[Physics PG2] <double:COL>(8192) [K]
143:   - entry (1207596)
143:   - lat/lon: (-69.192895, 296.901490)
143:
143:  *************************** INPUT FIELDS ******************************
143:
143:   ------- INPUT FIELDS -------
143:
143:  ************************** OUTPUT FIELDS ******************************
143:      T_2m<COL>(8192)
143:
143:   T_2m(6955)
143:     132.811,
143:  -----------------------------------------------------------------------
143:      qv_2m<COL>(8192)
143:
143:   qv_2m(6955)
143:     7.00931e-06,
143:  -----------------------------------------------------------------------
143:      sfc_alb_dif_nir<COL>(8192)
143:
143:   sfc_alb_dif_nir(6955)
143:     1,
143:  -----------------------------------------------------------------------
143:      sfc_alb_dif_vis<COL>(8192)
143:
143:   sfc_alb_dif_vis(6955)
143:     1,
143:  -----------------------------------------------------------------------
143:      sfc_alb_dir_nir<COL>(8192)
143:
143:   sfc_alb_dir_nir(6955)
143:     1,
143:  -----------------------------------------------------------------------
143:      sfc_alb_dir_vis<COL>(8192)
143:
143:   sfc_alb_dir_vis(6955)
143:     1,
143:  -----------------------------------------------------------------------
143:      snow_depth_land<COL>(8192)
143:
143:   snow_depth_land(6955)
...
/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se48-dec26/f256.F2010-SCREAMv1.ne256pg2_r0125_oRRS18to6v3.se48-dec26.gnugpu.n048a4xX.hangtry
crterai commented 1 year ago

Noel ran with more surface-related output, and the surface radiative temperature shows the same cold T, which suggests that the source of the stationary cold temperatures in T_2m is likely the surface.

[image]
crterai commented 1 year ago

I expected the SHF (sensible heat flux) in these spots to be very negative, since the surface T is cold but the air above (assuming it gets replenished) is warmer. But it doesn't show this.

[image]

Note: the last two plots are from 24 hrs into the simulation.

whannah1 commented 1 year ago

@ndkeen did you output vertically resolved variables so that I could make some height-vs-time plots like I did for this previous case in the page below? https://acme-climate.atlassian.net/wiki/spaces/NGDNA/pages/3547136090/Case+Study+-+Cold+Temperature+Event

Those plots showed a curious region of liquid cloud despite temperatures being too cold to support sustained liquid water. The cold event occurred just as this liquid water reached the ground, making me think that the root of the problem lies in the microphysics. So similar height-vs-time plots of T, qv, ql, and qi for this ne256 case would be nice to look at, if we have the data.

ndkeen commented 1 year ago

For the longer running cases, we were only outputting vars to look at T min. For the plots above made by Chris T, we added more output, but only ran for 2 days. If you have a set of output vars in mind, I can easily run a case for N days.

Averaging Type: Instant
Casename: \${CASE}.scream.surf_output
Max Snapshots Per File: 144
Fields:
  Physics PG2:
    Field Names:
    - T_2m
    - T_mid@bot
    - surf_evap
    - qv_2m
    - surf_sens_flux
    - surf_radiative_T
output_control:
  Frequency: 1
  frequency_units: nsteps
  MPI Ranks in Filename: false
whannah1 commented 1 year ago

@ndkeen Ideally I want vertically resolved temperature and all water species at every time step for a ~100 km region around where the event happens; is that doable?

crterai commented 1 year ago

@whannah1 I was maybe unclear about which cold T problem I was referring to when I posted the plots. The dotted cold temperatures in the surface radiative T plots that I posted are what Mark mentioned earlier about stationary cold temperature features, and they are unlikely to be the same cold T problem we intermittently see in our ne1024 simulation. We suspect that the stationary cold T is due to bad mapping with the tri-grid configuration. If you're interested in analyzing a cold T case that's analogous to what we see at ne1024, I think we'd need a longer run with instantaneous T_mid@bot output to see when the events occur. Then we can write a restart sometime before such an episode and rerun with the 3D variables you're interested in.
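As a sketch of the first step of that workflow (the file glob, variable name, and dimension name below are guesses; adjust them to the actual output):

import xarray as xr

# Scan the instantaneous output for cold-T episodes, so we know which
# restart to branch from for a rerun with the 3D variables enabled.
ds = xr.open_mfdataset("*.scream.hi*.nc", combine="by_coords")
tbot = ds["T_mid_at_bot"]                   # guessed name for T_mid@bot
tmin = tbot.min(dim="ncol")                 # global-minimum time series
cold = tmin.where(tmin < 200.0, drop=True)  # illustrative cold-T threshold
print(cold["time"].values)                  # episodes to bracket with restarts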

whannah1 commented 1 year ago

@crterai thanks for the clarification, I was definitely conflating the two problems. It still might be worth verifying that the cold anomaly starts at the surface with some vertically resolved data just in case the stationarity is coincidental.

ambrad commented 1 year ago

For future reference, the full error output in /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se48-dec26/f256.F2010-SCREAMv1.ne256pg2_r0125_oRRS18to6v3.se48-dec26.gnugpu.n048a4xX.hangtry/run/e3sm.log.4077303.221227-171011 is

143: terminate called after throwing an instance of 'std::logic_error'
143:   what():  /global/cfs/cdirs/e3sm/ndk/repos/se48-dec26/components/eamxx/src/share/atm_process/atmosphere_process.cpp:313: FAIL:
143: false
143: Error! Failed post-condition property check (cannot be repaired).
143:   - Atmosphere process name: SurfaceCouplingImporter
143:   - Property check name: NaN check for field surf_radiative_T
143:   - Atmosphere process MPI Rank: 143
143:   - Message: FieldNaNCheck failed.
143:   - field id: surf_radiative_T[Physics PG2] <double:COL>(8192) [K]
143:   - entry (1207596)
143:   - lat/lon: (-69.192895, 296.901490)
143:
143:  *************************** INPUT FIELDS ******************************
143:
143:   ------- INPUT FIELDS -------
143:
143:  ************************** OUTPUT FIELDS ******************************
143:      T_2m<COL>(8192)
143:
143:   T_2m(6955)
143:     132.811,
143:  -----------------------------------------------------------------------
143:      qv_2m<COL>(8192)
143:
143:   qv_2m(6955)
143:     7.00931e-06,
143:  -----------------------------------------------------------------------
143:      sfc_alb_dif_nir<COL>(8192)
143:
143:   sfc_alb_dif_nir(6955)
143:     1,
143:  -----------------------------------------------------------------------
143:      sfc_alb_dif_vis<COL>(8192)
143:
143:   sfc_alb_dif_vis(6955)
143:     1,
143:  -----------------------------------------------------------------------
143:      sfc_alb_dir_nir<COL>(8192)
143:
143:   sfc_alb_dir_nir(6955)
143:     1,
143:  -----------------------------------------------------------------------
143:      sfc_alb_dir_vis<COL>(8192)
143:
143:   sfc_alb_dir_vis(6955)
143:     1,
143:  -----------------------------------------------------------------------
143:      snow_depth_land<COL>(8192)
143:
143:   snow_depth_land(6955)
143:     0.96857,
143:  -----------------------------------------------------------------------
143:      surf_evap<COL>(8192)
143:
143:   surf_evap(6955)
143:     -4.37292e-06,
143:  -----------------------------------------------------------------------
143:      surf_lw_flux_up<COL>(8192)
143:
143:   surf_lw_flux_up(6955)
143:     25.1064,
143:  -----------------------------------------------------------------------
143:      surf_mom_flux<COL,CMP>(8192,2)
143:
143:   surf_mom_flux(6955,:)
143:     0.122333, -0.299795,
143:  -----------------------------------------------------------------------
143:      surf_radiative_T<COL>(8192)
143:
143:   surf_radiative_T(6955)
143:     -nan,
143:  -----------------------------------------------------------------------
143:      surf_sens_flux<COL>(8192)
143:
143:   surf_sens_flux(6955)
143:     610.756,
143:  -----------------------------------------------------------------------
143:      wind_speed_10m<COL>(8192)
143:
143:   wind_speed_10m(6955)
143:     8.01184,
143:  -----------------------------------------------------------------------
143:
143:
143: Program received signal SIGABRT: Process abort signal.
ndkeen commented 1 year ago

Using ne256pg2_ne256pg2, which was recently added to the scream repo, I was able to run for 3 months. So, similar to ne120, we don't see the same issue with these "monogrids". OK to close the issue?

Case: /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/badd120/f256.F2010-SCREAMv1.ne256pg2_ne256pg2.badd120.gnugpu.3m.n096a4xX.so288n8

mt5555 commented 1 year ago

So the new monogrid results provide more evidence that the speckling seen above is a tri-grid mapping-file issue.

But in the ne256pg2_ne256pg2 case, do you see any evidence of the other cold T problem (isolated transient spikes of T < 200K)?

crterai commented 1 year ago

Here's the plot of global minimum T_mid@bot (bottom-level temperature) as a function of time. The x-axis says Days, but it is actually steps. As discussed above, the simulation runs 3 months.

[image]
PeterCaldwell commented 1 year ago

Great plot, Chris. It is reassuring(?) to see that ne256 still has occasional cold T events.

crterai commented 1 year ago

It is reassuring that we still get occasional cold T in ne256, since it opens up the ability to analyze/debug cold T at ne256 rather than at ne1024. The following figure of global minimum surface radiative T supports what Mark already said above about the speckly cold T going away with ne256pg2_ne256pg2 (the minimum T isn't persistently below 200K). The cold spikes in surface radiative T coincide with the cold Ts in T_mid@bot. We'll need to look at the sensible heat flux in these cases to see which field drives the other colder, but I expect it's the atmosphere driving the change in surface temperature.

[image]