E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

"Cold T" error with ne1024 DYAMOND1 on pm-gpu using Ck_sh=1.0 #2063

Open ndkeen opened 1 year ago

ndkeen commented 1 year ago

We already have issues for the "cold T" error at ne120 and ne256, but this one is at ne1024 (these should perhaps all be a single issue). https://github.com/E3SM-Project/scream/issues/2061 https://github.com/E3SM-Project/scream/issues/2029

We were hoping Ck_sh=1.0 might help with this problem, but the error still appears to happen, at least with the DYAMOND1 compset. This case failed after 3 days:

1145: terminate called after throwing an instance of 'std::logic_error'
1145:   what():  /global/cfs/cdirs/e3sm/ndk/repos/se46-nov24-scorpiomaster/components/eamxx/src/share/atm_process/atmosphere_process.cpp:256: FAIL:
1145: false
1145: Error! Failed post-condition property check (cannot be repaired).
1145:   - Atmosphere process name: Dynamics
1145:   - Property check name: T_mid within interval [130, 500]
1145:   - Atmosphere process MPI Rank: 1145
1145:   - Message: Check failed.
1145:   - check name: T_mid within interval [130, 500]
1145:   - field id: T_mid[Physics PG2] <double:COL,LEV>(16384,128) [K]
1145:   - minimum:
1145:     - value: 126.687
1145:     - entry: (19317940,127)
1145:     - lat/lon: (-69.4069, 296.625)
1145:   - maximum:
1145:     - value: 275.708
1145:     - entry: (19219588,127)
1145:     - lat/lon: (-69.2069, 290.139)

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se46-nov24/f1024.F2010-SCREAMv1-DYAMOND1.ne1024pg2_ne1024pg2.se46-nov24.gnugpu.12d.n0384a4xX.prod.ybo.s32.ck10
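
For readers unfamiliar with this kind of failure, the post-condition in the log above amounts to a bounds check on a field that also reports where the extrema occur. The following is a minimal standalone sketch of that idea, not the EAMxx implementation: the names `IntervalCheckResult` and `check_within_interval` are hypothetical, and the field is assumed to be a flat, non-empty array of doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result record for illustration; not an EAMxx type.
struct IntervalCheckResult {
  bool pass;              // true if every value lies in [lower, upper]
  double min_value;       // smallest value found
  std::size_t min_index;  // flat index of the smallest value
  double max_value;       // largest value found
  std::size_t max_index;  // flat index of the largest value
};

// Scan a non-empty field, record its extrema and their locations,
// and report whether all values fall inside [lower, upper].
IntervalCheckResult check_within_interval(const std::vector<double>& field,
                                          double lower, double upper) {
  assert(!field.empty() && lower <= upper);
  IntervalCheckResult r{true, field[0], 0, field[0], 0};
  for (std::size_t i = 1; i < field.size(); ++i) {
    if (field[i] < r.min_value) {
      r.min_value = field[i];
      r.min_index = i;
    }
    if (field[i] > r.max_value) {
      r.max_value = field[i];
      r.max_index = i;
    }
  }
  r.pass = (r.min_value >= lower) && (r.max_value <= upper);
  return r;
}
```

Called with the bounds from the log, `check_within_interval(t_mid, 130.0, 500.0)` would fail for any field containing the reported minimum of 126.687 K, and the returned index identifies the offending entry.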

The Ck_sh=1.0 change is:
--- a/components/scream/src/physics/shoc/impl/shoc_compute_shr_prod_impl.hpp
+++ b/components/scream/src/physics/shoc/impl/shoc_compute_shr_prod_impl.hpp
@@ -24,7 +24,8 @@ void Functions<S,D>
   const uview_1d<Spack>&       sterm)
 {
   // Turbulent coefficient
-  const Scalar Ck_sh = 0.1;
+  //ndk const Scalar Ck_sh = 0.1;
+  const Scalar Ck_sh = 1.0;

Note that a similar case using the F2010-SCREAMv1 compset (i.e., not DYAMOND1) does run 12 days without error. /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se46-nov24/f1024.F2010-SCREAMv1.ne1024pg2_ne1024pg2.se46-nov24.gnugpu.12d.n0384a4xX.prod.ybo.s32.ck10

ndkeen commented 1 year ago

As noted in a different issue, I am having trouble writing restarts at ne1024 in general. However, using 512 pm-gpu nodes, some of the restarts work. I tried writing a restart the day before the crash and it worked. Reading from that restart, I see the same "cold T" error at

Atmosphere step = 3478
  model time = 2016-08-03 00:36:40

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se46-nov24/f1024.F2010-SCREAMv1-DYAMOND1.ne1024pg2_ne1024pg2.se46-nov24.gnugpu.1d.n0512a4xX.prod.ybo.s32.ck10.wr