Open ndkeen opened 1 year ago
As noted in a different issue, I am having trouble writing restarts at ne1024 in general, but with 512 pm-gpu nodes some of the restarts work. So I tried writing a restart the day before the crash, and it worked. I can restart from it and see the same "cold T" error at
Atmosphere step = 3478
model time = 2016-08-03 00:36:40
/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se46-nov24/f1024.F2010-SCREAMv1-DYAMOND1.ne1024pg2_ne1024pg2.se46-nov24.gnugpu.1d.n0512a4xX.prod.ybo.s32.ck10.wr
We already have issues for the "cold T" error at ne120 and ne256, but this one is at ne1024 (maybe these should all be one issue). https://github.com/E3SM-Project/scream/issues/2061 https://github.com/E3SM-Project/scream/issues/2029
We were hoping CK_sh=1.0 might help with this problem, but the error appears to still happen, at least with the DYAMOND1 compset. This case failed after 3 days. Note that a similar case using the F2010-SCREAMv1 compset (i.e., not DYAMOND1) does run 12 days without error:
/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se46-nov24/f1024.F2010-SCREAMv1.ne1024pg2_ne1024pg2.se46-nov24.gnugpu.12d.n0384a4xX.prod.ybo.s32.ck10
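
To see at a glance what differs between the two case names quoted in this issue, here is a minimal sketch that splits each name on "." and lists the fields unique to each. The helper `diff_case_names` is hypothetical (not part of E3SM or CIME), and it makes no assumption about what each field means; it only compares the strings.

```python
def diff_case_names(a, b):
    """Return the dot-separated fields unique to each case name, in order.

    Hypothetical helper for illustration only; it does not interpret the fields.
    """
    fields_a, fields_b = a.split("."), b.split(".")
    only_a = [f for f in fields_a if f not in fields_b]
    only_b = [f for f in fields_b if f not in fields_a]
    return only_a, only_b


# Case names copied from the two paths quoted in this issue.
dyamond_case = ("f1024.F2010-SCREAMv1-DYAMOND1.ne1024pg2_ne1024pg2."
                "se46-nov24.gnugpu.1d.n0512a4xX.prod.ybo.s32.ck10.wr")
control_case = ("f1024.F2010-SCREAMv1.ne1024pg2_ne1024pg2."
                "se46-nov24.gnugpu.12d.n0384a4xX.prod.ybo.s32.ck10")

only_dyamond, only_control = diff_case_names(dyamond_case, control_case)
print("only in DYAMOND1 case:", only_dyamond)
print("only in control case: ", only_control)
```

Everything else in the two names (resolution, branch tag, compiler, and the trailing prod.ybo.s32.ck10 fields) is identical, so the compset is not the only difference between these particular runs.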