Open ndkeen opened 1 year ago
I remember that some of our issues have stemmed from too long a SHOC timestep. I see that in the ne256 setup, our dtime = 600 s and our macmic is 3, which sets the SHOC timestep to 200 s. @bogensch - can you remind us how short the SHOC timestep needs to be?
I would advocate making sure that the SHOC time step is never greater than 150 s in any resolution configuration, since that seems to work well for long ne30 integrations. While 200 s may be okay, I'm not 100% sure about that, since we've never tested it in a long simulation.
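For reference, the arithmetic behind these numbers can be sketched as below. This is only an illustration, assuming the SHOC timestep is simply dtime divided by the number of mac_aero_mic subcycles, as described in this thread; the function names are hypothetical and are not part of SCREAM.

```python
import math

def shoc_timestep(dtime_s, num_subcycles):
    """SHOC timestep in seconds, assuming dtime is split evenly across subcycles."""
    return dtime_s / num_subcycles

def min_subcycles(dtime_s, max_shoc_dt_s):
    """Smallest subcycle count that keeps the SHOC timestep at or below the limit."""
    return math.ceil(dtime_s / max_shoc_dt_s)

# Current ne256 setup discussed above: dtime = 600 s, macmic = 3
print(shoc_timestep(600, 3))    # 200.0 s, above the suggested 150 s limit

# Subcycles needed to stay at or below 150 s with dtime = 600 s
print(min_subcycles(600, 150))  # 4, which gives a 150 s SHOC timestep
```

Under that assumption, keeping dtime = 600 s would need 4 subcycles rather than 3 to stay within the 150 s suggestion.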
Note that we ran 1 year with ne256 on pm-gpu (both control and plus 4K). That was a March 27th checkout. I don't think we've changed the macmic in a while.
/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se62-mar27/t.se62-mar27.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.L128.vth200
/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se62-mar27/t.se62-mar27.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.L128.vth200.plus4k
I verified that in this current ne256 case, as well as in those noted above, macmic is the same, at 3.
login18% ./atmquery atmosphere_processes::physics::mac_aero_mic::number_of_subcycles
mac_aero_mic::number_of_subcycles: 3
@mahf708 says he just ran 2 months of ne256 on pm-gpu. Though not Cess-style, I assume it was a recent scream repo.
I keep the hash in my case names 😉 6bb3639
Vanilla F2010-SCREAMv1 with light IO (7 or so 3-hourly 2D variables). I don't recall anything special except 20230522.I2010CRUELM.ne256pg2.elm.r.2013-08-01-00000.nc for land. I mention land because I ran into a lot of trouble with land stuff in EAMf90 ne120pg2 in the past, with "mysterious" errors like the above...
On frontier, using tcclevenger/simulations/cess-production-cherry-pick-merges (with team barrier), I'm able to run ne256 longer. Currently at model date = 20191113
/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/tcessr-sep15/cess-v1-cntl.ne256pg2_ne256pg2.F2010-SCREAMv1.tcessr-sep15.n0096t8x111661.tb.noSC.newy
With scream master of Sep 12th, I see an error with a ne256 Cess-like test on pm-gpu. I have already reproduced the error with a different case, and it fails in the same way (the hashes are also the same). The failure occurs after model date = 20190929. I ran 1 month, then restarted, and the failure occurs in the restarted run. Another issue here is that the job hangs after the error.