E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

Negative (or nan) layer thickness detected with ne256 Cess test on pm-gpu #2543

Open ndkeen opened 1 year ago

ndkeen commented 1 year ago

With scream master of Sep 12th, I see an error with a ne256 Cess-like test on pm-gpu. I have already reproduced the error with a different case, which fails in the same way (the hashes are also the same). The failure occurs after model date = 20190929: I ran 1 month, then restarted, and it fails in the restarted run.

170: WARNING:CAAR: dp3d too small. k=128, dp3d(k)=35.733449, dp0=300.714111 
170: Negative (or nan) layer thickness detected, aborting!
170: Exiting...
170: MPICH ERROR [Rank 170] [job id 15852337.0] [Tue Sep 19 11:07:37 2023] [nid003488] - Abort(101) (rank 170 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 101) - process 170
170: 
170: aborting job:
170: application called MPI_Abort(MPI_COMM_WORLD, 101) - process 170
170: Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
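For context, the abort above is a layer-thickness sanity check tripping: a warning when `dp3d` falls far below the reference `dp0`, then a hard abort when a layer thickness goes negative or NaN. A minimal Python sketch of that kind of check is below; the function name and the warning threshold are hypothetical and only illustrate the diagnostic in the log, not SCREAM's actual CAAR code:

```python
import math

def check_layer_thickness(dp3d, dp0, thin_frac=0.125):
    """Illustrative sanity check mirroring the CAAR diagnostics in the log:
    warn when a layer's dp3d is well below the reference dp0, and abort
    when the thickness is negative or NaN. thin_frac is a made-up threshold."""
    for k, (dp, ref) in enumerate(zip(dp3d, dp0)):
        if math.isnan(dp) or dp <= 0.0:
            raise RuntimeError("Negative (or nan) layer thickness detected, aborting!")
        if dp < thin_frac * ref:
            print(f"WARNING:CAAR: dp3d too small. k={k}, dp3d(k)={dp:.6f}, dp0={ref:.6f}")

# The values from the log (dp3d = 35.73 vs dp0 = 300.71) would trigger the
# warning path; a negative or NaN value would trigger the abort.
check_layer_thickness([35.733449], [300.714111])
```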

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se75-sep12/cess-v1-cntl.ne256pg2_ne256pg2.F2010-SCREAMv1.se75-sep12.n0048t4x111XX1.tb.nofru.long

Another issue here is that the job hangs after the error instead of exiting cleanly.

crterai commented 1 year ago

I remember that some of our issues have stemmed from having too long a SHOC timestep. I see that in the ne256 setup our dtime = 600 sec and our macmic is 3, which sets the SHOC timestep to 200 sec. @bogensch - can you remind us how long the SHOC timestep is allowed to be?

bogensch commented 1 year ago

I would advocate making sure that the SHOC time step is never greater than 150 s at any resolution, since that seems to work well for long ne30 integrations. While 200 s may be okay, I'm not 100% sure about that, since we've never tested it in a long simulation.
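The arithmetic behind the 200 s figure: the SHOC timestep is the model timestep divided by the macmic subcycle count. A short sketch (function names are mine, not SCREAM's) showing the ne256 numbers from this thread and the macmic needed to stay at or below the 150 s recommendation:

```python
import math

def shoc_dt(dtime, macmic):
    """SHOC timestep when mac_aero_mic is subcycled macmic times per dtime."""
    return dtime / macmic

def min_macmic(dtime, max_shoc_dt=150.0):
    """Smallest macmic subcycle count keeping the SHOC timestep <= max_shoc_dt."""
    return math.ceil(dtime / max_shoc_dt)

# ne256 setup from this thread: dtime = 600 s, macmic = 3
print(shoc_dt(600, 3))    # 200.0 s, above the 150 s recommendation
print(min_macmic(600))    # 4, which gives 600 / 4 = 150 s
```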

ndkeen commented 1 year ago

Note that we ran 1 year with ne256 on pm-gpu (both control and plus 4K) from a March 27th checkout. I don't think we've changed macmic in a while.

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se62-mar27/t.se62-mar27.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.L128.vth200
/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se62-mar27/t.se62-mar27.F2010-SCREAMv1.ne256pg2_ne256pg2.pm-gpu.n096t4xX.L128.vth200.plus4k

I verified that macmic is the same (3) in this current ne256 case as well as in those noted above.

login18%  ./atmquery atmosphere_processes::physics::mac_aero_mic::number_of_subcycles
    mac_aero_mic::number_of_subcycles: 3

@mahf708 says he just ran 2 months of ne256 on pm-gpu. Though not Cess-style, I assume it was with a recent scream repo.

mahf708 commented 1 year ago

I keep the hash in my case names 😉 6bb3639

Vanilla F2010-SCREAMv1 with light IO (7 or so 3-hourly 2D variables). I don't recall anything special except using 20230522.I2010CRUELM.ne256pg2.elm.r.2013-08-01-00000.nc for land. I mention land because I ran into a lot of trouble with land inputs in EAMf90 ne120pg2 in the past, with "mysterious" errors like the one above...

ndkeen commented 1 year ago

On Frontier, using tcclevenger/simulations/cess-production-cherry-pick-merges (with the team barrier), I'm able to run ne256 longer. Currently at model date = 20191113.

/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/tcessr-sep15/cess-v1-cntl.ne256pg2_ne256pg2.F2010-SCREAMv1.tcessr-sep15.n0096t8x111661.tb.noSC.newy