After https://github.com/E3SM-Project/scream/pull/1500 (which is now in scream master), I confirmed that I can now run 2 steps WITH SPA (cpu-only) and that the output format is now cdf5.
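As a side note, a quick way to double-check the format from Python (a sketch, not something from the run itself; the path below is a placeholder) is to look at the file's data model, which netCDF4 reports as 'NETCDF3_64BIT_DATA' for cdf5 files:

```python
# Placeholder path; any of the case's history or init files would do.
from netCDF4 import Dataset

with Dataset("screami_output.nc") as ds:
    # cdf5 (CDF-5 / 64-bit data) shows up as 'NETCDF3_64BIT_DATA'
    print(ds.data_model)
```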
More good news -- using the March 23 master that includes the upstream merge and the above-mentioned PR 1500 -- I am able to run 2 steps with ne30 using GPUs. I first tried 1 node with 4 MPIs/GPUs, which also fails with the pool error, but when I try 4 PM nodes with 16 MPIs/GPUs, it completes. This test was built DEBUG and had output off, so it's possible that the earlier GPU attempts just needed more memory.
However, running beyond 2 steps (in fact, on the very next step) fails with:
10: Negative (or nan) layer thickness detected, aborting!
I also tried ne30 cpu-only with output and it seems OK for 2 steps.
With a CPU-only, OPT build, I see a new error:
15: array_io_read failed with: /pscratch/sd/n/ndk/wacmy/s17-mar23/externals/ekat/src/ekat/util/ekat_file_utils.hpp:24: FAIL:
15: nread == sz
15: read: nread = 1023 sz = 3000
15: WARNING: SPA Remap File has been set to 'NONE', assuming that SPA data and simulation are on the same grid - skipping horizontal interpolation p3_iso_c::p3_init: One or more table files exists but gave a read error.
0: WARNING: SPA Remap File has been set to 'NONE', assuming that SPA data and simulation are on the same grid - skipping horizontal interpolation p3_iso_c::p3_init: One or more table files exists but gave a read error.
0: array_io_read failed with: /pscratch/sd/n/ndk/wacmy/s17-mar23/externals/ekat/src/ekat/util/ekat_file_utils.hpp:24: FAIL:
0: nread == sz
0: read: nread = 1023 sz = 3000
15: terminate called after throwing an instance of 'std::logic_error'
15: what(): /pscratch/sd/n/ndk/wacmy/s17-mar23/components/scream/src/physics/p3/p3_f90.cpp:122: FAIL:
15: info == 0
15: p3_init_c returned info -1
15:
Note the actual error message looks odd because the SPA warning message is missing a newline (https://github.com/E3SM-Project/scream/issues/1434).
Regarding the CPU-only OPT error: I'm surprised v1 is calling p3_f90.cpp at all - isn't that a bridge function used only for F90/C++ BFB testing? Do you have a stack trace for this, Noel? Does calling p3_f90.cpp make sense to you, @AaronDonahue or @jgfouca? It looks to me like the failure in p3_init is in reading the lookup table file. It would be nice if we logged which file we are reading whenever we read a file (at least in debug mode).
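One quick way to test the truncated-table theory offline (a hedged sketch: the file name is hypothetical, and I'm assuming the table is a flat binary array of 8-byte reals, which is what the element-count check in ekat_file_utils suggests but I haven't confirmed):

```python
# Hypothetical file name; point this at whichever p3 table file the run reads.
# Assumes a flat binary array of float64; treat that layout as an assumption.
import numpy as np

vals = np.fromfile("p3_lookup_table.dat", dtype=np.float64)
print(f"{vals.size} float64 values on disk (the failing read expected sz=3000)")
```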
Regarding the step 3 fail on GPUs: did you ever try running with a shorter dt, Noel?
The only fail in init right now is https://github.com/E3SM-Project/scream/issues/1505 which only happens sometimes.
I did try with se_tstep: 100 and don't see anything different.
I have also tried changing ATM_NCPL without seeing a difference, but since the interplay of these settings is confusing, it would be good to specify which values are worth trying and I can do that.
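For reference (my own note, not from the thread): ATM_NCPL is the number of atmosphere couplings per day, so the implied dtime is 86400 s / ATM_NCPL:

```python
# dtime (s) implied by a given ATM_NCPL (atmosphere couplings per day)
for ncpl in (288, 576, 720):
    print(f"ATM_NCPL={ncpl:3d} -> dtime = {86400 // ncpl} s")
# 288 -> 300 s (the default), 576 -> 150 s, 720 -> 120 s
```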
Update: I just tried ATM_NCPL=720 (which should be a 2-minute dtime) with se_tstep: 60 and was able to complete 4 steps (on PM with GPU). I tried again for longer on both PM/GPU and cori-knl; both machines stop in the same way after the 5th step:
0: Atmosphere step = 5; model time = 0001-01-01 00:10:00
450: ERROR:
450: component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global
450: index: 32761
450: Image PC Routine Line Source
450: e3sm.exe 0000000008B1EABA Unknown Unknown Unknown
450: e3sm.exe 000000000557480A shr_abort_mod_mp_ 114 shr_abort_mod.F90
450: e3sm.exe 0000000005574670 shr_abort_mod_mp_ 61 shr_abort_mod.F90
450: e3sm.exe 000000000048332A component_type_mo 257 component_type_mod.F90
450: e3sm.exe 000000000047B1A8 component_mod_mp_ 754 component_mod.F90
450: e3sm.exe 000000000043A8A7 cime_comp_mod_mp_ 3077 cime_comp_mod.F90
450: e3sm.exe 00000000004628D0 MAIN__ 153 cime_driver.F90
Peter wanted me to try a simple halving of the timestep, so that's ATM_NCPL=576 (dtime=150 seconds) and se_tstep: 150. With this it fails with the negative-layer-thickness error after step 4:
0: Atmosphere step = 4; model time = 0001-01-01 00:10:00
674: Negative (or nan) layer thickness detected, aborting!
674: Exiting...
Great, thanks Noel. So it seems like we can run almost twice as far (in steps) with a timestep that's ~half as long, i.e. the crash happens at about the same model time. That is what we would expect if it were a real physical instability rather than something that always keys off the 3rd timestep or something. Are you saving output every timestep? If not, could you, then point me to the output? You can change the output frequency in run/data/scream_output.yaml: change "Frequency" near the bottom of that file. Also probably want to change "Max Snapshots per Field" to 1 so we are sure to get all the output up to the time it crashes (since netcdf files often don't get flushed until they are closed).
Here is a case where I think I'm writing output every step.
/pscratch/sd/n/ndk/e3sm_scratch/perlmutter/s18-mar23/f30.F2000SCREAMv1.ne30_ne30.s18-mar23.gnugpu.12s.n003a4x8.DEBUG.Hremap512.K0def.WSM.Q10.nospa.nan.N576.ts150.os1
Wow, this is an interesting case. There's definitely a bug in SW radiation. Check out surface SW down at timestep 1 (all timesteps are similar; see the attached plot), which I think is responsible for the "psychedelic haircut" in T_mid (also from step 1). Note this is T_mid at the top of the model; the surface looks normal. Note in particular the "mole" of bright yellow below the right side of the haircut. This mole is ultimately what grows to 132,042 K by step 5 and causes the model to crash. I'm not sure whether the mole is related to the radiation problem or is something different. Do we know whether the sponge layer is active in our CIME runs?
One other thing to note is that LWdn also displays the "haircut" geometry in step 1 and grows to -369,543 W/m2 over the mole by the end of the simulation. I think this is a natural reaction to the ridiculous T_mid in the mole, but just mentioning it.
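For what it's worth, here is a sketch of how one could track the mole column across steps. It assumes the per-step output is xarray-readable and holds T_mid with time/ncol/lev dimensions; the path and dimension names are my guess at the SCREAMv1 output layout, not confirmed:

```python
import xarray as xr

ds = xr.open_dataset("scream_output.nc")   # placeholder path
t_top = ds["T_mid"].isel(lev=0)            # top-of-model level, where the haircut shows up
for step in range(t_top.sizes["time"]):
    snap = t_top.isel(time=step)
    print(f"step {step}: max top-level T_mid = {float(snap.max()):.1f} K "
          f"at ncol = {int(snap.argmax())}")
```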
Overall, I think the haircut is due to rad info not getting passed to the MPI ranks correctly. @ndkeen - could you do the same run with the same output, but with 2x more MPIs? @brhillman - does the haircut ring any bells for you?
Can it be just a different number of MPI's? That was with 12 MPI's -- OK to try 8 MPI's or 16 MPI's?
Yeah, any number of MPIs is fine. Knowing that it's 12 MPIs makes me think that's not the problem. The haircut looks like 4 MPIs misplaced... but we should still try a different MPI count just in case. Now I'm thinking it is something related to the zenith angle or the grid.
Yeah, jobs just get through the queue quicker with 4 or fewer nodes. This should be the same thing but on 8 MPIs instead of 12:
/global/cfs/cdirs/e3sm/ndk/f30.F2000SCREAMv1.ne30_ne30.s18-mar23.gnugpu.12s.n002a4x8.DEBUG.Hremap512.K00def.WSM.Q10.nospa.nan.N576.ts150.os1
The 8 MPI run looks identical to the 12 MPI version (which is good!). So I guess MPIs aren't the problem. I think adding print statements to the code around where we're getting the divide by zero and adding zenith angle(?) to the output is the next step...
To check whether the sponge layer is active in SCREAM v0, look for the "raytau0" and "nu_top" settings in the namelist. I don't think SCREAM v1 has the Rayleigh friction option, so hopefully SCREAM v0 was run with raytau0=0.
nu_top is resolution dependent, but should be 2.5e5 for 1 degree.
More details: https://acme-climate.atlassian.net/wiki/spaces/DOC/pages/2967798203/EAM+Top+of+Model+Sponge+Layer
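For a v0 case that check is just a grep of the run directory's namelist (a trivial sketch; the path is a placeholder, and v1 keeps its settings elsewhere):

```python
# Print any sponge-layer related settings from a v0 run's atm_in namelist.
for line in open("run/atm_in"):
    if "nu_top" in line or "raytau0" in line:
        print(line.rstrip())
```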
@mt5555 - I just checked the run I plotted above (which is a v1 run, not a v0 run) and found nu_top = 250000.0; raytau0 is not included. Is setting nu_top alone sufficient for turning on the sponge? Is 250,000 the appropriate number for ne30? I'll try an ne30 v0 run, but probably won't have time until this afternoon...
@PeterCaldwell, in order to avoid duplication, we left some of the p3 init stuff in Fortran. Now that we are leaving the Fortran behind, I wonder whether it might make sense to move it all over to C++.
Ok, thanks Jim. It would be good to be pure C++, but I don't think we need to do that now. I was just surprised that v1 was calling F90 in P3.
Using -DRRTMGP_EXPENSIVE_CHECKS on builds, I found an issue a little higher up the food chain, which led to Ben fixing an issue in the input file. With the new one, /global/cfs/cdirs/e3sm/bhillma/scream/data/init/screami_ne30np4L72_20220329.nc, we've made more progress with ne30. I'm still trying to see which situations cause crashes, what's slow, and what's non-BFB.
Now that we can kinda say we are running ne30 as a CIME case, maybe we could close this issue and open more specific ones.
Currently, I was able to run over a day with the default dtime=300s, SPA, and some output. However, I'm hitting some fails that happen at seemingly random points (i.e. not at the same step). With OPT builds, there is no useful information about the fail. Trying again with DEBUG, I see the following error, which happened twice at 2 different steps (in this case, steps 62 and 72):
3: FATAL ERROR:
3: gas_optics(): array tsfc has values outside range
which we know is coming from code under the RRTMGP_EXPENSIVE_CHECKS macro.
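A hedged way to hunt for the offending columns offline (the field name and the acceptable range are assumptions: gas_optics() checks tsfc against its k-distribution temperature grid, roughly 160-355 K for the shipped RRTMGP data, and I'm using the lowest model level of T_mid as a stand-in for the surface temperature the check actually sees):

```python
import xarray as xr

ds = xr.open_dataset("scream_output.nc")   # placeholder path
tsfc = ds["T_mid"].isel(lev=-1)            # proxy: lowest model level
bad = (tsfc < 160.0) | (tsfc > 355.0)
for step in range(bad.sizes["time"]):
    n = int(bad.isel(time=step).sum())
    if n:
        print(f"step {step}: {n} columns outside [160, 355] K")
```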
Also, as Conrad pointed out and I confirmed on PM, GPU builds are not BFB after the 2nd step between 2 otherwise-identical runs. This may be OK, as there are known issues with rad not being BFB on GPU. We do expect the CPU-only cases to be BFB between runs, and I've verified that is true (at least by looking at values written to e3sm.log). I tested with and without threads; within each case, 2 runs are BFB with each other.
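Beyond eyeballing e3sm.log, here is a sketch of a direct BFB check between two runs' output files (paths are placeholders; it assumes both files carry the same variables):

```python
import numpy as np
from netCDF4 import Dataset

with Dataset("run1/output.nc") as a, Dataset("run2/output.nc") as b:
    for name, var in a.variables.items():
        if name in b.variables:
            same = np.array_equal(var[:], b.variables[name][:])
            print(f"{name}: {'BFB' if same else 'DIFFERS'}")
```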
Closing this issue as we are beyond this point and some of the steps here are no longer valid.
Using a launch script similar to SMS_D_Ln2_P4x1.ne30_ne30.F2000SCREAMv1.perlmutter_gnu, and following the directions here: https://acme-climate.atlassian.net/wiki/spaces/NGDNA/pages/3330506773/Getting+running+at+higher+resolution#Steps-for-running-at-ne30%3A
One thing I did differently is that I'm using a cdf5-format netcdf file. That is, use:
I tried a case on Perlmutter using a cpu-only build. This uses 1 node, but with 4 MPIs. I also see the same error message using 4 nodes and 4 MPIs (1 MPI per node). The following is from a DEBUG attempt; I will also try without DEBUG.