I have been complaining about this directly to @bishtgautam :) I hit this more than usual because I have been changing PE layouts often. I do think the code should do something besides what it currently does. I understand it's not easy to know whether there will be too many tasks for LND before the job runs, and it may take some work to change how this is handled so that the total LND tasks are reduced when needed. I think it is not just a limit on the LND MPI tasks, but on the total number of threads -- I sometimes hit this when I increase threads (and then have to leave LND threads at 1 as a workaround).
If LND_TASKS * LND_THRDS > the total number of active land grid cells, the land model will stop. The land F90 code cannot handle such a processor layout.
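A quick pre-submission sanity check along these lines is sketched below. It is only an illustration: it assumes CIME's xmlquery is available in the case directory (exact flag spellings can vary by CIME version), and the active-land-cell count is a placeholder that would have to be looked up for the target grid.

```bash
# Illustrative pre-submission check: warn if LND tasks * threads exceeds
# the number of active land grid cells for the chosen grid.
cd $CASEDIR
NTASKS_LND=$(./xmlquery NTASKS_LND --value)
NTHRDS_LND=$(./xmlquery NTHRDS_LND --value)
NLANDCELLS=21600   # placeholder: active land cells for the target grid

if [ $(( NTASKS_LND * NTHRDS_LND )) -gt "$NLANDCELLS" ]; then
    echo "WARNING: NTASKS_LND * NTHRDS_LND = $(( NTASKS_LND * NTHRDS_LND ))" \
         "exceeds the $NLANDCELLS active land grid cells"
fi
```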
My experience is that it gets worse than this. I can't speak for the current master, but on maint-1.0, if you accidentally assign more tasks than the total number of grid cells, the whole run is corrupted, meaning that even if you go back and simply fix the PE layout to work with LND, the job will still fail with the following error:
0: CalcWorkPerBlock: Total blocks: 24301 Ice blocks: 24301 IceFree blocks: 0 Land blocks: 0
188: *** Error in `/global/cscratch1/sd/adonahue/E3SM_simulations/CMDV_SE/Scaling/ne30_ne30/cori-knl.E3SM.cmdv_ps_00432_0216.ne30_ne30/build/e3sm.exe': free(): corrupted unsorted chunks: 0x0000000018942f70 ***
2: *** Error in `/global/cscratch1/sd/adonahue/E3SM_simulations/CMDV_SE/Scaling/ne30_ne30/cori-knl.E3SM.cmdv_ps_00432_0216.ne30_ne30/build/e3sm.exe': free(): corrupted unsorted chunks: 0x00000000189e2100 ***
189: *** Error in `/global/cscratch1/sd/adonahue/E3SM_simulations/CMDV_SE/Scaling/ne30_ne30/cori-knl.E3SM.cmdv_ps_00432_0216.ne30_ne30/build/e3sm.exe': free(): corrupted unsorted chunks: 0x000000001b3e12d0 ***
...
The only solution I have found is to delete the build and start over again with the correct PE-Layout.
This output appears to be coming from the older CICE component (not MPAS-Seaice) that is active in F-compsets. PE-layout changes in F-cases require a reset and re-build: please add case.setup -r && case.build to the workflow (deleting the build is not necessary).
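For concreteness, a minimal version of that workflow might look like the sketch below; the xmlchange values are illustrative only, and it simply follows the reset-and-rebuild suggestion above rather than deleting the build.

```bash
# Illustrative workflow for changing the PE layout in an existing F-case:
# reset and re-build instead of deleting the build directory.
cd $CASEDIR
./xmlchange NTASKS_LND=256,NTHRDS_LND=1   # example values only
./case.setup -r                           # reset the case after the layout change
./case.build
./case.submit
```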
ahh ok, thanks. I will try this. As a perhaps slightly related question, is it possible that a particular PE-Layout can cause a floating point invalid to be formed and then passed throughout the code?
I have a branch off of maint-1.0 that I've been running with many different PE-Layouts and have had some jobs work and some jobs fail. The jobs that fail all have this error:
0: CalcWorkPerBlock: Total blocks: 24301 Ice blocks: 24301 IceFree blocks: 0 Land blocks: 0
5427: forrtl: error (65): floating invalid
5427: Image PC Routine Line Source
5427: e3sm.exe 000000000431FD7E Unknown Unknown Unknown
5427: e3sm.exe 0000000003B8AF00 Unknown Unknown Unknown
5427: e3sm.exe 0000000001084B49 clubb_intr_mp_clu 1584 clubb_intr.F90
5427: e3sm.exe 00000000006051EB physpkg_mp_tphysb 2489 physpkg.F90
5427: e3sm.exe 00000000006017C3 physpkg_mp_phys_r 1038 physpkg.F90
5427: e3sm.exe 00000000004F03B7 cam_comp_mp_cam_r 251 cam_comp.F90
5427: e3sm.exe 00000000004E264F atm_comp_mct_mp_a 341 atm_comp_mct.F90
5427: e3sm.exe 000000000042C29F component_mod_mp_ 267 component_mod.F90
5427: e3sm.exe 00000000004203BD cime_comp_mod_mp_ 1958 cime_comp_mod.F90
5427: e3sm.exe 000000000042910C MAIN__ 92 cime_driver.F90
5427: e3sm.exe 000000000040A80E Unknown Unknown Unknown
5427: e3sm.exe 00000000043FF479 Unknown Unknown Unknown
5427: e3sm.exe 000000000040A6F9 Unknown Unknown Unknown
But I can't imagine that CLUBB is actually to blame since the same exact code worked and ran to completion with a different PE-Layout.
which machine are you on?
I'm running on Cori-KNL. I should say that I am running a branch with additions I made to run physics/dynamics in parallel, but I never ran into this problem with the same branch on Livermore Computing machines.
Also, the floating point invalid only happens at O(1) cores.
This may be something similar to #1183, although the error is different here. If you haven't changed clubb_intr.F90, then the line it is crashing at is: https://github.com/E3SM-Project/E3SM/blob/a4ac51d295751309bcb4193e2bc6d91c1bb4ee51/components/cam/src/physics/cam/clubb_intr.F90#L1584
Therefore, for some reason, there is something wrong with either the state1%pdel(i,pver-k+1) or the qrl(i,pver-k+1) variable when it crashes.
Yes, and it's a difficult bug to reproduce, since sometimes recompiling and running will work. @ndkeen pointed out that compiling with -fpe0 might be part of the problem, which I see is discussed in #1183 as you pointed out. I'm running maint-1.0 with the FC5AV1C-L compset, and checking the atm.bldlog file shows that the compilation used the -fpe0 flag.
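One way to make that check is sketched below; the path is only illustrative (build logs usually land in the case's build directory, and may be gzip-compressed depending on the machine).

```bash
# Look for the -fpe0 flag in the atm build log; adjust the path to the
# case's build directory, and fall back to zgrep for compressed logs.
grep -- '-fpe0' $CASEDIR/bld/atm.bldlog.* 2>/dev/null \
  || zgrep -- '-fpe0' $CASEDIR/bld/atm.bldlog.*.gz
```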
> Also, the floating point invalid only happens at O(1) cores.
This might also be due to an OOM issue: too few tasks allocated to physics.
@amametjanov, I've been able to monitor a few more jobs running (and failing). It looks like when it does fail, it is when there are a lot of cores assigned to physics, in particular more than the number of elements in the grid (i.e. > 5400 for ne30). But I am cautious about calling this the root of the problem, since I have also had a number of jobs with lots of cores assigned to physics (e.g. ATM-PE = 12,151, dynamics elements = 5400) run just fine.
I have also had jobs fail with the same issue that had half as many cores as elements.
I haven't had the problem on Livermore Computing, so I wonder if it is unique to Cori-KNL. I'm going to experiment with master and with maint-1.0 without my changes to see if I can isolate the issue to changes I made.
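For reference, the element counts quoted above follow from the cubed-sphere layout: an neN grid has 6*N*N spectral elements, so ne30 has 5400. A tiny sketch of that arithmetic (the task count is just the example value from the discussion above):

```bash
# Cubed-sphere element count: an "neN" grid has 6*N*N spectral elements,
# which caps how many MPI ranks the dynamics can usefully use.
NE=30
ELEMENTS=$(( 6 * NE * NE ))   # 5400 for ne30
ATM_TASKS=12151               # example ATM task count from the comment above
echo "ne${NE}: ${ELEMENTS} elements, ${ATM_TASKS} ATM tasks"
```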
For example, the ne4 grid has 407 chunks, and running LND on 720 ranks leads to:
That would avoid hard aborts and maybe just issue a warning instead?