E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Clamp number of LND MPI ranks when over the number of available chunks #1952

Closed amametjanov closed 6 years ago

amametjanov commented 6 years ago

For example, ne4 grid has 407 chunks and running LND on 720 ranks leads to:

 decompInit_lnd(): Number of processes exceeds number of land grid cells
         720         407
 ENDRUN:
 ERROR in decompInitMod.F90 at line 168

Clamping the rank count would avoid these hard aborts; maybe the model could just issue a warning instead?
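A minimal sketch of the kind of guard this is asking for (illustrative only; the routine and variable names and the clamp-and-warn approach are assumptions, not the actual decompInitMod.F90 code, which currently calls endrun):

  ! Illustrative sketch: warn and clamp instead of aborting the whole run.
  subroutine clamp_lnd_ranks(npes, numg, npes_used)
    integer, intent(in)  :: npes      ! MPI ranks assigned to LND
    integer, intent(in)  :: numg      ! active land grid cells (407 for ne4)
    integer, intent(out) :: npes_used ! ranks that will actually receive cells

    if (npes > numg) then
       write(*,'(a,i8,a,i8,a)') ' decompInit_lnd(): WARNING: ', npes, &
            ' ranks exceed ', numg, ' land cells; extra ranks will be idle'
       npes_used = numg
    else
       npes_used = npes
    end if
  end subroutine clamp_lnd_ranks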

ndkeen commented 6 years ago

I have been complaining about this directly to @bishtgautam :) I hit this more than usual because I have been changing PE layouts often. I do think the code should do something besides what it does now. I understand it's not easy to know whether there will be too many tasks for LND before the job runs, and it may take some work to change how this is handled so that the total LND tasks are reduced when needed. I think the limit is not just on the LND MPI tasks but on the total number of threads -- I sometimes hit this when I increase threads (and then have to leave LND threads at 1 to work around it).

bishtgautam commented 6 years ago

If LND_TASKS * LND_THRDS > Total number of active land grid cells, the land model will stop. The land F90 code cannot handle the above-mentioned processor layout.
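For the ne4 example above: 720 tasks * 1 thread = 720 > 407 active cells, so the run aborts; even with fewer tasks, adding threads can cross the limit (e.g. 204 tasks * 2 threads = 408 > 407).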

AaronDonahue commented 5 years ago

My experience is that it gets worse than this. I can't speak for the current master, but on maint-1.0, if you accidentally assign more tasks than the total number of grid cells, the whole run is corrupted: even if you go back and simply fix the PE layout to work with LND, the job will still fail with the following error:

0: CalcWorkPerBlock: Total blocks:      24301 Ice blocks:      24301 IceFree blocks:          0 Land blocks:          0
188: *** Error in `/global/cscratch1/sd/adonahue/E3SM_simulations/CMDV_SE/Scaling/ne30_ne30/cori-knl.E3SM.cmdv_ps_00432_0216.ne30_ne30/build/e3sm.exe': free(): corrupted unsorted chunks: 0x0000000018942f70 ***
  2: *** Error in `/global/cscratch1/sd/adonahue/E3SM_simulations/CMDV_SE/Scaling/ne30_ne30/cori-knl.E3SM.cmdv_ps_00432_0216.ne30_ne30/build/e3sm.exe': free(): corrupted unsorted chunks: 0x00000000189e2100 ***
189: *** Error in `/global/cscratch1/sd/adonahue/E3SM_simulations/CMDV_SE/Scaling/ne30_ne30/cori-knl.E3SM.cmdv_ps_00432_0216.ne30_ne30/build/e3sm.exe': free(): corrupted unsorted chunks: 0x000000001b3e12d0 ***
.
.
.

The only solution I have found is to delete the build and start over again with the correct PE-Layout.

amametjanov commented 5 years ago

This output appears to be coming from the older CICE component (not MPAS-Seaice) that is active in F-compsets. PE-layout changes in F-cases require a reset and rebuild. Please add case.setup -r && case.build to the workflow (deleting the build is not necessary).
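The suggested sequence after changing the PE layout would look something like this (the NTASKS_LND value is only an illustrative example):

  ./xmlchange NTASKS_LND=400   # choose a layout that fits the grid (example value)
  ./case.setup -r              # reset the case after the PE-layout change
  ./case.build                 # rebuild in place; no need to delete the build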

AaronDonahue commented 5 years ago

Ahh ok, thanks. I will try this. As a perhaps slightly related question: is it possible that a particular PE layout can cause a floating-point invalid to be generated and then propagated throughout the code?

I have a branch off of maint-1.0 that I've been running with many different PE-Layouts and have had some jobs work and some jobs fail. The jobs that fail all have this error:

0: CalcWorkPerBlock: Total blocks:      24301 Ice blocks:      24301 IceFree blocks:          0 Land blocks:          0
5427: forrtl: error (65): floating invalid
5427: Image              PC                Routine            Line        Source
5427: e3sm.exe           000000000431FD7E  Unknown               Unknown  Unknown
5427: e3sm.exe           0000000003B8AF00  Unknown               Unknown  Unknown
5427: e3sm.exe           0000000001084B49  clubb_intr_mp_clu        1584  clubb_intr.F90
5427: e3sm.exe           00000000006051EB  physpkg_mp_tphysb        2489  physpkg.F90
5427: e3sm.exe           00000000006017C3  physpkg_mp_phys_r        1038  physpkg.F90
5427: e3sm.exe           00000000004F03B7  cam_comp_mp_cam_r         251  cam_comp.F90
5427: e3sm.exe           00000000004E264F  atm_comp_mct_mp_a         341  atm_comp_mct.F90
5427: e3sm.exe           000000000042C29F  component_mod_mp_         267  component_mod.F90
5427: e3sm.exe           00000000004203BD  cime_comp_mod_mp_        1958  cime_comp_mod.F90
5427: e3sm.exe           000000000042910C  MAIN__                     92  cime_driver.F90
5427: e3sm.exe           000000000040A80E  Unknown               Unknown  Unknown
5427: e3sm.exe           00000000043FF479  Unknown               Unknown  Unknown
5427: e3sm.exe           000000000040A6F9  Unknown               Unknown  Unknown

But I can't imagine that CLUBB is actually to blame since the same exact code worked and ran to completion with a different PE-Layout.

singhbalwinder commented 5 years ago

But I can't imagine that CLUBB is actually to blame since the same exact code worked and ran to completion with a different PE-Layout.

which machine are you on?

AaronDonahue commented 5 years ago

I'm running on Cori-KNL. I should say that I am running a branch with additions I made to run physics/dynamics in parallel, but I never ran into this problem with the same branch on Livermore Computing machines.

Also, the floating-point invalid only happens on O(1) cores.

singhbalwinder commented 5 years ago

This may be something similar to #1183 , although the error is different here. If you haven't changed clubb_intr.F90 then the line it is crashing at is: https://github.com/E3SM-Project/E3SM/blob/a4ac51d295751309bcb4193e2bc6d91c1bb4ee51/components/cam/src/physics/cam/clubb_intr.F90#L1584

Therefore, for some reason, there is something wrong with either the state1%pdel(i,pver-k+1) or the qrl(i,pver-k+1) variable when it crashes.

AaronDonahue commented 5 years ago

Yes, and it's a difficult bug to reproduce since sometimes recompiling and re-running will work. @ndkeen pointed out that compiling with -fpe0 might be part of the problem, which I see is discussed in #1183 as you pointed out. I'm running maint-1.0 with the FC5AV1C-L compset, and checking the atm.bldlog file shows that the compilation does use the -fpe0 flag.

amametjanov commented 5 years ago

Also, the floating point invalid only happens at O(1) cores.

This might also be due to an OOM issue: too few tasks allocated to physics.

AaronDonahue commented 5 years ago

@amametjanov , I've been able to monitor a few more jobs running (and failing). It looks like when it does fail, it is when there are a lot of cores assigned to physics, in particular more than the number of elements in the grid (i.e. > 5400 for ne30). But I hesitate to say this is the root of the problem, since I have also had a number of jobs with lots of cores assigned to physics run just fine (e.g. ATM-PE = 12,151 with 5,400 dynamics elements).

I have also had jobs fail with the same issue while using only half as many cores as there are elements.

I haven't had the problem on Livermore Computing, so I wonder if it is unique to Cori-KNL. I'm going to experiment with master and with maint-1.0 without my changes to see if I can isolate the issue to changes I made.