E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

ERROR: get_proc_bounds with `ERS.f19_g16.I1850CLM45.cori-knl_intel.clm-vst` #3469

Closed ndkeen closed 4 years ago

ndkeen commented 4 years ago
 18:  urban net longwave radiation error: no convergence
 18:  clm model is stopping
 18:  calling getglobalwrite with decomp_index=         1616  and clmlevel= landunit
 18:  local  landunit index =         1616
 18:  ERROR: get_proc_bounds ERROR: Calling from inside  a threaded region
 18: Image              PC                Routine            Line        Source
 18: e3sm.exe           0000000002B8D8D6  Unknown               Unknown  Unknown
 18: e3sm.exe           000000000107485C  shr_abort_mod_mp_         114  shr_abort_mod.F90
 18: e3sm.exe           0000000000590A39  decompmod_mp_get_         394  decompMod.F90
 18: e3sm.exe           0000000000DE894F  getglobalvaluesmo          44  GetGlobalValuesMod.F90
 18: e3sm.exe           0000000000DE7B1C  getglobalvaluesmo         158  GetGlobalValuesMod.F90
 18: e3sm.exe           00000000004FAB45  abortutils_mp_end          69  abortutils.F90
 18: e3sm.exe           0000000000A3F952  urbanradiationmod         680  UrbanRadiationMod.F90
 18: e3sm.exe           0000000000A3C703  urbanradiationmod         224  UrbanRadiationMod.F90
 18: e3sm.exe           00000000004FC844  clm_driver_mp_clm         634  clm_driver.F90
 18: e3sm.exe           000000000169A0E3  Unknown               Unknown  Unknown
 18: e3sm.exe           00000000016519C0  Unknown               Unknown  Unknown
 18: e3sm.exe           0000000001650C1A  Unknown               Unknown  Unknown
 18: e3sm.exe           000000000169A499  Unknown               Unknown  Unknown
 18: e3sm.exe           0000000002439899  Unknown               Unknown  Unknown
 18: e3sm.exe           0000000002D150DF  Unknown               Unknown  Unknown

/global/cscratch1/sd/ndk/acme_scratch/cori-knl/m27-feb25/ERS.f19_g16.I1850CLM45.cori-knl_intel.clm-vst.r00
amametjanov commented 4 years ago

I can't reproduce this with the latest master, but please try this patch to get a more informative abort from UrbanRadiationMod.F90:

$ git diff
diff --git a/components/clm/src/main/GetGlobalValuesMod.F90 b/components/clm/src/main/GetGlobalValuesMod.F90
index c04b171..5daa824 100644
--- a/components/clm/src/main/GetGlobalValuesMod.F90
+++ b/components/clm/src/main/GetGlobalValuesMod.F90
@@ -41,7 +41,9 @@ contains
     integer                       :: beg_index     ! beginning proc index for clmlevel
     !----------------------------------------------------------------

+    !$omp master
     call get_proc_bounds(bounds_proc)
+    !$omp end master

     if (trim(clmlevel) == nameg) then
        beg_index = bounds_proc%begg
ndkeen commented 4 years ago

I should have noted that my test used master as of today, so it is interesting that it did not fail for you. I have submitted a DEBUG test and a no-threading case. I will also simply rerun the original test, and then try your suggestion.

ERS_PMx1.f19_g16.I1850CLM45.cori-knl_intel.clm-vst  passed
ERS.f19_g16.I1850CLM45.cori-knl_intel19.clm-vst  passed
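
The test names above follow the CIME pattern `TESTTYPE[_MODIFIER...].GRID.COMPSET.MACHINE_COMPILER[.TESTMODS]`. Below is a minimal sketch of decoding them, to make it easier to see which runs are DEBUG builds and which pin the thread count to 1. It assumes the usual modifier meanings (`D` for a debug build, `P<layout>x<threads>` for a PE layout with an explicit thread count). The helper `parse_test_name` and the `ParsedTest` type are hypothetical and not part of CIME.

```python
from typing import NamedTuple, Optional, Tuple


class ParsedTest(NamedTuple):
    test_type: str
    modifiers: Tuple[str, ...]
    grid: str
    compset: str
    machine: str
    compiler: str
    testmods: Optional[str]
    debug: bool
    threads: Optional[int]  # None means "use the machine's default layout"


def parse_test_name(name: str) -> ParsedTest:
    """Split a CIME-style test name into its fields (illustrative only)."""
    parts = name.split(".")
    if len(parts) not in (4, 5):
        raise ValueError(f"unexpected number of fields in {name!r}")
    case, grid, compset, mach_comp = parts[:4]
    testmods = parts[4] if len(parts) == 5 else None

    # "SMS_D_P64x1" -> test type "SMS", modifiers ["D", "P64x1"]
    test_type, *modifiers = case.split("_")

    # "cori-knl_intel19" -> machine "cori-knl", compiler "intel19"
    machine, _, compiler = mach_comp.rpartition("_")

    # Assumption: a modifier like "P64x1" or "PMx1" ends in "x<threads>".
    threads = None
    for mod in modifiers:
        if mod.startswith("P") and "x" in mod:
            threads = int(mod.rsplit("x", 1)[1])

    return ParsedTest(
        test_type=test_type,
        modifiers=tuple(modifiers),
        grid=grid,
        compset=compset,
        machine=machine,
        compiler=compiler,
        testmods=testmods,
        debug="D" in modifiers,
        threads=threads,
    )


if __name__ == "__main__":
    for test in (
        "ERS.f19_g16.I1850CLM45.cori-knl_intel.clm-vst",
        "ERS_PMx1.f19_g16.I1850CLM45.cori-knl_intel.clm-vst",
        "SMS_D_P64x1.f19_g16.I1850CLM45.cori-knl_intel.clm-vst",
    ):
        p = parse_test_name(test)
        print(test, "debug:", p.debug, "threads:", p.threads)
```

Read this way, the original failing test requests no explicit thread count, while the `PMx1` run that passed is limited to one thread per task.
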
ndkeen commented 4 years ago

I realize I never reported back on this. The DEBUG test also failed:

SMS_D_P64x1.f19_g16.I1850CLM45.cori-knl_intel.clm-vst

 5: forrtl: error (73): floating divide by zero
 5: Image              PC                Routine            Line        Source
 5: e3sm.exe           00000000069B22A4  Unknown               Unknown  Unknown
 5: e3sm.exe           0000000006275310  Unknown               Unknown  Unknown
 5: e3sm.exe           00000000020E2755  soilwatermovement         499  SoilWaterMovementMod.F90
 5: e3sm.exe           00000000020BCB3F  soilwatermovement         132  SoilWaterMovementMod.F90
 5: e3sm.exe           000000000194EEC3  hydrologynodraina         250  HydrologyNoDrainageMod.F90
 5: e3sm.exe           00000000008DF1C7  clm_driver_mp_clm         798  clm_driver.F90
 5: e3sm.exe           0000000000887D1C  lnd_comp_mct_mp_l         509  lnd_comp_mct.F90
 5: e3sm.exe           00000000004627B5  component_mod_mp_         737  component_mod.F90
 5: e3sm.exe           0000000000428C76  cime_comp_mod_mp_        2611  cime_comp_mod.F90
 5: e3sm.exe           000000000044A2C0  MAIN__                    133  cime_driver.F90
 5: e3sm.exe           0000000000401AE2  Unknown               Unknown  Unknown
 5: e3sm.exe           0000000006A8D30F  Unknown               Unknown  Unknown
 5: e3sm.exe           00000000004019CA  Unknown               Unknown  Unknown

I also just re-ran with master as of Apr 27th:

ERS_PMx1.f19_g16.I1850CLM45.cori-knl_intel.clm-vst   passes

and

ERS_D.f19_g16.I1850CLM45.cori-knl_intel19.clm-vst

fails with:

 61: forrtl: error (73): floating divide by zero
 61: Image              PC                Routine            Line        Source
 61: e3sm.exe           000000000692E8F4  Unknown               Unknown  Unknown
 61: e3sm.exe           00000000061F13E0  Unknown               Unknown  Unknown
 61: e3sm.exe           00000000020B823F  soilwatermovement         499  SoilWaterMovementMod.F90
 61: e3sm.exe           0000000002094401  soilwatermovement         132  SoilWaterMovementMod.F90
 61: e3sm.exe           0000000001960A62  hydrologynodraina         250  HydrologyNoDrainageMod.F90
 61: e3sm.exe           00000000008E66B3  clm_driver_mp_clm         798  clm_driver.F90
 61: e3sm.exe           00000000053E63A3  Unknown               Unknown  Unknown
 61: e3sm.exe           000000000539064A  Unknown               Unknown  Unknown
 61: e3sm.exe           000000000538F6A1  Unknown               Unknown  Unknown
 61: e3sm.exe           00000000053E678A  Unknown               Unknown  Unknown
 61: e3sm.exe           00000000061EC339  Unknown               Unknown  Unknown
 61: e3sm.exe           0000000006A95C5F  Unknown               Unknown  Unknown

This looks like the same error I saw with a previous repo using SMS_D_P64x1.f19_g16.I1850CLM45.cori-knl_intel.clm-vst, so that test will probably fail the same way with this repo (trying now).
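
One way to check that two tracebacks point at the same failure is to compare only the frames that carry a source file and line number, ignoring the MPI rank prefix, the program-counter addresses, and the `Unknown` frames. The sketch below does this for abbreviated copies of the two DEBUG tracebacks in this thread. The regular expression and the helper `known_frames` are illustrative assumptions about this traceback layout, not an official parser.

```python
import re

# Matches rows such as:
#   " 5: e3sm.exe  00000000020E2755  soilwatermovement  499  SoilWaterMovementMod.F90"
# Rows whose line number is "Unknown", and the header row, do not match.
FRAME = re.compile(
    r"^\s*(?P<rank>\d+):\s+\S+\s+[0-9A-Fa-f]{16}\s+"
    r"(?P<routine>\S+)\s+(?P<line>\d+)\s+(?P<source>\S+)\s*$"
)


def known_frames(trace: str) -> list:
    """Return (source file, line number) for every frame with debug info."""
    frames = []
    for row in trace.splitlines():
        m = FRAME.match(row)
        if m:
            frames.append((m["source"], int(m["line"])))
    return frames


# Abbreviated copies of the two tracebacks reported above.
TRACE_A = """
 5: forrtl: error (73): floating divide by zero
 5: Image              PC                Routine            Line        Source
 5: e3sm.exe           00000000069B22A4  Unknown               Unknown  Unknown
 5: e3sm.exe           00000000020E2755  soilwatermovement         499  SoilWaterMovementMod.F90
 5: e3sm.exe           00000000020BCB3F  soilwatermovement         132  SoilWaterMovementMod.F90
 5: e3sm.exe           000000000194EEC3  hydrologynodraina         250  HydrologyNoDrainageMod.F90
 5: e3sm.exe           00000000008DF1C7  clm_driver_mp_clm         798  clm_driver.F90
"""

TRACE_B = """
 61: forrtl: error (73): floating divide by zero
 61: Image              PC                Routine            Line        Source
 61: e3sm.exe           000000000692E8F4  Unknown               Unknown  Unknown
 61: e3sm.exe           00000000020B823F  soilwatermovement         499  SoilWaterMovementMod.F90
 61: e3sm.exe           0000000002094401  soilwatermovement         132  SoilWaterMovementMod.F90
 61: e3sm.exe           0000000001960A62  hydrologynodraina         250  HydrologyNoDrainageMod.F90
 61: e3sm.exe           00000000008E66B3  clm_driver_mp_clm         798  clm_driver.F90
"""

if __name__ == "__main__":
    same = known_frames(TRACE_A) == known_frames(TRACE_B)
    print("same call path:", same)
    print("innermost known frame:", known_frames(TRACE_A)[0])
```

Matching source and line pairs only show that both runs trapped at the same statement. They do not by themselves establish the same root cause.
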

ndkeen commented 4 years ago

Closing this issue, as the error is now the same as https://github.com/E3SM-Project/E3SM/issues/2243