E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Fail with coupled hires using 825 nodes on cori-knl #1955

Closed ndkeen closed 6 years ago

ndkeen commented 6 years ago

Saving this here.

case: /global/cscratch1/sd/ndk/ACME_simulations/2017-master-nov28.hmod825.mnov28.st04.5d.nr.wh0.atune.ne120np4_oRRS18to6v3_ICG.cori-knl

42138: forrtl: error (78): process killed (SIGTERM)
42138: Image              PC                Routine            Line        Source             
42138: acme.exe           00000000055F48D1  Unknown               Unknown  Unknown
42138: acme.exe           00000000055F2A0B  Unknown               Unknown  Unknown
42138: acme.exe           0000000005595A94  Unknown               Unknown  Unknown
42138: acme.exe           00000000055958A6  Unknown               Unknown  Unknown
42138: acme.exe           00000000054FDDB9  Unknown               Unknown  Unknown
42138: acme.exe           0000000005509B6C  Unknown               Unknown  Unknown
42138: acme.exe           0000000004EC6010  Unknown               Unknown  Unknown
42138: acme.exe           00000000050F9E04  Unknown               Unknown  Unknown
42138: acme.exe           0000000005108AB0  Unknown               Unknown  Unknown
42138: acme.exe           0000000004FFD76B  Unknown               Unknown  Unknown
42138: acme.exe           0000000004FFDFBA  Unknown               Unknown  Unknown
42138: acme.exe           000000000500FB0A  Unknown               Unknown  Unknown
42138: acme.exe           0000000003A3204A  piolib_mod_mp_pio        2804  piolib_mod.F90
42138: acme.exe           000000000054E299  cam_pio_utils_mp_        1106  cam_pio_utils.F90
42138: acme.exe           0000000001D4A954  phys_prop_mp_phys         234  phys_prop.F90
42138: acme.exe           0000000000654CB1  rad_constituents_         432  rad_constituents.F90
42138: acme.exe           0000000000629A84  physpkg_mp_phys_i         780  physpkg.F90
42138: acme.exe           0000000000502BD2  cam_comp_mp_cam_i         178  cam_comp.F90
42138: acme.exe           00000000004F4592  atm_comp_mct_mp_a         260  atm_comp_mct.F90
42138: acme.exe           000000000042AECE  component_mod_mp_         231  component_mod.F90
42138: acme.exe           0000000000419A92  cime_comp_mod_mp_        1180  cime_comp_mod.F90
42138: acme.exe           0000000000427D6F  MAIN__                     92  cime_driver.F90
42138: acme.exe           000000000040AF1E  Unknown               Unknown  Unknown
42138: acme.exe           0000000005614D59  Unknown               Unknown  Unknown
42138: acme.exe           000000000040AE09  Unknown               Unknown  Unknown

I'm now trying a DEBUG=TRUE build, as well as a run without cosp.

singhbalwinder commented 6 years ago

I generally get SIGTERM when the job runs out of time. Is that the case here? Does it hang somewhere and then die when it runs out of time? From the traceback, it looks like it was killed while trying to read some physical properties of aerosols.
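To make that reading of the traceback easier to reproduce, here is a minimal Python sketch that pulls out only the frames with a known source file. The regular expression is an assumption based on the column layout of the log in this issue (rank prefix, image, 16-digit hex PC, routine, line, source); `resolved_frames` is a hypothetical helper, not part of E3SM or CIME.

```python
import re

# One frame of the forrtl traceback as it appears in the log above:
# "<rank>: <image> <16-hex-digit PC> <routine> <line> <source>"
# (layout assumed from this issue's log, not from any format specification)
FRAME = re.compile(
    r"^\s*(?P<rank>\d+):\s+(?P<image>\S+)\s+(?P<pc>[0-9A-Fa-f]{16})\s+"
    r"(?P<routine>\S+)\s+(?P<line>\S+)\s+(?P<source>\S+)\s*$"
)


def resolved_frames(log_text):
    """Return (routine, line, source) for frames whose source file is known.

    Frames are returned in log order, so the first entry is the innermost
    resolved call. Header lines and "Unknown" frames are skipped.
    """
    frames = []
    for raw in log_text.splitlines():
        m = FRAME.match(raw)
        if not m:
            continue  # error banner, column header, or unrelated log output
        if m.group("source") == "Unknown" or not m.group("line").isdigit():
            continue  # frame without debug information
        frames.append((m.group("routine"), int(m.group("line")), m.group("source")))
    return frames


if __name__ == "__main__":
    # A few lines copied from the traceback in this issue.
    sample = """\
42138: forrtl: error (78): process killed (SIGTERM)
42138: Image              PC                Routine            Line        Source
42138: acme.exe           000000000500FB0A  Unknown               Unknown  Unknown
42138: acme.exe           0000000003A3204A  piolib_mod_mp_pio        2804  piolib_mod.F90
42138: acme.exe           000000000054E299  cam_pio_utils_mp_        1106  cam_pio_utils.F90
42138: acme.exe           0000000001D4A954  phys_prop_mp_phys         234  phys_prop.F90
"""
    for routine, line, source in resolved_frames(sample):
        print(f"{source}:{line}  ({routine})")
```

On the log above, the innermost resolved frames are in `piolib_mod.F90`, `cam_pio_utils.F90`, and `phys_prop.F90`, which is what points at the aerosol property read. With SIGTERM, the traceback only shows where the process was when the signal arrived, so it does not by itself indicate a bug at that location.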

ndkeen commented 6 years ago

Whoops. Yes, good catch @singhbalwinder. This was a job I cancelled during a period when several jobs were taking too long to initialize; I had forgotten about that. I will try re-running and will close this issue.