E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

MPAS-O with threading runtime failure on Edison #904

Closed: worleyph closed this issue 8 years ago

worleyph commented 8 years ago

After running successfully on Titan, and after hitting compile issues on Cetus, I tried running A_WCYCL2000 ne30_oEC on Edison with threading enabled in MPAS-O. This fails at runtime, I assume because of the threading in the ocean.

Experiment:

 <entry id="NTASKS_ATM"   value="450"  />
 <entry id="NTHRDS_ATM"   value="2"  />
 <entry id="ROOTPE_OCN"   value="0"  />

and then for the ocean (and all other components):

 <entry id="NTASKS_OCN"   value="456"  />
 <entry id="NTHRDS_OCN"   value="2"  />
 <entry id="ROOTPE_OCN"   value="0"  />

except for ICE, which used only one thread. The node settings were:

 <entry id="MAX_TASKS_PER_NODE"   value="48"  />
 <entry id="PES_PER_NODE"   value="24"  />

This configuration failed at runtime. I repeated the experiment, but without threading in the ocean:

 <entry id="NTASKS_OCN"   value="456"  />
 <entry id="NTHRDS_OCN"   value="1"  />
 <entry id="ROOTPE_OCN"   value="0"  />

This ran successfully. I then repeated the first experiment, and it failed in the same way as before:

 srun: error: nid00137: task 315: Exited with exit code 174
 srun: Terminating job step 469461.0
 srun: error: nid00139: task 374: Exited with exit code 174

with no relevant information in the log files:

cpl.log:

 (seq_mct_drv) : Model initialization complete
 ...
 (prep_ice_merge) x2i%Fixx_rofi = = (g2x%Figg_rofi + r2x%Firr_rofi)*flux_epbalfact

cesm.log:

 000:  ----- done parsing run-time I/O from streams.cice -----
 000:
 000: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
 000: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
 000: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
 000: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
 000: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
 000: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
 000: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
 000: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
 000:  Setting mpi info: striping_factor=16
 000:  Setting mpi info: striping_unit=1048576
 000:  pionfput_mod.F90         128           1           1          64           1
 000:  0001-01-01_00:00:00

ocn.log:

  Initial time 0001-01-01_00:30:00

ice.log:

  Doing timestep 0001-01-01_01:00:00
     Starting analysis precompute
     Finished analysis precompute
 ...
     Writing output streams
     Finished writing output streams
  Completed timestep 0001-01-01_01:00:00

atm.log:

  nstep, te        1   0.33533575202015142E+10   0.33534131484464006E+10   0.30760466121831460E-02   0.98531023654995137E+05

I'll try with more processes and without hyperthreading, but I would like someone to try to repeat my experiment and independently verify the failure. Assigning this to @ndkeen, but hopefully @douglasjacobsen and @amametjanov can advise or perhaps take over this task, as appropriate.

worleyph commented 8 years ago

Repeated with the 117-node layout (ocean on its own processes, using a 1440x2 PE layout). The error is identical, so still not much information. I will try the same layout again, but with 1440x1, just to verify.

worleyph commented 8 years ago

The repeat with the 117-node layout but without threading in the ocean worked fine. @amametjanov or @douglasjacobsen, have you tested MPAS-O with threading in a coupled run on Edison? (Note that this did work for me on Titan, at least once; I haven't tried repeating my Titan success.)

douglasjacobsen commented 8 years ago

@worleyph I think I was trying this on Cori last time, but I'd expect the failures you're seeing to be repeatable there.

I haven't had a chance to try it recently though.

ndkeen commented 8 years ago

It's not clear from your description how I can repeat this. Should I just manually edit env_mach_pes.xml?

worleyph commented 8 years ago

@ndkeen, please look in

 /global/homes/w/worleyph/ACME/master/ACME/cime/scripts/A_WCYCL2000.ne30_oEC.edison_master_ocean_openmp_test

(should be readable).

 ./create_newcase -case <your name here> -compset A_WCYCL2000 -res ne30_oEC -mach edison -compiler intel -project acme

Then copy one of my "wOMP" PE layouts from that directory into your case as env_mach_pes.xml,

followed by the usual

 ./cesm_setup
 ./XXX.build
 ./XXX.submit

It doesn't look like I changed anything else for these experiments.
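
Condensed into one sequence, the reproduction looks roughly like this (the case name is a placeholder, and env_mach_pes.xml_wOMP stands for whichever wOMP layout file you pick):

 # Sketch of the reproduction steps above; case name is a placeholder.
 ./create_newcase -case my_wcycl_omp_test -compset A_WCYCL2000 -res ne30_oEC -mach edison -compiler intel -project acme
 cd my_wcycl_omp_test
 # Copy one of the wOMP PE layouts over the default env_mach_pes.xml.
 cp /global/homes/w/worleyph/ACME/master/ACME/cime/scripts/A_WCYCL2000.ne30_oEC.edison_master_ocean_openmp_test/env_mach_pes.xml_wOMP env_mach_pes.xml
 ./cesm_setup
 ./my_wcycl_omp_test.build
 ./my_wcycl_omp_test.submit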

worleyph commented 8 years ago

Update: following a suggestion by (I think) @amametjanov, I tried MPAS-O with threading on Titan using the Intel compiler. This also worked. That run used intel/15.0.2.164; the failed runs on Edison used intel/15.0.1.133. @ndkeen, have you had a chance to try to duplicate my failure?

ndkeen commented 8 years ago

This slipped through my task list. I'm running a couple of jobs now.

ndkeen commented 8 years ago

I tried the env_mach_pes.xml_wOMP layout in @worleyph's directory (though I did change MAX_TASKS_PER_NODE from 48 to 24) and I get what looks like the same error he saw. No stack trace or other information, except this in cesm.log:

000:  Setting mpi info: striping_factor=16
000:  Setting mpi info: striping_unit=1048576
000:  pionfput_mod.F90         128           1           1          64           1 
000:  0001-01-01_00:00:00
srun: error: nid01768: task 10: Exited with exit code 174
srun: Terminating job step 514057.0
srun: error: nid01805: task 454: Exited with exit code 174
000: slurmstepd: *** STEP 514057.0 ON nid01768 CANCELLED AT 2016-06-08T19:32:44 ***

I also ran another run with 1 thread and it completed without issues.

I was using the next branch.

ndkeen commented 8 years ago

I tried debug runs. The one with 1 thread didn't have enough time to finish, but showed no errors. With 2 threads it does fail:

000:  pionfput_mod.F90         128           1           1          64           1 
000:  0001-01-01_00:00:00
108: forrtl: severe (174): SIGSEGV, segmentation fault occurred
384: forrtl: severe (408): fort: (7): Attempt to use pointer NVERTLEVELS when it is not associated with a target
384: 
217: forrtl: severe (408): fort: (7): Attempt to use pointer NVERTLEVELS when it is not associated with a target
217: 
288: forrtl: severe (408): fort: (7): Attempt to use pointer PTR when it is not associated with a target
288: 
026: forrtl: severe (408): fort: (7): Attempt to use pointer PTR when it is not associated with a target
026: 
393: forrtl: severe (408): fort: (7): Attempt to use pointer MAXLEVELCELL when it is not associated with a target
393: 
220: forrtl: severe (408): fort: (7): Attempt to use pointer NVERTLEVELS when it is not associated with a target
220: 
008: forrtl: severe (408): fort: (7): Attempt to use pointer CONFIG_N_TS_ITER when it is not associated with a target
008: 
170: forrtl: severe (408): fort: (3): Subscript #1 of the array NORMALVELOCITYCUR has value 1 which is less than the lower bound of 4607182418800017408
170: 
132: forrtl: severe (408): fort: (2): Subscript #2 of the array NORMALBAROCLINICVELOCITYCUR has value 1270 which is greater than the upper bound of 12
132: 
307: forrtl: severe (408): fort: (3): Subscript #2 of the array NORMALBAROCLINICVELOCITYCUR has value 1 which is less than the lower bound of 140737488238024
307: 
111: forrtl: severe (408): fort: (7): Attempt to use pointer NEDGES when it is not associated with a target
111: 
154: forrtl: severe (408): fort: (3): Subscript #1 of the array NORMALBAROTROPICVELOCITYCUR has value 1 which is less than the lower bound of 4607182418800017408

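For reference, a DEBUG build like the one that produced the pointer/bounds checks above can typically be turned on from the case directory with something like the following (a sketch; DEBUG living in env_build.xml and the clean_build step are assumptions about this vintage of the scripts):

 # Sketch: enable a DEBUG build, then rebuild from scratch so the flags take effect.
 ./xmlchange -file env_build.xml -id DEBUG -val TRUE
 ./XXX.clean_build
 ./XXX.build
 ./XXX.submit
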
ndkeen commented 8 years ago

Using a different Intel compiler version, and debug, with 2 threads, I get a slightly different error, which may or may not help us:

000:  pionfput_mod.F90         128           1           1          64           1
000:  0001-01-01_00:00:00
184: *** glibc detected *** /scratch2/scratchdirs/ndk/acme_scratch/n04g904coupled05i17debug/bld/cesm.exe: malloc(): memory corruption: 0x00002aaaed5e1b30 ***
106: *** glibc detected *** /scratch2/scratchdirs/ndk/acme_scratch/n04g904coupled05i17debug/bld/cesm.exe: malloc(): memory corruption: 0x00002aaac8112f50 ***
280: *** glibc detected *** /scratch2/scratchdirs/ndk/acme_scratch/n04g904coupled05i17debug/bld/cesm.exe: malloc(): memory corruption: 0x00002aaaf4d0ad30 ***
107: *** glibc detected *** /scratch2/scratchdirs/ndk/acme_scratch/n04g904coupled05i17debug/bld/cesm.exe: malloc(): memory corruption: 0x00002aaad00fe170 ***
239: *** glibc detected *** /scratch2/scratchdirs/ndk/acme_scratch/n04g904coupled05i17debug/bld/cesm.exe: malloc(): memory corruption: 0x00002aaabc117b40 ***
356: forrtl: severe (408): fort: (3): Subscript #1 of the array TABLE has value 1215 which is less than the lower bound of 4294967297
356:
278: forrtl: severe (174): SIGSEGV, segmentation fault occurred
359: forrtl: severe (408): fort: (7): Attempt to use pointer NVERTLEVELS when it is not associated with a target
359:
325: forrtl: severe (408): fort: (7): Attempt to use pointer PTR when it is not associated with a target

worleyph commented 8 years ago

I just got a segmentation fault on Titan with PGI in the ocean when using threading (and it worked without threading). I'll verify that this is repeatable.

Note that I have been running GMPAS at oRRS15to5 with threading with no problems for the past week.

worleyph commented 8 years ago

I just repeated my original case, but with threading enabled in ICE. This time it worked fine. I'll repeat the larger experiment; if that succeeds, I'll close this issue.
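
For the record, the only change relative to the failing setup was the ICE thread count, roughly as follows (a sketch; the xmlchange syntax, the thread count of 2, and the setup/clean steps are assumptions):

 # Sketch: enable threading in MPAS-CICE as well, then rebuild so the ICE library
 # is compiled with OpenMP support.
 ./xmlchange -file env_mach_pes.xml -id NTHRDS_ICE -val 2    # 2 matches the other components
 ./cesm_setup -clean && ./cesm_setup                         # refresh setup after the PE-layout change
 ./XXX.clean_build
 ./XXX.build
 ./XXX.submit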

worleyph commented 8 years ago

The larger reproducer is also now working when MPAS-CICE is built with threading. Closing this issue.