worleyph closed this issue 8 years ago.
Repeat with the 117-node layout: ocean on its own processes, using a 1440x2 PE layout. The error is identical (so still not much information). Will try the same layout again, but with 1440x1, just to verify.
Repeating with the 117-node layout but without threading in the ocean worked fine. @amametjanov or @douglasjacobsen, have you tested MPAS-O with threading in a coupled run on Edison? (Note that this did work for me on Titan, at least once - haven't tried repeating my Titan success.)
@worleyph I think I was trying on cori last time, but I'd expect the failures you're seeing to be repeatable there.
I haven't had a chance to try it recently though.
It's not clear from your description how I can repeat this. Should I just manually edit env_mach_pes.xml?
@ndkeen , Please look in
/global/homes/w/worleyph/ACME/master/ACME/cime/scripts/A_WCYCL2000.ne30_oEC.edison_master_ocean_openmp_test
(should be readable).
./create_newcase -case <your name here> -compset A_WCYCL2000 -res ne30_oEC -mach edison -compiler intel -project acme
Then rename one of my "wOMP" PE layouts in the directory to env_mach_pes.xml,
followed by the usual
./cesm_setup
./XXX.build
./XXX.submit
Doesn't look like I changed anything else for these experiments.
Update: following suggestion by (I think @amametjanov ) I tried MPAS-O with threading on Titan using the intel compiler. This also worked. This was using intel/15.0.2.164 . The failed runs on Edison used intel/15.0.1.133 . @ndkeen , have you had a chance to try to duplicate my failure?
This slipped through my task list. I'm running a couple of jobs now.
I tried the env_mach_pes.xml_wOMP in @worleyph's directory (though I did change max tasks from 48 to 24) and I get what looks like the same error he saw. No stack trace or other information, except this in cesm.log:
000: Setting mpi info: striping_factor=16
000: Setting mpi info: striping_unit=1048576
000: pionfput_mod.F90 128 1 1 64 1
000: 0001-01-01_00:00:00
srun: error: nid01768: task 10: Exited with exit code 174
srun: Terminating job step 514057.0
srun: error: nid01805: task 454: Exited with exit code 174
000: slurmstepd: *** STEP 514057.0 ON nid01768 CANCELLED AT 2016-06-08T19:32:44 ***
I also ran another run with 1 thread and it completed without issues.
I was using the next branch.
I tried debug runs. With 1 thread, it didn't have enough time to finish, but showed no errors. With 2 threads it does fail:
000: pionfput_mod.F90 128 1 1 64 1
000: 0001-01-01_00:00:00
108: forrtl: severe (174): SIGSEGV, segmentation fault occurred
384: forrtl: severe (408): fort: (7): Attempt to use pointer NVERTLEVELS when it is not associated with a target
384:
217: forrtl: severe (408): fort: (7): Attempt to use pointer NVERTLEVELS when it is not associated with a target
217:
288: forrtl: severe (408): fort: (7): Attempt to use pointer PTR when it is not associated with a target
288:
026: forrtl: severe (408): fort: (7): Attempt to use pointer PTR when it is not associated with a target
026:
393: forrtl: severe (408): fort: (7): Attempt to use pointer MAXLEVELCELL when it is not associated with a target
393:
220: forrtl: severe (408): fort: (7): Attempt to use pointer NVERTLEVELS when it is not associated with a target
220:
008: forrtl: severe (408): fort: (7): Attempt to use pointer CONFIG_N_TS_ITER when it is not associated with a target
008:
170: forrtl: severe (408): fort: (3): Subscript #1 of the array NORMALVELOCITYCUR has value 1 which is less than the lower bound of 4607182418800017408
170:
132: forrtl: severe (408): fort: (2): Subscript #2 of the array NORMALBAROCLINICVELOCITYCUR has value 1270 which is greater than the upper bound of 12
132:
307: forrtl: severe (408): fort: (3): Subscript #2 of the array NORMALBAROCLINICVELOCITYCUR has value 1 which is less than the lower bound of 140737488238024
307:
111: forrtl: severe (408): fort: (7): Attempt to use pointer NEDGES when it is not associated with a target
111:
154: forrtl: severe (408): fort: (3): Subscript #1 of the array NORMALBAROTROPICVELOCITYCUR has value 1 which is less than the lower bound of 4607182418800017408
Using a different Intel compiler version, again with debug and 2 threads, I get a slightly different error, which may or may not help us:
000: pionfput_mod.F90 128 1 1 64 1
000: 0001-01-01_00:00:00
184: *** glibc detected *** /scratch2/scratchdirs/ndk/acme_scratch/n04g904coupled05i17debug/bld/cesm.exe: malloc(): memory corruption: 0x00002aaaed5e1b30 ***
106: *** glibc detected *** /scratch2/scratchdirs/ndk/acme_scratch/n04g904coupled05i17debug/bld/cesm.exe: malloc(): memory corruption: 0x00002aaac8112f50 ***
280: *** glibc detected *** /scratch2/scratchdirs/ndk/acme_scratch/n04g904coupled05i17debug/bld/cesm.exe: malloc(): memory corruption: 0x00002aaaf4d0ad30 ***
107: *** glibc detected *** /scratch2/scratchdirs/ndk/acme_scratch/n04g904coupled05i17debug/bld/cesm.exe: malloc(): memory corruption: 0x00002aaad00fe170 ***
239: *** glibc detected *** /scratch2/scratchdirs/ndk/acme_scratch/n04g904coupled05i17debug/bld/cesm.exe: malloc(): memory corruption: 0x00002aaabc117b40 ***
356: forrtl: severe (408): fort: (3): Subscript #1 of the array TABLE has value 1215 which is less than the lower bound of 4294967297
356:
278: forrtl: severe (174): SIGSEGV, segmentation fault occurred
359: forrtl: severe (408): fort: (7): Attempt to use pointer NVERTLEVELS when it is not associated with a target
359:
325: forrtl: severe (408): fort: (7): Attempt to use pointer PTR when it is not associated with a target
I just now got a segmentation fault on Titan/pgi in the ocean when using threading (and the same case worked without threading). I'll verify that this is repeatable.
Note that I have been running GMPAS at oRRS15to5 with threading with no problems for the past week.
Just now repeated my original case, but with threading enabled in ICE. This time it worked fine. I'll repeat the larger experiment. If that succeeds, I'll close this issue.
The larger reproducer is also now working when building mpas-cice with threading. Closing this issue.
After running successfully on Titan, and hitting compile issues on Cetus, I tried running A_WCYCL2000 ne30_oEC on Edison with threading in MPAS-O. This is failing at runtime, I assume because of threading in the ocean.
Experiment:
and then ocean (and all other components)
except for ICE, which had only one thread, with
This failed at runtime. Repeated the experiment but without threading in the ocean:
This ran successfully. Repeated the first experiment, and it failed in an identical way.
with no relevant information in the log files:
cpl.log:
cesm.log:
ocn.log:
ice.log:
atm.log:
I'll try with more processes and without hyperthreading, but would like someone to try to repeat my experiment and independently verify the failure. Assigning this to @ndkeen , but hopefully @douglasjacobsen and @amametjanov can advise or perhaps take over this task, as appropriate.