Closed: worleyph closed this issue 8 years ago.
Pat, I couldn't reproduce: my case led to an abort with exit code 127 during initialization of ICE (git hash bacb73afe53d4676921ed8e7fcbf2dcf7516e44d).
Case and run dirs:
Can you share your case-dir for FC5AV1C?
Please check out
~worley/ACPI/SVN/ACME/master-test2/ACME/cime/scripts/A_WCYCL2000_ne30_oEC_titan_pgi_Bb
/lustre/atlas1/cli112/scratch/worley/FC5AV1C-01_ne30_oEC_titan_pgi_5400/run
Running in debug mode shows an out-of-bounds access:
0: Subscript out of range for array buffer%receive (components/homme/src/share/bndry_mod.F90: 229)
subscript=-5759, lower bound=1, upper bound=46080, dimension=1
It looks like buffer%moveptr is getting corrupted: ithr is still valid in MPI-only mode (ithr==0), so the bad index is not coming from the thread number.
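For context, a schematic sketch (not the bndry_mod code; only the receive-buffer bounds and the bad subscript come from the message above, while the loop, array shapes, variable names, and values are my own) of how a corrupted move pointer would surface as exactly that out-of-range report:

program sketch_corrupted_moveptr
  implicit none
  ! Schematic only: buffer%receive has bounds 1:46080 per the error message.
  real    :: receive(46080), dest(46080)
  integer :: moveptr, movelength, i

  moveptr    = -5760   ! a corrupted base pointer (illustrative value)
  movelength = 10      ! illustrative cycle length
  do i = 1, movelength
     if (moveptr + i < 1 .or. moveptr + i > size(receive)) then
        ! first trip reports receive(-5759), the subscript seen in the log
        print *, 'would access receive(', moveptr + i, '): out of bounds'
     else
        dest(i) = receive(moveptr + i)
     end if
  end do
end program sketch_corrupted_moveptr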
Did you happen to get a traceback?
There was no stack trace; the log files are here:
/lustre/atlas1/cli112/scratch/azamat/FC5AV1C-01-ne30_oEC-01-5400/run
~azamat/repos/ACME/cime/scripts/cases/FC5AV1C-01-ne30_oEC-01-5400
Update: the run above with the default PGI compiler suggested an out-of-bounds access, but switching to Intel shows a floating overflow; in both cases the stack traces are corrupted even with debug flags turned on.
Opened existing file /lustre/atlas1/cli900/world-shared/cesm/inputdata/atm/cam/inic/homme/cami_mam3_Linoz_ne30np4_L72_c160214.nc 0
Opened existing file /lustre/atlas1/cli900/world-shared/cesm/inputdata/atm/cam/topo/USGS-gtopo30_ne30np4_16xdel2-PFC-consistentSGH.nc 1
getMetaSchedule: tmpP: 4 1 3 5
3070 6 2 10 2395 14
3071 18 2394 19 -1 20
WARNING: Rearr optional argument is a pio2 feature, ignored in pio1
forrtl: error (72): floating overflow
Image PC Routine Line Source
cesm.exe 0000000011976531 Unknown Unknown Unknown
cesm.exe 0000000011974C87 Unknown Unknown Unknown
cesm.exe 0000000011927124 Unknown Unknown Unknown
cesm.exe 0000000011926F36 Unknown Unknown Unknown
cesm.exe 00000000118B4AEF Unknown Unknown Unknown
cesm.exe 00000000118BFEA9 Unknown Unknown Unknown
cesm.exe 00000000114AF020 Unknown Unknown Unknown
cesm.exe 00000000103368F6 Unknown Unknown Unknown
Stack trace terminated abnormally.
Hi @amametjanov, anything new on this topic? If not, I'll jump back in (at least to verify that nothing has changed).
Hi Pat, no, the error is still there with the recent git version v1.0.0-alpha.5-53-gf18a6e8. I ruled out bad input data by trying different ncdata files, and bad PIO settings by trying different strides. The error occurs during graph decomposition and allocation of edges. I am looking into why tmpP(15) is -1 and trying the case on Edison. Please jump in with any other ideas.
@amametjanov , FYI, I am also getting failures on Cetus with a similar experiment:
-compset FC5AV1C-L -res ne30_oEC
5400x1 in atmosphere
5408x1 for all other components
<entry id="MAX_TASKS_PER_NODE" value="8" />
(where 2700x1 with MAX_TASKS_PER_NODE=4 works fine). bgq_stack points to a memory issue (TLB?) in
00000000020e6e90
bndry_exchangev_threaded
/gpfs/mira-home/worley/ACME/master/ACME/components/homme/src/share/bndry_mod_base.F90:480
and cesm.log shows:
2016-05-26 18:06:39.304 (WARN ) [0xfff7aa9c8e0] CET-40000-73731-1024:1689531:ibm.runjob.client.Job: terminated by signal 11
2016-05-26 18:06:39.305 (WARN ) [0xfff7aa9c8e0] CET-40000-73731-1024:1689531:ibm.runjob.client.Job: abnormal termination by signal 11 from rank 2196
In contrast, on Titan I am tracking a memory issue inside the pack/unpack routines in
cam/src/physics/cam/micro_mg_data.F90
However, I am leaning toward this being a code issue, since both failures occur at around the same point (end of initialization on Titan and early in the first timestep on Cetus).
Could you please try to reproduce this error on Cetus?
Thanks.
Update: on Cetus, using 2700x1 in ATM and 2712x1 otherwise with MAX_TASKS_PER_NODE=8 also works fine, so the issue with 5400x1 in ATM is not (obviously) due to too large a memory footprint. Note that I am using
<entry id="PIO_STRIDE" value="128" />
to avoid PIO issues that arise when using the default (== 4).
Update: I was sloppy in my experimental design, and mixed up two different issues.
a) pgi/16.3 does not like the microphysics code and aborts at runtime with malloc errors, segmentation faults, or arithmetic exceptions (probably all memory related). This occurs at more than just maximum scale. As pgi/16.3 is not the default version that we are using, this is easily avoided, but we will need to track whether it will hurt us in the future. I had no luck pinpointing the location, but am pretty sure that it is in the microphysics.
b) pgi/15.3 demonstrates the originally reported problem, which is in the dynamics and shows up in the MPI logic (though the source may be elsewhere). The pgi/15.3 failure on Titan and the Cetus run (so, the IBM compiler) may have the same error signature. I'll move my focus to this problem and ignore the pgi/16.3 problem for the moment.
Sorry for the confusion.
I tracked the problem down to the call to initEdgeSBuffer, which calls initEdgeBuffer with nMethod set to .TRUE. (All of the other calls to initEdgeBuffer use the "old" method.)
(This shows up when calling neighbor_minmax, which calls bndry_exchangeS and uses the schedule information calculated in initEdgeSBuffer.)
When using the new method, you can have
pSchedule%MoveCycle(1)%lengthS = 1
and
pSchedule%MoveCycle(1)%ptrS = 0
which then leads to a positive moveLength and a pointer at or below zero whenever nlyr > 0:
moveLength = nlyr*pSchedule%MoveCycle(1)%lengthS
ptr = nlyr*(pSchedule%MoveCycle(1)%ptrS -1) + 1
In contrast, for the old method, when
pSchedule%MoveCycle(1)%ptrP = 0
we also have
pSchedule%MoveCycle(1)%lengthP = 0.
The relevant code is in edge_mod_base.F90.
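To make the arithmetic concrete, here is a minimal sketch (my own illustration, not ACME code; nlyr = 72 is an assumed layer count) that plugs the values quoted above into those two formulas:

program sketch_new_method_ptr
  implicit none
  integer, parameter :: nlyr    = 72  ! assumed number of layers
  integer, parameter :: lengthS = 1   ! pSchedule%MoveCycle(1)%lengthS, as observed
  integer, parameter :: ptrS    = 0   ! pSchedule%MoveCycle(1)%ptrS, as observed
  integer :: moveLength, ptr

  moveLength = nlyr*lengthS           ! = 72  -> a positive amount of data to move
  ptr        = nlyr*(ptrS - 1) + 1    ! = -71 -> below the buffer's lower bound of 1
  print *, 'moveLength =', moveLength, ', ptr =', ptr
end program sketch_new_method_ptr

With the old method's pairing (ptrP = 0 together with lengthP = 0), the zero length means nothing is moved and the degenerate pointer is harmless; with the new method, a nonzero length is combined with an invalid base pointer.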
I know way too little to try to fix this. (Is it an error to use both the old and new methods? Is it a bug for lengthS to be > 0 while ptrS == 0, or is the error in the formulas for moveLength and ptr for the new method? Is this a bug in other situations as well, e.g., when there is more than one cell per process?)
Reassigning this to the HOMME expert ( @mt5555 ).
Also changed this to critical as this may indicate a bug even when not using maximum parallelism.
I confirmed this bug in standalone HOMME with ne16 on 1536 MPI tasks: ptr < 0. The code still runs on the SNL Linux cluster, but it is surely producing incorrect results.
On 768 MPI tasks, ptr is always >= 1.
Will debug further...
I was able to reproduce this behavior with ne4 on 96 cores. This one-line fix (a missing initialization) resolves it in that case:
--- a/components/homme/src/share/schedule_mod.F90
+++ b/components/homme/src/share/schedule_mod.F90
@@ -133,6 +133,7 @@ contains
LSchedule%MoveCycle(1)%ptrP = 0
LSchedule%MoveCycle(1)%ptrS = 0
LSchedule%MoveCycle(1)%lengthP = 0
+ LSchedule%MoveCycle(1)%lengthS = 0
if(Debug) write(iulog,*)'genEdgeSched: point #6'
!==================================================================
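If I am reading the fix correctly (my interpretation, not verified against the full code path), lengthS was the only one of the four MoveCycle(1) fields shown here without a default, so on ranks where it was never assigned later it could hold a junk value (such as the lengthS = 1 reported above) while ptrS stayed 0. Defaulting it to 0 makes moveLength evaluate to 0, matching the old-method behavior where lengthP = 0 whenever ptrP = 0, so the degenerate pointer is never used.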
Thanks. Giving it a try.
@gold2718 : it looks like this bugfix should also be applied to the NCAR version of HOMME.
@mt5555 , the fix eliminated the seg. fault in my reproducer on Titan/pgi
-compset FC5AV1C-L -res ne30_oEC
with a PE layout of 5400x1, and I ran successfully for one simulated day.
FYI, I'm also finding the same error as Az in the coupled simulation I'm running on Edison. I will try the fix.
Is this ready for a PR?
On Titan, when running either A_WCYCL2000 or FC5AV1C with ne30 atmosphere resolution, setting the number of MPI processes to 5400 leads to an error abort of the form:
I have verified that it is not happening in repro_sum or dp_coupling. After adding write statements, it appears to be occurring in the first call to prim_step. With these write statements the PMPI error message disappears and you only get:
This does not mean that the problem is not occurring earlier and just showing up here, but this is as far as I have tracked it.
Since I have "bounced" so far with this, I'd like others to take a look as well. It appears to be Titan-specific, but this should also be verified on other systems. (Note that using 2700 MPI processes works fine.)