E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Seg. fault / PMPI_Waitall with large negative counts when using max MPI parallelism in CAM-SE #762

Closed worleyph closed 8 years ago

worleyph commented 8 years ago

On Titan, when running either A_WCYCL2000 or FC5AV1C with ne30 atmosphere resolution, setting the number of MPI processes to 5400 leads to an error abort of the form:

  PMPI_Waitall(282): Negative count, value is -1071613056

I have verified that it is not happening in repro_sum or dp_coupling. After adding write statements, it appears to be occurring in the first call to prim_step. With these write statements, the PMPI error message disappears and you only get

  PE RANK 904 exit signal Segmentation fault

This does not mean that the problem is not occurring earlier and only showing up here, but this is as far as I have tracked it so far.

Since I have "bounced" so far with this, I'd like others to take a look as well. It appears to be Titan-specific, but this should also be verified on other systems. (Note that using 2700 MPI processes works fine.)

amametjanov commented 8 years ago

Pat, I couldn't reproduce. A case with

led to an abort with exit code 127 during initialization of ICE (git # bacb73afe53d4676921ed8e7fcbf2dcf7516e44d).

Case and run dirs:

Can you share your case-dir for FC5AV1C?

worleyph commented 8 years ago

Please check out

 ~worley/ACPI/SVN/ACME/master-test2/ACME/cime/scripts/A_WCYCL2000_ne30_oEC_titan_pgi_Bb

 /lustre/atlas1/cli112/scratch/worley/FC5AV1C-01_ne30_oEC_titan_pgi_5400/run

amametjanov commented 8 years ago

Running in debug mode shows an out-of-bounds access:

0: Subscript out of range for array buffer%receive (components/homme/src/share/bndry_mod.F90: 229)
    subscript=-5759, lower bound=1, upper bound=46080, dimension=1

It looks like buffer%moveptr gets corrupted, because ithr is still valid in MPI-only mode (ithr==0).

mt5555 commented 8 years ago

did you happen to get a traceback?

amametjanov commented 8 years ago

There was no stack trace; the log files are here:

amametjanov commented 8 years ago

Update: the run with the default PGI compiler above suggested an out-of-bounds access, but switching to Intel shows a floating overflow; in both cases the stack traces get corrupted with debug flags turned on.

 Opened existing file  
 /lustre/atlas1/cli900/world-shared/cesm/inputdata/atm/cam/inic/homme/cami_mam3_
 Linoz_ne30np4_L72_c160214.nc           0     
 Opened existing file  
 /lustre/atlas1/cli900/world-shared/cesm/inputdata/atm/cam/topo/USGS-gtopo30_ne3
 0np4_16xdel2-PFC-consistentSGH.nc           1     

 getMetaSchedule: tmpP:            4           1           3           5     
        3070           6           2          10        2395          14    
        3071          18        2394          19          -1          20    
 WARNING: Rearr optional argument is a pio2 feature, ignored in pio1
forrtl: error (72): floating overflow
Image              PC                Routine            Line        Source      
cesm.exe           0000000011976531  Unknown               Unknown  Unknown
cesm.exe           0000000011974C87  Unknown               Unknown  Unknown
cesm.exe           0000000011927124  Unknown               Unknown  Unknown
cesm.exe           0000000011926F36  Unknown               Unknown  Unknown
cesm.exe           00000000118B4AEF  Unknown               Unknown  Unknown
cesm.exe           00000000118BFEA9  Unknown               Unknown  Unknown
cesm.exe           00000000114AF020  Unknown               Unknown  Unknown
cesm.exe           00000000103368F6  Unknown               Unknown  Unknown

Stack trace terminated abnormally.

worleyph commented 8 years ago

Hi @amametjanov, any news on this topic? If not, I'll jump back in (at least to verify that nothing has changed).

amametjanov commented 8 years ago

Hi Pat, no, the error is still there with the recent git version v1.0.0-alpha.5-53-gf18a6e8. I ruled out bad data (with different ncdata files) and bad PIO settings (with different strides). The error occurs during graph decomposition and allocation of edges. I am looking into why tmpP(15) is -1 and trying on Edison. Please jump in with any other ideas.

worleyph commented 8 years ago

@amametjanov, FYI, I am also getting failures on Cetus with a similar experiment:

 -compset FC5AV1C-L -res ne30_oEC

 5400x1 in atmosphere
 5408x1 for all other components
 <entry id="MAX_TASKS_PER_NODE"   value="8"  />

(where 2700x1 and MAX_TASKS_PER_NODE=4 work fine). bgq_stack points to a memory issue (TLB?) in

 00000000020e6e90
 bndry_exchangev_threaded
 /gpfs/mira-home/worley/ACME/master/ACME/components/homme/src/share/bndry_mod_base.F90:480

and cesm.log indicates that

 2016-05-26 18:06:39.304 (WARN ) [0xfff7aa9c8e0] CET-40000-73731-1024:1689531:ibm.runjob.client.Job: terminated by signal 11
 2016-05-26 18:06:39.305 (WARN ) [0xfff7aa9c8e0] CET-40000-73731-1024:1689531:ibm.runjob.client.Job: abnormal termination by signal 11 from rank 2196

In contrast, on Titan I am tracking a memory issue inside the pack/unpack routines in

 cam/src/physics/cam/micro_mg_data.F90

However, I am leaning toward this being a code issue, since it occurs at around the same point in both runs (end of initialization on Titan and early in the first timestep on Cetus).

Could you please try to reproduce this error on Cetus?

Thanks.

worleyph commented 8 years ago

Update: on Cetus, using 2700x1 in ATM and 2712x1 otherwise with MAX_TASKS_PER_NODE=8 also works fine, so the issue with 5400x1 ATM is not (obviously) due to too large a memory footprint. Note that I am using

 <entry id="PIO_STRIDE"   value="128"  />

to avoid PIO issues that arise when using the default (== 4).

worleyph commented 8 years ago

Update: I was sloppy in my experimental design, and mixed up two different issues.

a) pgi/16.3 does not like the microphysics code, and aborts at runtime with malloc errors, segmentation faults, or arithmetic exceptions (probably all memory related). This occurs at more than just maximum scale. As pgi/16.3 is not the default version that we are using, this is easily avoided, but we will need to track whether it will hurt us in the future. I had no luck pinpointing the location, but am pretty sure that it is just in the microphysics.

b) pgi/15.3 demonstrates the originally reported problem, which is in the dynamics and shows up in the MPI logic (though the source may be elsewhere). pgi/15.3 on Titan and the Cetus run (so the IBM compiler) may have the same error signature? I'll move my focus to this problem and ignore the pgi/16.3 problem for the moment.

Sorry for the confusion.

worleyph commented 8 years ago

Tracked down the problem to the call to initEdgeSBuffer, which calls initEdgeBuffer with nMethod set to .TRUE. (All of the other calls to initEdgeBuffer use the "old" method.)

(This shows up when calling neighbor_minmax, which calls bndry_exchangeS and uses the schedule information calculated in initEdgeSBuffer.)

When using the new method, you can have

 pSchedule%MoveCycle(1)%lengthS = 1

and

 pSchedule%MoveCycle(1)%ptrS = 0

which then leads to a positive moveLength and a negative pointer value when nlyr > 0:

    moveLength = nlyr*pSchedule%MoveCycle(1)%lengthS
    ptr       = nlyr*(pSchedule%MoveCycle(1)%ptrS -1) + 1
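
A minimal standalone sketch of that arithmetic (the program and the value nlyr = 72 are assumed purely for illustration; only the two formulas above are taken from the code):

    ! Illustration of the new-method formulas with the values seen in the failing case.
    program move_ptr_demo
      implicit none
      integer :: nlyr, lengthS, ptrS, moveLength, ptr

      nlyr    = 72   ! illustrative layer count; any nlyr > 0 shows the problem
      lengthS = 1    ! stands in for pSchedule%MoveCycle(1)%lengthS
      ptrS    = 0    ! stands in for pSchedule%MoveCycle(1)%ptrS

      moveLength = nlyr*lengthS          ! = 72, so a move is attempted
      ptr        = nlyr*(ptrS - 1) + 1   ! = -71, a negative buffer offset

      print *, 'moveLength =', moveLength, ' ptr =', ptr
    end program move_ptr_demo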

In contrast, for the old method, when

 pSchedule%MoveCycle(1)%ptrP = 0

we also have

 pSchedule%MoveCycle(1)%lengthP = 0

The relevant code is in edge_mod_base.F90.

I know way too little to try to fix this. (Is it an error to use both the old and new methods? Is it a bug for lengthS to be > 0 and ptrS == 0, or is the error in the formula for moveLength and ptr for the new method? Is this a bug in other situations as well, e.g., when there is more than one cell per process?)

Reassigning this to the HOMME expert ( @mt5555 ).

worleyph commented 8 years ago

I also changed this to critical, as it may indicate a bug even when not using maximum parallelism.

mt5555 commented 8 years ago

I confirmed this bug in standalone HOMME with ne16 on 1536 MPI tasks: ptr < 0. The code still runs on the SNL Linux cluster, but is surely producing incorrect results.

On 768 MPI tasks, ptr is always >=1.

will debug further....

mt5555 commented 8 years ago

I was able to reproduce this behavior with ne4 on 96 cores. This one-line fix (a missing initialization) resolves it in that case:

--- a/components/homme/src/share/schedule_mod.F90
+++ b/components/homme/src/share/schedule_mod.F90
@@ -133,6 +133,7 @@ contains
     LSchedule%MoveCycle(1)%ptrP = 0
     LSchedule%MoveCycle(1)%ptrS = 0
     LSchedule%MoveCycle(1)%lengthP = 0
+    LSchedule%MoveCycle(1)%lengthS = 0
     if(Debug) write(iulog,*)'genEdgeSched: point #6'

     !==================================================================
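
Read against the formulas quoted above (my reading, not a verified trace through the code), initializing lengthS to 0 means moveLength = nlyr*0 = 0 for a schedule with no local move, so the negative ptr is never used. A minimal standalone sketch:

    ! Illustration only (not HOMME code): effect of initializing lengthS to 0.
    program fixed_move_demo
      implicit none
      integer :: nlyr, lengthS, moveLength

      nlyr    = 72   ! illustrative layer count
      lengthS = 0    ! now explicitly initialized by the patch

      moveLength = nlyr*lengthS   ! = 0: no local move is attempted
      print *, 'moveLength =', moveLength
    end program fixed_move_demo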

worleyph commented 8 years ago

Thanks. Giving it a try.

mt5555 commented 8 years ago

@gold2718: it looks like this bugfix should also be applied to the NCAR version of HOMME.

worleyph commented 8 years ago

@mt5555, the fix eliminated the seg. fault in my reproducer on Titan/pgi

 -compset FC5AV1C-L -res ne30_oEC

with a PE layout of 5400x1, and I ran successfully for one simulated day.

ndkeen commented 8 years ago

FYI, I'm also finding the same error as Az in the coupled simulation I'm running on Edison. I will try the fix.

ndkeen commented 8 years ago

Is this ready for a PR?