
Anomalous initialization messages in SMP build #2969

Closed: minitu closed this issue 3 years ago

minitu commented 4 years ago

The following anomalous initialization messages show up with the 1darray hello example when it is run with multiple PEs per process (e.g. 4 processes with 10 PEs per process) using the pamilrts SMP build on LLNL Lassen:

choi18@lassen7:1darray$ jsrun -n4 -a1 -c10 -g1 -K2 -r4 ./hello 80 +ppn 10 +pemap L0-160:4
Choosing optimized barrier algorithm name I0:HybridBinomial:SHMEM:P2P
Charm++> Running in SMP mode: 4 processes, 10 worker threads (PEs) + 0 comm threads per process, 40 PEs total
Charm++> There's no comm. thread. Work threads both send and receive messages
Choosing optimized barrier algorithm name I0:HybridBinomial:SHMEM:P2P
Charm++> Running in SMP mode: 4 processes, 10 worker threads (PEs) + 0 comm threads per process, 40 PEs total
Charm++> There's no comm. thread. Work threads both send and receive messages
Choosing optimized barrier algorithm name I0:HybridBinomial:SHMEM:P2P
Charm++> Running in SMP mode: 4 processes, 10 worker threads (PEs) + 0 comm threads per process, 40 PEs total
Charm++> There's no comm. thread. Work threads both send and receive messages
Choosing optimized barrier algorithm name I0:HybridBinomial:SHMEM:P2P
Charm++> Running in SMP mode: 4 processes, 10 worker threads (PEs) + 0 comm threads per process, 40 PEs total
Charm++> There's no comm. thread. Work threads both send and receive messages
Converse/Charm++ Commit ID: v6.11.0-devel-283-ge4a6e7b3f
Isomalloc> Synchronized global address space.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map (logical indices): 0-160:4
Charm++> Running on 1 hosts (2 sockets x 20 cores x 4 PUs = 160-way SMP)
Charm++> cpu topology info is gathered in 0.005 seconds.
Running Hello on 40 processors for 80 elements
...

I did a git bisect and found the offending commit to be #2531.

Before this commit:

choi18@lassen7:1darray$ jsrun -n4 -a1 -c10 -g1 -K2 -r4 ./hello 80 +ppn 10 +pemap L0-160:4
Choosing optimized barrier algorithm name I0:HybridBinomial:SHMEM:P2P
Charm++> Running in SMP mode: 4 processes, 10 worker threads (PEs) + 0 comm threads per process, 40 PEs total
Charm++> There's no comm. thread. Work threads both send and receive messages
Converse/Charm++ Commit ID: v6.11.0-devel-282-g431a3e508
Isomalloc> Synchronized global address space.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled. 
Charm++> cpuaffinity PE-core map (logical indices): 0-160:4
Charm++> Running on 1 hosts (2 sockets x 20 cores x 4 PUs = 160-way SMP)
Charm++> cpu topology info is gathered in 0.005 seconds.
Running Hello on 40 processors for 80 elements
...
minitu commented 4 years ago

@rbuch @kavithachandrasekar Any ideas on what in #2531 could have caused this? I did a little digging and it looks like _Cmi_mynode is not set properly in LrtsInit, causing multiple processes to print the configurations.
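For illustration only (this is not the actual machine-layer code): Converse startup banners are normally printed only by node 0, so if _Cmi_mynode is left at its default of 0 on every process when the banners are emitted, each process believes it is node 0 and prints its own copy. A minimal sketch of that guard pattern, assuming the standard CmiMyNode()/CmiNumNodes()/CmiPrintf() Converse API (the helper name is hypothetical):

#include "converse.h"  /* provides CmiMyNode(), CmiNumNodes(), CmiPrintf() */

/* Hypothetical helper, not taken from the Charm++ sources: print the SMP
 * startup banner from node 0 only.  If _Cmi_mynode has not been set yet,
 * CmiMyNode() returns 0 on every process and all of them print. */
static void print_smp_banner(void)
{
  if (CmiMyNode() == 0) {
    CmiPrintf("Charm++> Running in SMP mode: %d processes\n", CmiNumNodes());
  }
}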

rbuch commented 4 years ago

> @rbuch @kavithachandrasekar Any ideas on what in #2531 could have caused this? I did a little digging and it looks like _Cmi_mynode is not set properly in LrtsInit, causing multiple processes to print the configurations.

Hmm, no, I'm not quite sure what would be causing that. If I'm reading the code correctly, it looks like _Cmi_mynode is set directly from a value returned by the PAMI API, so I don't know how the LB refactor could get in the way of that. Have you seen this issue anywhere else? I remember there being issues (not this specific one, just general problems) on the SMP version of pamilrts a long time back, but I assume those have been resolved since then, @nitbhat?

stwhite91 commented 4 years ago

@philmiller-charmworks was seeing this error (a segfault when accessing _Cmi_mynode during startup) when running mpi4py on AMPI last week. @evan-charmworks may know more about it too.

evan-charmworks commented 4 years ago

I got the following weird outputs with multicore-linux-x86_64:

Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode: 32 threads (PEs)
Converse/Charm++ Commit ID: v6.11.0-devel-355-g3a973d587
Converse/Charm++ Commit ID: v6.11.0-devel-355-g3a973d587
Converse/Charm++ Commit ID: v6.11.0-devel-355-g3a973d587
Converse/Charm++ Commit ID: v6.11.0-devel-355-g3a973d587
Charm++ built without optimization.
Do not use for perforCharm++ built without optimization.
Do not use for perforCharm++ built without optimization.
Do not use for performance benchmarking (build with --with-production to do so).
Charm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-eCharm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-error-checking to do so).
Charm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-eCharm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-eCharm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-eCharm++ built with internal error checking enabled.
Do noCharm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-error-checking to do so).
Charm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-error-checking to do so).
Charm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-error-checking to do so).
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled. 
Charm++> Running on 1 hosts (1 sockets x 4 cores x 2 PUs = 8-way SMP)
Charm++> cpu topology info is gathered in 0.017 seconds.
WARNING: Multiple PEs assigned to same core, recommend adjusting processor affinity or passing +CmiSleepOnIdle to reduce interference.
[0] TreeLB in LEGACY MODE support
[0] TreeLB: Using PE_Root tree with strategy Greedy
send: completed
zerocopySend: completed
mixedSend: completed
sdagRun: Iteration 2 completed
sdagRun: Iteration 3 completed
sdagRun: Iteration 4 completed
sdagRun: Iteration 5 completed
sdagRun: Iteration 6 completed
sdagRun: Iteration 7 completed
sdagRun: Iteration 8 completed
sdagRun: Iteration 9 completed
sdagRun: Iteration 10 completed
sdagRun: Iteration 11 completed
sdagRun: Iteration 12 completed
sdagRun: Iteration 13 completed
sdagRun: Iteration 14 completed
sdagRun: Iteration 15 completed
sdagRun: Iteration 16 completed
sdagRun: Iteration 17 completed
sdagRun: Iteration 18 completed
sdagRun: Iteration 19 completed
sdagRun: Iteration 20 completed
sdagRun: Iteration 21 completed
sdagRun: Iteration 22 completed
sdagRun: Iteration 23 completed
sdagRun: Iteration 24 completed
sdagRun: Iteration 25 completed
sdagRun: Iteration 26 completed
sdagRun: Iteration 27 completed
sdagRun: Iteration 28 completed
sdagRun: Iteration 29 completed
sdagRun: Iteration 30 completed
sdagRun: Iteration 31 completed
sdagRun: Iteration 32 completed
sdagRun: Iteration 33 completed
sdagRun: Iteration 34 completed
sdagRun: Iteration 35 completed
sdagRun: Iteration 36 completed
sdagRun: Iteration 37 completed
sdagRun: Iteration 38 completed
sdagRun: Iteration 39 completed
sdagRun: Iteration 40 completed
sdagRun: completed
All sending completed and result validated
[Partition 0][Node 0] End of program
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode: 32 threads (PEs)
Converse/Charm++ Commit ID: v6.11.0-devel-355-g3a973d587
Charm++ built without optimization.
Do not use for perforCharm++ built without optimization.
Do not use for performance benchmarking (build with --with-production to do so).
Charm++ built with internal error checking enabled.
Do noCharm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-error-checking to do so).
Charm++ built with internal error checking enabled.
Do noCharm++ built with internal error checking enabled.
Do noCharm++ built with internal error checking enabled.
Do noCharm++ built with internal error checking enabled.
Do noCharm++ built with internal error checking enabled.
Do noCharm++ built with internal error checking enabled.
Do noCharm++ built with internal error checking enabled.
Do noCharm++ built with internal error checking enabled.
Do noCharm++ built with internal error checking enabled.
Do noCharm++ built with internal error checking enabled.
Do noCharm++ built with internal error checking enabled.
Do noCharm++ built with internal error checking enabled.
Do noCharm++ built with internal error checking enabled.
Do noCharm++ built with internal error checking enabled.
Do noCharm++ built with internal error checking enabled.
Do noCharm++ built with internal error checking enabled.
Do noCharm++ built with internal error checking enabled.
Do noCharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled. 
Charm++> Running on 1 hosts (1 sockets x 4 cores x 2 PUs = 8-way SMP)
Charm++> cpu topology info is gathered in 0.020 seconds.
WARNING: Multiple PEs assigned to same core, recommend adjusting processor affinity or passing +CmiSleepOnIdle to reduce interference.
[0] TreeLB in LEGACY MODE support
[0] TreeLB: Using PE_Root tree with strategy Greedy
Segmentation fault (core dumped)

If the generated core dump is correct, the segfault occurs in the code that prints the "Multiple PEs assigned to same core" abort message.

evan-charmworks commented 3 years ago

Something similar happened with the GNI SMP autobuild:

http://charm.cs.illinois.edu/autobuild/old.2021_04_29__01_04/gni-crayxc-smp.txt

../../../bin/testrun  ./hello 10 +p4 ++ppn 2  +CmiSleepOnIdle
ModuleCmd_Switch.c(179):ERROR:152: Module 'PrgEnv-intel' is currently not loaded
ModuleCmd_Switch.c(179):ERROR:152: Module 'PrgEnv-intel' is currently not loaded

Running as 2 OS processes:  ./hello 10 +ppn 2 +CmiSleepOnIdle
srun -n 2 -c 3 ./hello 10 +ppn 2 +CmiSleepOnIdle
Charm++> Running on Gemini (GNI) with 2 processes
Charm++> static SMSG
Charm++> SMSG memory: 9.9KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 8192K
Charm++> Running in SMP mode: 2 processes, 2 worker threads (PEs) + 1 comm threads per process, 4 PEs total
Charm++> The comm. thread both sends and receives messages
Charm++> Running on Gemini (GNI) with 2 processes
Charm++> static SMSG
Charm++> SMSG memory: 9.9KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 8192K
Charm++> Running in SMP mode: 2 processes, 2 worker threads (PEs) + 1 comm threads per process, 4 PEs total
Charm++> The comm. thread both sends and receives messages
Charm++> Running on Gemini (GNI) with 2 processes
Charm++> static SMSG
Charm++> SMSG memory: 9.9KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 8192K
Charm++> Running in SMP mode: 2 processes, 2 worker threads (PEs) + 1 comm threads per process, 4 PEs total
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: 32d3e2b
Charm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-error-checking to do so).
*** Error in `/global/project/projectdirs/m2609/autobuild/gni-crayxc-smp/charm/gni-crayxc-smp/tests/charm++/simplearrayhello/./hello': double free or corruption (top): 0x00002aaaf40017b0 ***
srun: error: nid02220: task 1: Aborted
srun: Terminating job step 42100204.89
slurmstepd: error: *** STEP 42100204.89 ON nid02220 CANCELLED AT 2021-04-29T06:15:58 ***
srun: error: nid02220: task 0: Terminated
srun: Force Terminated job step 42100204.89

real    0m1.753s
user    0m0.105s
sys 0m0.110s
make[3]: *** [Makefile:28: smptest] Error 143
make[3]: Leaving directory '/global/project/projectdirs/m2609/autobuild/gni-crayxc-smp/charm/gni-crayxc-smp/tests/charm++/simplearrayhello'
make[2]: *** [Makefile:75: smptest-simplearrayhello] Error 2
trquinn commented 3 years ago

I am getting a *** Error in '/scratch/e1000/trq/bench/./ChaNGa.smp': double free or corruption (out): 0x00002b0200000ca0 *** that I bisected back to the same commit (#2531). The backtrace points to the following lines in convcore.C:

#if CMK_HAS_IO_FILE_OVERFLOW
  // forcibly allocate output buffers now, see issue #2814
  _IO_file_overflow(stdout, -1);
  _IO_file_overflow(stderr, -1);
#endif

This is on a Cray EX, built with the mpi-linux-x86_64 SMP target.
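For context: _IO_file_overflow(fp, -1) is a glibc-internal entry point that forces the stream's output buffer to be allocated and flushed. A portable way to pre-allocate the buffer, shown here purely as an illustrative sketch and not as the project's actual approach, is to install it explicitly with the standard setvbuf() call early in startup, before any output is written:

#include <stdio.h>

/* Illustrative sketch only: pre-allocate the stdout buffer with standard
 * setvbuf() instead of the glibc-internal _IO_file_overflow().  setvbuf()
 * must be the first operation performed on the stream; the buffer size and
 * line-buffered mode here are arbitrary choices. */
static char stdout_buf[BUFSIZ];

static void preallocate_stdout_buffer(void)  /* hypothetical helper name */
{
  setvbuf(stdout, stdout_buf, _IOLBF, sizeof(stdout_buf));
}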

trquinn commented 3 years ago

Commenting out the above lines fixes the double free and also gets rid of the anomalous initialization messages reported above.
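A hypothetical mitigation, sketched only for illustration and not taken from any actual fix: if the double free comes from several worker threads reaching these calls concurrently in an SMP or multicore build, the pre-allocation could be guarded so that it runs exactly once per process.

#include <cstdio>
#include <mutex>

extern "C" int _IO_file_overflow(FILE *, int);  /* glibc-internal, as used in convcore.C */

/* Illustrative sketch only: run the buffer pre-allocation once per process,
 * no matter how many threads reach this point. */
static std::once_flag io_overflow_once;  /* hypothetical name */

static void preallocate_io_buffers_once(void)
{
  std::call_once(io_overflow_once, [] {
    _IO_file_overflow(stdout, -1);
    _IO_file_overflow(stderr, -1);
  });
}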

rbuch commented 3 years ago

@evan-charmworks Are these _IO_file_overflow calls needed here? Are they needed to fix #2814?