charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

segfault during ibverbs-smp startup on 512 nodes of Stampede #431

Closed: jcphill closed this issue 10 years ago

jcphill commented 10 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/431


At some point between Feb 17 (v6.6.0-rc2-0-gbcd2533-namd-charm-6.6-build-2014-Feb-17-19385) and Feb 24, net-linux-x86_64-ibverbs-smp-iccstatic on 512 nodes of Stampede started crashing very early during startup:

Charmrun> started all node programs in 7.119 seconds.
Charmrun> IBVERBS version of charmrun
Converse/Charm++ Commit ID: v6.6.0-rc2-12-g3760b51-namd-charm-6.6-build-2014-Feb-24-6826
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 1-15
Charm++> set comm 0 on node 0 to core #0
Charm++> Running on 512 unique compute nodes (16-way SMP).
Charm++> cpu topology info is gathered in 0.055 seconds.
Info: NAMD 2.9 for Linux-x86_64-ibverbs-smp-Stampede-memopt
Warning:
Warning: EXPERIMENTAL MEMORY OPTIMIZED VERSION
Warning:
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: for updates, documentation, and support information.
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 60600 for net-linux-x86_64-ibverbs-smp-iccstatic
Info: Built Mon Feb 24 13:36:32 CST 2014 by tg455591 on login3.stampede.tacc.utexas.edu
Info: 1 NAMD 2.9 Linux-x86_64-ibverbs-smp-Stampede-memopt 7680 c401-402.stampede.tacc.utexas.edu tg455591
Info: Running on 7680 processors, 512 nodes, 512 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.784644 s
------------- Processor 4530 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
[4530] Stack Traceback:
[4530:0] [0xd64ac1]
[4530:1] [0xabf5cb]
[4530:2] [0xabf36c]
[4530:3] [0xba883e]
[4530:4] [0xbaaef0]
[4530:5] [0xd2b54e]
[4530:6] [0xd65176]
[4530:7] [0xba1005]
[4530:8] [0x4b3c98]
[4530:9] [0x4ab681]
[4530:10] __libc_start_main+0xfd [0x316f21ecdd]
[4530:11] [0x40f119]
Fatal error on PE 4530> segmentation violation
starting replica run /work/00288/tg455591/NAMD_build.latest/NAMD_2.9_Linux-x86_64-verbs-smp-Stampede
TACC: Starting up job 2963613
TACC: Setting up parallel environment for MVAPICH2+mpispawn.
TACC: Starting parallel tasks...

TACC: Shutdown complete. Exiting.

This is what a successful run with a Feb 17 binary looks like:

Charmrun> started all node programs in 5.093 seconds.
Charmrun> IBVERBS version of charmrun
Converse/Charm++ Commit ID: v6.6.0-rc2-0-gbcd2533-namd-charm-6.6-build-2014-Feb-17-19385
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 1-15
Charm++> set comm 0 on node 0 to core #0
Charm++> Running on 512 unique compute nodes (16-way SMP).
Charm++> cpu topology info is gathered in 0.207 seconds.
Info: NAMD 2.9 for Linux-x86_64-ibverbs-smp-Stampede-memopt
Warning:
Warning: EXPERIMENTAL MEMORY OPTIMIZED VERSION
Warning:
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: for updates, documentation, and support information.
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 60600 for net-linux-x86_64-ibverbs-smp-iccstatic
Info: Built Mon Feb 17 09:48:24 CST 2014 by tg455591 on login3.stampede.tacc.utexas.edu
Info: 1 NAMD 2.9 Linux-x86_64-ibverbs-smp-Stampede-memopt 7680 c401-402.stampede.tacc.utexas.edu tg455591
Info: Running on 7680 processors, 512 nodes, 512 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.891502 s
Info: 3315.19 MB of memory in use based on /proc/self/stat
Info: Configuration file is /work/00288/tg455591/stmv/210stmv.namd
Info: Changed directory to /work/00288/tg455591/stmv
TCL: Suspending until startup complete.
...
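The two "Commit ID" strings in the logs follow the usual `git describe` shape (`<tag>-<commits since tag>-g<abbreviated hash>`), so the size of the suspect window can be read straight from them. A minimal sketch, assuming the trailing `-namd-charm-6.6-build-...` part is a label appended by the build scripts and not part of the `git describe` output; `parse_describe` is a hypothetical helper name:

```python
import re

# <tag>-<count>-g<sha>; anything after the hash is assumed to be a build label.
DESCRIBE_RE = re.compile(r"^(?P<tag>.+?)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)")

def parse_describe(commit_id):
    """Split a git-describe style string into (tag, commits since tag, short hash)."""
    m = DESCRIBE_RE.match(commit_id)
    if m is None:
        raise ValueError(f"not a git-describe style id: {commit_id!r}")
    return m.group("tag"), int(m.group("count")), m.group("sha")

good = parse_describe("v6.6.0-rc2-0-gbcd2533-namd-charm-6.6-build-2014-Feb-17-19385")
bad = parse_describe("v6.6.0-rc2-12-g3760b51-namd-charm-6.6-build-2014-Feb-24-6826")

# Both ids are relative to the same tag, and the good build sits on the tag itself
# (count 0), so the bad build's count is the number of candidate commits.
if good[0] == bad[0] and good[1] == 0:
    print(f"{bad[1]} commits between {good[2]} and {bad[2]}")
```

With a window that small, listing the candidates with `git log bcd2533..3760b51` is probably enough to pick suspects by hand.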

PhilMiller commented 5 years ago

Original date: 2014-03-10 20:13:43


Assuming the crash is readily reproducible, could we get the associated stack trace from that?

Likely culprits include c5c151e8 (CkArrayOptions: missing pup of new 'array bounds' field) if the crash occurs during/around setup of a chare array, and a10274e311 (Ibverbs: Fix Bug #305: Cannot launch on stampede with >4k processes).

The latter case can be easily tested by passing an argument other than 1000 to +IBVMaxSendTokens.
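A rough sketch of how such a test run might be assembled. Everything except the `+IBVMaxSendTokens` flag and the value 1000 is an assumption: the launch line is hypothetical, the space-separated `flag value` form and its position at the end of the command are guesses, and `with_send_tokens` is a made-up helper name.

```python
SUSPECT_TOKENS = 1000  # the value named above; passing it again would test nothing

def with_send_tokens(base_cmd, tokens):
    """Return a copy of base_cmd with a non-default +IBVMaxSendTokens value appended."""
    if tokens == SUSPECT_TOKENS:
        raise ValueError("pick a value other than 1000 to exercise the change")
    # Assumption: the flag takes its argument as a separate word.
    return [*base_cmd, "+IBVMaxSendTokens", str(tokens)]

# Hypothetical launch line; substitute the real charmrun/namd2 arguments.
base = ["./charmrun", "./namd2", "210stmv.namd"]

for tokens in (500, 2000):
    print(" ".join(with_send_tokens(base, tokens)))
```

If the crash disappears or moves with a different value, that points at a10274e311; if it is unchanged, the pup fix in c5c151e8 becomes the more likely suspect.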

PhilMiller commented 5 years ago

Original date: 2014-03-10 20:17:59


If you're unfamiliar, addr2line can often be used to transform a sequence of instruction addresses to a stack trace.
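For example, the bracketed addresses in the traceback can be pulled out and handed to addr2line in one go. A minimal sketch, assuming the exact binary that crashed is still available (the `./namd2` path here is hypothetical) and was built with enough debug information for addr2line to report file and line; the helper names are made up for illustration:

```python
import re

# Matches frames such as "[4530:0] [0xd64ac1]" and
# "[4530:10] __libc_start_main+0xfd [0x316f21ecdd]", capturing only the address.
FRAME_RE = re.compile(r"\[\d+:\d+\]\s+(?:\S+\s+)?\[(0x[0-9a-fA-F]+)\]")

def extract_addresses(trace):
    """Return the frame addresses from a Charm++ stack traceback, in order."""
    return FRAME_RE.findall(trace)

def addr2line_command(executable, addresses):
    """Build an addr2line invocation for the given binary and addresses."""
    # -e: binary to inspect, -f: print function names,
    # -C: demangle C++ names, -p: one human-readable line per address
    return ["addr2line", "-e", executable, "-f", "-C", "-p", *addresses]

trace = (
    "[4530] Stack Traceback: [4530:0] [0xd64ac1] [4530:1] [0xabf5cb] "
    "[4530:10] __libc_start_main+0xfd [0x316f21ecdd]"
)
cmd = addr2line_command("./namd2", extract_addresses(trace))
print(" ".join(cmd))  # pass cmd to subprocess.run to execute it
```

Addresses that belong to shared libraries, such as the `__libc_start_main` frame, will not resolve against the main executable and typically come back as `??`.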

jcphill commented 5 years ago

Original date: 2014-03-11 13:28:42


Manual rebuild from the same source runs fine. Trying new automated build today.

jcphill commented 5 years ago

Original date: 2014-03-11 17:45:08


A new automated build works fine.