charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

Segfault in global variable destructors with -memory gnu #2974

Open nitbhat opened 4 years ago

nitbhat commented 4 years ago

Both allgather and alltoall (in benchmarks/ampi/alltoall/) crash during exit with the backtrace below.

On process 0:

Starting benchmark on 2 processors with 100 iterations
  100    1024       0.022 msec,   753.627 Mbits/sec
[Partition 0][Node 0] End of program
[Thread 0x7ffff2215700 (LWP 32664) exited]
[Thread 0x7ffff5018700 (LWP 32658) exited]
[Inferior 1 (process 32653) exited normally]

On process 1:

0x00007ffff684b438 in __GI_raise (sig=sig@entry=6)
    at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff684b438 in __GI_raise (sig=sig@entry=6)
    at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff684d03a in __GI_abort () at abort.c:89
#2  0x00000000008f1f9e in dlmalloc_impl::mspace_free (
    this=0xcd0680 <global_malloc_instance_storage>, 
    msp=0xcd0720 <main_arena+64>, mem=0x7ffff7ef0c80)
    at /scratch/nitin/charm/src/conv-core/memory-gnu-internal.C:5726
#3  0x00000000008e8153 in mm_impl_free (mem=0x7ffff7ef0c80)
    at /scratch/nitin/charm/src/conv-core/memory-gnu.C:874
#4  0x00000000008e981c in mm_free (mem=0x7ffff7ef0c80)
    at /scratch/nitin/charm/src/conv-core/memory.C:734
#5  0x00000000008e9a75 in free (mem=0x7ffff7ef0c80)
    at /scratch/nitin/charm/src/conv-core/memory.C:906
#6  0x00007ffff43e1179 in deregister_handler ()
   from /scratch/nitin/openmpi-4.0.1/build/lib/openmpi/mca_pmix_pmix3x.so
#7  0x00007ffff1013322 in finalize ()
   from /scratch/nitin/openmpi-4.0.1/build/lib/openmpi/mca_errmgr_default_app.so
#8  0x00007ffff65ab326 in orte_errmgr_base_close ()
   from /scratch/nitin/openmpi-4.0.1/build/lib/libopen-rte.so.40
#9  0x00007ffff62a87d9 in mca_base_framework_close ()
   from /scratch/nitin/openmpi-4.0.1/build/lib/libopen-pal.so.40
#10 0x00007ffff65ad93a in orte_ess_base_app_finalize ()

The failure is seen only in a 2-process run. It was observed on courage with an mpi-linux-x86_64-debug build (./build LIBS mpi-linux-x86_64 --suffix=debug --enable-error-checking -j16 -g -O0).

evan-charmworks commented 4 years ago

This problem occurs when allgather and alltoall are linked with -memory gnu. I can't reproduce the error with that argument removed.