charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

mpi-win-{smp} and sometimes mpi-linux-smp crash in /entry_method_post_api/unreg/simpleZeroCopy #2310

Closed nitbhat closed 5 years ago

nitbhat commented 5 years ago

http://charm.cs.illinois.edu/autobuild/old.2019_06_13__01_01/mpi-win-x86_64.txt

evan-charmworks commented 5 years ago

I am able to catch this problem in ASan with this one-liner:

./build charm++ mpi-linux-x86_64-smp --suffix=asan -j8 -g3 -fsanitize=address && pushd mpi-linux-x86_64-smp-asan/examples/charm++/zerocopy/entry_method_post_api/unreg/simpleZeroCopy && make OPTS="-g3 -fsanitize=address" -Bj8 && mpirun -n 16 -oversubscribe xterm -e './simpleZeroCopy 32 +balancer RotateLB +setcpuaffinity ; bash'

Here is ASan's output from a run with OpenMPI and a run with MPICH:

=================================================================
==19965==ERROR: AddressSanitizer: heap-use-after-free on address 0x618000013480 at pc 0x7f6bc050977a bp 0x7ffe8526f5a0 sp 0x7ffe8526ed48
WRITE of size 840 at 0x618000013480 thread T0
    #0 0x7f6bc0509779  (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x79779)
    #1 0x7f6bbe4b52cd in opal_convertor_unpack (/usr/lib/x86_64-linux-gnu/libopen-pal.so.20+0x372cd)
    #2 0x7f6bb2fd4a3f in mca_pml_ob1_recv_frag_callback_match (/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_pml_ob1.so+0xea3f)
    #3 0x7f6bb380c51e in mca_btl_vader_poll_handle_frag (/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_btl_vader.so+0x451e)
    #4 0x7f6bb380c82d  (/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_btl_vader.so+0x482d)
    #5 0x7f6bbe4a49eb in opal_progress (/usr/lib/x86_64-linux-gnu/libopen-pal.so.20+0x269eb)
    #6 0x7f6bb2fcc902 in mca_pml_ob1_iprobe (/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_pml_ob1.so+0x6902)
    #7 0x7f6bbf7bf319 in MPI_Iprobe (/usr/lib/x86_64-linux-gnu/libmpi.so.20+0x69319)
    #8 0x5572e5fdc8b5 in PumpMsgs /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine.C:805
    #9 0x5572e5fdd76f in LrtsAdvanceCommunication(int) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine.C:1142
    #10 0x5572e5fd9eb8 in AdvanceCommunication /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-common-core.C:1616
    #11 0x5572e5fd9ed0 in CommunicationServer /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-common-core.C:1640
    #12 0x5572e5fd9f8c in CommunicationServerThread(int) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-common-core.C:1663
    #13 0x5572e5fd9e03 in ConverseRunPE /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-common-core.C:1591
    #14 0x5572e5fd9844 in ConverseInit /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-common-core.C:1491
    #15 0x5572e5cdae35 in charm_main /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/init.C:1872
    #16 0x5572e5cc6829 in main /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/main.C:5
    #17 0x7f6bbede5b96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
    #18 0x5572e5be0619 in _start (/home/evan/charm/mpi-linux-x86_64-smp-asan/examples/charm++/zerocopy/entry_method_post_api/unreg/simpleZeroCopy/simpleZeroCopy+0x2d4619)
0x618000013480 is located 0 bytes inside of 840-byte region [0x618000013480,0x6180000137c8)
freed by thread T3 here:
    #0 0x7f6bc0571490 in operator delete[](void*) (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xe1490)
    #1 0x5572e5c02968 in zerocopyObject::~zerocopyObject() (/home/evan/charm/mpi-linux-x86_64-smp-asan/examples/charm++/zerocopy/entry_method_post_api/unreg/simpleZeroCopy/simpleZeroCopy+0x2f6968)
    #2 0x5572e5c029f7 in zerocopyObject::~zerocopyObject() (/home/evan/charm/mpi-linux-x86_64-smp-asan/examples/charm++/zerocopy/entry_method_post_api/unreg/simpleZeroCopy/simpleZeroCopy+0x2f69f7)
    #3 0x5572e5d87063 in CkArray::deleteElt(unsigned long) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ckarray.h:642
    #4 0x5572e5d6f04f in CkLocMgr::emigrate(CkLocRec*, int) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/cklocation.C:3117
    #5 0x5572e5d64a30 in CkLocRec::migrateMe(int) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/cklocation.C:1891
    #6 0x5572e5d65b49 in CkLocRec::recvMigrate(int) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/cklocation.C:2027
    #7 0x5572e5d65ae2 in CkLocRec::staticMigrate(LDObjHandle, int) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/cklocation.C:2020
    #8 0x5572e5efafda in LBOM::Migrate(LDObjHandle, int) ../bin/../include/LBOM.h:37
    #9 0x5572e5ef73ff in LBDB::Migrate(LDObjHandle, int) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/LBDBManager.C:345
    #10 0x5572e5eb7167 in LDMigrate /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/lbdb.C:399
    #11 0x5572e5f3f39e in LBDatabase::Migrate(LDObjHandle, int) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/LBDatabase.h:355
    #12 0x5572e5f1f3d2 in CentralLB::ProcessReceiveMigration() /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/CentralLB.C:1190
    #13 0x5572e5f281a6 in CkIndex_CentralLB::_call_redn_wrapper_ProcessReceiveMigration_void(void*, void*) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/CentralLB.def.h:900
    #14 0x5572e5cf9871 in CkDeliverMessageFree /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:569
    #15 0x5572e5cf9d32 in _invokeEntryNoTrace /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:620
    #16 0x5572e5cfa0d2 in _invokeEntry /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:631
    #17 0x5572e5cff061 in _deliverForBocMsg /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:1089
    #18 0x5572e5cff2fe in _processForBocMsg /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:1111
    #19 0x5572e5d00830 in _processHandler(void*, CkCoreState*) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:1284
    #20 0x5572e5fe3a4a in CmiHandleMessage /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/convcore.C:1656
    #21 0x5572e5fe4804 in CsdScheduleForever /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/convcore.C:1914
    #22 0x5572e5fe45bb in CsdScheduler /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/convcore.C:1842
    #23 0x5572e5fd9e43 in ConverseRunPE /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-common-core.C:1596
    #24 0x5572e5fd25a3 in call_startfn /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-smp.C:444
    #25 0x7f6bc000a6da in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x76da)
previously allocated by thread T3 here:
    #0 0x7f6bc0570618 in operator new[](unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xe0618)
    #1 0x5572e5c03ab9 in zerocopyObject::testZeroCopy(CProxy_Main) (/home/evan/charm/mpi-linux-x86_64-smp-asan/examples/charm++/zerocopy/entry_method_post_api/unreg/simpleZeroCopy/simpleZeroCopy+0x2f7ab9)
    #2 0x5572e5be537f in CkIndex_zerocopyObject::_call_testZeroCopy_marshall2(void*, void*) /home/evan/charm/mpi-linux-x86_64-smp-asan/examples/charm++/zerocopy/entry_method_post_api/unreg/simpleZeroCopy/simpleZeroCopy.def.h:1226
    #3 0x5572e5cf9b6f in CkDeliverMessageReadonly /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:603
    #4 0x5572e5d65565 in CkLocRec::invokeEntry(CkMigratable*, void*, int, bool) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/cklocation.C:1980
    #5 0x5572e5d0a8f4 in CkMigratable::ckInvokeEntry(int, void*, bool) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ckmigratable.h:79
    #6 0x5572e5e16edc in CkArrayBroadcaster::deliver(CkArrayMessage*, ArrayElement*, bool) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ckarray.C:1348
    #7 0x5572e5e189e7 in CkArray::recvBroadcast(CkMessage*) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ckarray.C:1611
    #8 0x5572e5e2073c in CkIndex_CkArray::_call_recvBroadcast_CkMessage(void*, void*) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/CkArray.def.h:1136
    #9 0x5572e5cf9871 in CkDeliverMessageFree /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:569
    #10 0x5572e5cf9d32 in _invokeEntryNoTrace /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:620
    #11 0x5572e5cfa3fc in _invokeEntry /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:638
    #12 0x5572e5cff061 in _deliverForBocMsg /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:1089
    #13 0x5572e5cff2fe in _processForBocMsg /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:1111
    #14 0x5572e5d00830 in _processHandler(void*, CkCoreState*) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:1284
    #15 0x5572e5cd3c6a in _processBufferedMsgs /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/init.C:826
    #16 0x5572e5cd4119 in _initDone() /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/init.C:890
    #17 0x5572e5cd4de3 in checkForInitDone(bool) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/init.C:1038
    #18 0x5572e5cd558b in _initHandler /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/init.C:1105
    #19 0x5572e5fe3a4a in CmiHandleMessage /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/convcore.C:1656
    #20 0x5572e5fe4804 in CsdScheduleForever /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/convcore.C:1914
    #21 0x5572e5fe45bb in CsdScheduler /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/convcore.C:1842
    #22 0x5572e5fd9e43 in ConverseRunPE /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-common-core.C:1596
    #23 0x5572e5fd25a3 in call_startfn /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-smp.C:444
    #24 0x7f6bc000a6da in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x76da)
Thread T3 created by T0 here:
    #0 0x7f6bc04c7d2f in __interceptor_pthread_create (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x37d2f)
    #1 0x5572e5fd2ade in CmiStartThreads /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-smp.C:554
    #2 0x5572e5fd9837 in ConverseInit /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-common-core.C:1489
    #3 0x5572e5cdae35 in charm_main /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/init.C:1872
    #4 0x5572e5cc6829 in main /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/main.C:5
    #5 0x7f6bbede5b96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
SUMMARY: AddressSanitizer: heap-use-after-free (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x79779) 
Shadow bytes around the buggy address:
  0x0c307fffa640: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c307fffa650: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c307fffa660: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c307fffa670: 00 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa
  0x0c307fffa680: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c307fffa690:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c307fffa6a0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c307fffa6b0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c307fffa6c0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c307fffa6d0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c307fffa6e0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==19965==ABORTING
=================================================================
==13423==ERROR: AddressSanitizer: heap-use-after-free on address 0x618000015c80 at pc 0x7f7b7368277a bp 0x7ffe997ce500 sp 0x7ffe997cdca8
WRITE of size 840 at 0x618000015c80 thread T0
    #0 0x7f7b73682779  (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x79779)
    #1 0x7f7b728217e0  (/usr/lib/x86_64-linux-gnu/libmpich.so.0+0x1167e0)
    #2 0x7f7b72861ede  (/usr/lib/x86_64-linux-gnu/libmpich.so.0+0x156ede)
    #3 0x7f7b7287abd9  (/usr/lib/x86_64-linux-gnu/libmpich.so.0+0x16fbd9)
    #4 0x7f7b72866d70  (/usr/lib/x86_64-linux-gnu/libmpich.so.0+0x15bd70)
    #5 0x7f7b727b95f0 in PMPI_Iprobe (/usr/lib/x86_64-linux-gnu/libmpich.so.0+0xae5f0)
    #6 0x559610f4e7ea in PumpMsgs /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine.C:805
    #7 0x559610f4f699 in LrtsAdvanceCommunication(int) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine.C:1142
    #8 0x559610f4bdf8 in AdvanceCommunication /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-common-core.C:1616
    #9 0x559610f4be10 in CommunicationServer /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-common-core.C:1640
    #10 0x559610f4becc in CommunicationServerThread(int) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-common-core.C:1663
    #11 0x559610f4bd43 in ConverseRunPE /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-common-core.C:1591
    #12 0x559610f4b784 in ConverseInit /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-common-core.C:1491
    #13 0x559610c4cd75 in charm_main /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/init.C:1872
    #14 0x559610c38769 in main /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/main.C:5
    #15 0x7f7b71d9ab96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
    #16 0x559610b52559 in _start (/home/evan/charm/mpi-linux-x86_64-smp-asan/examples/charm++/zerocopy/entry_method_post_api/unreg/simpleZeroCopy/simpleZeroCopy+0x2d4559)
0x618000015c80 is located 0 bytes inside of 840-byte region [0x618000015c80,0x618000015fc8)
freed by thread T1 here:
    #0 0x7f7b736ea490 in operator delete[](void*) (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xe1490)
    #1 0x559610b748a8 in zerocopyObject::~zerocopyObject() (/home/evan/charm/mpi-linux-x86_64-smp-asan/examples/charm++/zerocopy/entry_method_post_api/unreg/simpleZeroCopy/simpleZeroCopy+0x2f68a8)
    #2 0x559610b74937 in zerocopyObject::~zerocopyObject() (/home/evan/charm/mpi-linux-x86_64-smp-asan/examples/charm++/zerocopy/entry_method_post_api/unreg/simpleZeroCopy/simpleZeroCopy+0x2f6937)
    #3 0x559610cf8fa3 in CkArray::deleteElt(unsigned long) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ckarray.h:642
    #4 0x559610ce0f8f in CkLocMgr::emigrate(CkLocRec*, int) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/cklocation.C:3117
    #5 0x559610cd6970 in CkLocRec::migrateMe(int) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/cklocation.C:1891
    #6 0x559610cd7a89 in CkLocRec::recvMigrate(int) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/cklocation.C:2027
    #7 0x559610cd7a22 in CkLocRec::staticMigrate(LDObjHandle, int) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/cklocation.C:2020
    #8 0x559610e6cf1a in LBOM::Migrate(LDObjHandle, int) ../bin/../include/LBOM.h:37
    #9 0x559610e6933f in LBDB::Migrate(LDObjHandle, int) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/LBDBManager.C:345
    #10 0x559610e290a7 in LDMigrate /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/lbdb.C:399
    #11 0x559610eb12de in LBDatabase::Migrate(LDObjHandle, int) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/LBDatabase.h:355
    #12 0x559610e91312 in CentralLB::ProcessReceiveMigration() /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/CentralLB.C:1190
    #13 0x559610e9a0e6 in CkIndex_CentralLB::_call_redn_wrapper_ProcessReceiveMigration_void(void*, void*) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/CentralLB.def.h:900
    #14 0x559610c6b7b1 in CkDeliverMessageFree /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:569
    #15 0x559610c6bc72 in _invokeEntryNoTrace /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:620
    #16 0x559610c6c012 in _invokeEntry /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:631
    #17 0x559610c70fa1 in _deliverForBocMsg /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:1089
    #18 0x559610c7123e in _processForBocMsg /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:1111
    #19 0x559610c72770 in _processHandler(void*, CkCoreState*) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:1284
    #20 0x559610f55950 in CmiHandleMessage /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/convcore.C:1656
    #21 0x559610f5670a in CsdScheduleForever /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/convcore.C:1914
    #22 0x559610f564c1 in CsdScheduler /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/convcore.C:1842
    #23 0x559610f4bd83 in ConverseRunPE /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-common-core.C:1596
    #24 0x559610f444e3 in call_startfn /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-smp.C:444
    #25 0x7f7b731836da in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x76da)
previously allocated by thread T1 here:
    #0 0x7f7b736e9618 in operator new[](unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xe0618)
    #1 0x559610b74189 in zerocopyObject::pup(PUP::er&) (/home/evan/charm/mpi-linux-x86_64-smp-asan/examples/charm++/zerocopy/entry_method_post_api/unreg/simpleZeroCopy/simpleZeroCopy+0x2f6189)
    #2 0x559610b8273a in recursive_pup_impl<zerocopyObject, 1>::operator()(zerocopyObject*, PUP::er&) (/home/evan/charm/mpi-linux-x86_64-smp-asan/examples/charm++/zerocopy/entry_method_post_api/unreg/simpleZeroCopy/simpleZeroCopy+0x30473a)
    #3 0x559610b80079 in void recursive_pup<zerocopyObject>(zerocopyObject*, PUP::er&) (/home/evan/charm/mpi-linux-x86_64-smp-asan/examples/charm++/zerocopy/entry_method_post_api/unreg/simpleZeroCopy/simpleZeroCopy+0x302079)
    #4 0x559610b6a942 in CBaseT1<ArrayElementT<int>, CProxy_zerocopyObject>::virtual_pup(PUP::er&) /home/evan/charm/mpi-linux-x86_64-smp-asan/examples/charm++/zerocopy/entry_method_post_api/unreg/simpleZeroCopy/simpleZeroCopy.def.h:3751
    #5 0x559610cdfc4b in CkLocMgr::pupElementsFor(PUP::er&, CkLocRec*, CkElementCreation_t, bool) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/cklocation.C:2976
    #6 0x559610ce15e4 in CkLocMgr::immigrate(CkArrayElementMigrateMessage*) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/cklocation.C:3176
    #7 0x559610ce6a84 in CkIndex_CkLocMgr::_call_immigrate_CkArrayElementMigrateMessage(void*, void*) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/CkLocation.def.h:683
    #8 0x559610c6b7b1 in CkDeliverMessageFree /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:569
    #9 0x559610c6bc72 in _invokeEntryNoTrace /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:620
    #10 0x559610c6c33c in _invokeEntry /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:638
    #11 0x559610c70fa1 in _deliverForBocMsg /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:1089
    #12 0x559610c7123e in _processForBocMsg /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:1111
    #13 0x559610c72770 in _processHandler(void*, CkCoreState*) /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/ck.C:1284
    #14 0x559610f55950 in CmiHandleMessage /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/convcore.C:1656
    #15 0x559610f5670a in CsdScheduleForever /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/convcore.C:1914
    #16 0x559610f564c1 in CsdScheduler /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/convcore.C:1842
    #17 0x559610f4bd83 in ConverseRunPE /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-common-core.C:1596
    #18 0x559610f444e3 in call_startfn /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-smp.C:444
    #19 0x7f7b731836da in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x76da)
Thread T1 created by T0 here:
    #0 0x7f7b73640d2f in __interceptor_pthread_create (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x37d2f)
    #1 0x559610f44a1e in CmiStartThreads /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-smp.C:554
    #2 0x559610f4b777 in ConverseInit /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/machine-common-core.C:1489
    #3 0x559610c4cd75 in charm_main /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/init.C:1872
    #4 0x559610c38769 in main /home/evan/charm/mpi-linux-x86_64-smp-asan/tmp/main.C:5
    #5 0x7f7b71d9ab96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
SUMMARY: AddressSanitizer: heap-use-after-free (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x79779) 
Shadow bytes around the buggy address:
  0x0c307fffab40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c307fffab50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c307fffab60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c307fffab70: 00 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa
  0x0c307fffab80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c307fffab90:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c307fffaba0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c307fffabb0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c307fffabc0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c307fffabd0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c307fffabe0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==13423==ABORTING
nitbhat commented 5 years ago

The failure is caused by the incorrect and limited sdag support in the ZC Post API.

Currently, the Post Entry Method doesn't match sdag tags and issues Rgets directly. The sdag tag matching is performed only after the Rget has been completed in the actual entry method.

The failure happens during or after the load balancing phase, when the sender has migrated and resumed (and has sent iteration i+1 to the receiver) while the receiver has reached AtSync for the ith load-balancing iteration but has not yet migrated. On receiving the i+1 metadata message, the receiver posts a buffer and issues an Rget. Since the load balancing phase is still in progress, the receiver migrates the object before the Rget completes, and in that process frees the posted buffer.

Since the posted buffer has been freed, the write performed under MPI_Iprobe triggers the heap-use-after-free error, because the same buffer is used across iterations.

For 6.10, I think it's better not to support sdag entry methods with the ZC Post API, and instead to print a suitable error message (and add a note to the documentation), because supporting them will require significant changes.

To support sdag entry methods with the ZC Post API, the Post Entry Method will also have to use sdag matching and buffer the received metadata message (wherever applicable).
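
As a rough illustration of what "match before posting" could look like, here is a plain C++ model (not Charm++ code). It assumes each metadata message carries the sender's iteration number; `Metadata`, `MatchingPoster`, `onMetadata`, and `advanceTo` are hypothetical names invented for this sketch, and "posting" only records the message instead of issuing an Rget.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for the metadata message that announces a zero-copy
// buffer; the real runtime message carries more than an iteration tag.
struct Metadata {
  int iteration;      // iteration the sender was in when it sent the buffer
  std::size_t bytes;  // size of the announced payload
};

// Model of a post step that matches iteration tags before posting.
// "Posting" here just appends the metadata to `posted`.
class MatchingPoster {
 public:
  // Called when a metadata message arrives. Returns true if the buffer was
  // posted immediately, false if the message was buffered for later.
  bool onMetadata(const Metadata& m) {
    if (m.iteration == current_) {
      posted.push_back(m);
      return true;
    }
    pending_.emplace(m.iteration, m);
    return false;
  }

  // Called when the receiver is ready for `iteration` (for example, after it
  // has finished migrating). Posts every message buffered for that iteration
  // and returns how many were posted.
  std::size_t advanceTo(int iteration) {
    current_ = iteration;
    auto range = pending_.equal_range(iteration);
    std::size_t n = 0;
    for (auto it = range.first; it != range.second; ++it, ++n) {
      posted.push_back(it->second);
    }
    pending_.erase(range.first, range.second);
    return n;
  }

  std::size_t pendingCount() const { return pending_.size(); }

  std::vector<Metadata> posted;  // metadata for which a buffer was posted

 private:
  int current_ = 0;                       // iteration the receiver expects
  std::multimap<int, Metadata> pending_;  // early messages, keyed by iteration
};
```

In this model, the i+1 message that arrives while the receiver is still in iteration i stays in `pending_`, so nothing is posted into a buffer that migration is about to free.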

nitbhat commented 5 years ago

In a discussion with Sanjay (@lvkale), we decided that the following options are also possible:

  1. Ensure that the sender doesn't send the buffer unless migration has completed on all PEs. This ensures that the receiver won't post the buffer until migration has completed. (This can be done using a reduction following ResumeFromSync.)

  2. Allow users to post "later" by providing RTS constructs that let them store the received metadata and retrieve it at a later time. This can be useful on the receiver side when the object has to migrate and can post the buffer only after migration. This is also useful for AMPI, where @stwhite91 had brought up a use case for delayed posting, i.e., cases where the receiver buffer is not yet ready to be posted.

  3. The third alternative (which seems to be the most involved) is the one described in the previous comment. Here, the sdag logic can be made available to the Post EM in order to not post until the iteration matching is performed.
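
To make option 2 a bit more concrete, here is a small plain C++ model of a "store now, post later" construct. It is only a sketch of the idea: `StoredMetadata`, `DeferredPostStore`, `store`, and `retrieve` are hypothetical names, not existing RTS calls, and a real version would presumably have to travel with the object when it migrates.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical description of an incoming transfer whose receive buffer has
// not been posted yet.
struct StoredMetadata {
  int sender;         // who announced the buffer
  std::size_t bytes;  // size of the announced payload
};

// Model of an RTS-side store that lets the user delay posting.
class DeferredPostStore {
 public:
  // Keep the metadata instead of posting a buffer now.
  // Returns a tag the user can use to find it again.
  int store(const StoredMetadata& m) {
    const int tag = nextTag_++;
    stored_.emplace(tag, m);
    return tag;
  }

  // Hand the metadata back so the caller can post a buffer for it now.
  // Returns false if the tag is unknown or was already retrieved.
  bool retrieve(int tag, StoredMetadata& out) {
    auto it = stored_.find(tag);
    if (it == stored_.end()) return false;
    out = it->second;
    stored_.erase(it);
    return true;
  }

  std::size_t size() const { return stored_.size(); }

 private:
  int nextTag_ = 0;
  std::unordered_map<int, StoredMetadata> stored_;
};
```

The receiver would call `store` when metadata arrives before it is ready (for example, while it is waiting to migrate) and call `retrieve` once a valid buffer exists.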

For 6.10, I think I can modify the example based on option 1 and add user documentation for it. We can implement options 2 and 3 in the future.
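
For reference, the gating in option 1 can be modeled in plain C++ as below. In the actual example this would presumably be an empty reduction (a `contribute` call with a callback) issued from `ResumeFromSync`, with the sender starting the next iteration only from that callback; the `ResumeGate` class and its members here are hypothetical names used to show the counting logic, not Charm++ API.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Plain C++ model of "wait for a reduction over all PEs before sending".
class ResumeGate {
 public:
  ResumeGate(int numPes, std::function<void()> onAllResumed)
      : numPes_(numPes), onAllResumed_(std::move(onAllResumed)) {}

  // Each PE calls this once after it has resumed from load balancing.
  // The callback fires when the last PE has checked in, and the gate resets
  // so it can be reused for the next load-balancing step.
  void contribute() {
    if (++arrived_ == numPes_) {
      arrived_ = 0;
      onAllResumed_();
    }
  }

  int arrived() const { return arrived_; }

 private:
  int numPes_;
  int arrived_ = 0;
  std::function<void()> onAllResumed_;
};
```

The sender's next zero-copy send would go inside the callback, so no metadata message can reach a receiver that has not finished migrating.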