charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.

tests/ampi/megampi crashes in MPI_Comm_free #2030

Closed evan-charmworks closed 5 years ago

evan-charmworks commented 5 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/2030


This failure shows up in autobuild every few days.

../../../bin/testrun  ./pgm +p2 +vp4  

Running on 2 processors:  ./pgm +vp4 
charmrun> /cygdrive/c/Program Files/Microsoft MPI/Bin/mpiexec -n 2  ./pgm +vp4 

Charm++> Running on MPI version: 2.0
Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired: MPI_THREAD_FUNNELED)
Charm++> Running in SMP mode: 2 processes, 1 worker threads (PEs) + 1 comm threads per process, 2 PEs total
Charm++> The comm. thread both sends and receives messages
Charm++ warning> fences and atomic operations not available in native assembly
Converse/Charm++ Commit ID: v6.9.0-0-gc3d50ef
Charm++> Disabling isomalloc because mmap() does not work.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (1 sockets x 4 cores x 2 PUs = 8-way SMP)
Charm++> cpu topology info is gathered in 0.016 seconds.
CharmLB> RandCentLB created.

job aborted:
[ranks] message

[0] terminated

[1] process exited without calling finalize

---- error analysis -----

[1] on CS-DEXTERITY
./pgm ended prematurely and may have crashed. exit code 0xc0000005

---- error analysis -----
make[3]: *** [Makefile:24: test] Error 127
make[2]: *** [Makefile:38: test-megampi] Error 2
make[1]: *** [Makefile:37: test-ampi] Error 2
make: *** [Makefile:165: test] Error 2
make[3]: Leaving directory '/home/nikhil/autobuild/mpi-win-x86_64-smp/charm/mpi-win-x86_64-smp/tests/ampi/megampi'

http://charm.cs.illinois.edu/autobuild/old.2018_11_06__01_01/mpi-win-x86_64-smp.txt
http://charm.cs.illinois.edu/autobuild/old.2018_11_10__01_01/mpi-win-x86_64-smp.txt
http://charm.cs.illinois.edu/autobuild/old.2018_11_14__01_01/mpi-win-x86_64-smp.txt

tsan.log.3262

evan-charmworks commented 5 years ago

Original date: 2019-01-11 20:11:00


I think this showed up in a multicore-win-x86_64 build today, in addition to mpi-win-x86_64-smp:

http://charm.cs.illinois.edu/autobuild/old.2019_01_11__01_07/multicore-win-x86_64.txt
http://charm.cs.illinois.edu/autobuild/old.2019_01_11__01_07/mpi-win-x86_64-smp.txt

stwhite91 commented 5 years ago

Original date: 2019-02-07 20:08:25


It happened on mpi-win-smp today, and it generally seems to happen somewhat frequently, though not every time.

evan-charmworks commented 5 years ago

Original date: 2019-03-06 17:11:47


I managed to catch this crash in Visual Studio's debugger. ampi::getRank() is called with `this` pointing to garbage.

pgm.exe!ampi::getRank() Line 2588
    at tmp\libs\ck-libs\ampi\ampiimpl.h(2588)
pgm.exe!MPI_Comm_free(int * comm) Line 8994
    at tmp\libs\ck-libs\ampi\ampi.c(8994)
pgm.exe!AMPI_Main_cpp(int argc, char * * argv) Line 494
    at tests\ampi\megampi\test.c(494)
pgm.exe!AMPI_Fallback_Main(int argc, char * * argv) Line 830
    at tmp\libs\ck-libs\ampi\ampi.c(830)
pgm.exe!MPI_threadstart_t::start() Line 1059
    at tmp\libs\ck-libs\ampi\ampi.c(1059)
pgm.exe!AMPI_threadstart(void * data) Line 1076
    at tmp\libs\ck-libs\ampi\ampi.c(1076)
pgm.exe!startTCharmThread(TCharmInitMsg * msg) Line 164
    at tmp\libs\ck-libs\tcharm\tcharm.c(164)
pgm.exe!FiberSetUp(void * fiberData) Line 1371
    at tmp\threads.c(1371)
[External Code]

I tried the following change to help diagnose the problem:

diff --git a/src/libs/ck-libs/ampi/ampi.C b/src/libs/ck-libs/ampi/ampi.C
index dad98cf50..8f32e8e77 100644
--- a/src/libs/ck-libs/ampi/ampi.C
+++ b/src/libs/ck-libs/ampi/ampi.C
@@ -8990,7 +8990,9 @@ AMPI_API_IMPL(int, MPI_Comm_free, MPI_Comm *comm)
     //ret = parent->freeUserKeyvals(*comm, parent->getKeyvals(*comm));
     if (*comm != MPI_COMM_WORLD && *comm != MPI_COMM_SELF) {
       ampi* ptr = getAmpiInstance(*comm);
+      CmiEnforce(*comm == ptr->getCommStruct().getComm()); // assertion 1
       ptr->barrier();
+      CmiEnforce(*comm == ptr->getCommStruct().getComm()); // assertion 2
       if (ptr->getRank() == 0) {
         CProxy_CkArray(ptr->ckGetArrayID()).ckDestroy();
       }

The odd thing is that assertion 1 succeeds but assertion 2 fails.

ptr->barrier() calls thread->suspend(), which calls CthSuspend(). I suspect the problem is there.

Alternatively, there is the following comment in tcharm_impl.h:

        /* SUBTLE: We have to do the get() because "this" may have changed
         * during a migration-suspend.  If you access *any* members
         * from this point onward, you'll cause heap corruption if
         * we're resuming from migration!  (OSL 2003/9/23) */

I tried changing assertion 2 to CmiEnforce(*comm == getAmpiInstance(*comm)->getCommStruct().getComm()); but it still failed, just with a null pointer dereference here:

pgm.exe!CkArray::lookup(const CkArrayIndex & idx) Line 595
    at tmp\ckarray.h(595)
pgm.exe!CProxyElement_ArrayBase::ckLocal() Line 743
    at tmp\ckarray.c(743)
pgm.exe!CProxyElement_ArrayElement::ckLocal() Line 1031
    at include\ckarray.decl.h(1031)
pgm.exe!CProxyElement_ampi::ckLocal() Line 1650
    at tmp\libs\ck-libs\ampi\ampi.decl.h(1650)
pgm.exe!ampiParent::comm2ampi(int comm) Line 2170
    at tmp\libs\ck-libs\ampi\ampiimpl.h(2170)
pgm.exe!getAmpiInstance(int comm) Line 3799
    at tmp\libs\ck-libs\ampi\ampi.c(3799)
pgm.exe!MPI_Comm_free(int * comm) Line 8996
    at tmp\libs\ck-libs\ampi\ampi.c(8996)
pgm.exe!AMPI_Main_cpp(int argc, char * * argv) Line 494
    at tests\ampi\megampi\test.c(494)
pgm.exe!AMPI_Fallback_Main(int argc, char * * argv) Line 830
    at tmp\libs\ck-libs\ampi\ampi.c(830)
pgm.exe!MPI_threadstart_t::start() Line 1059
    at tmp\libs\ck-libs\ampi\ampi.c(1059)
pgm.exe!AMPI_threadstart(void * data) Line 1076
    at tmp\libs\ck-libs\ampi\ampi.c(1076)
pgm.exe!startTCharmThread(TCharmInitMsg * msg) Line 164
    at tmp\libs\ck-libs\tcharm\tcharm.c(164)
pgm.exe!FiberSetUp(void * fiberData) Line 1371
    at tmp\threads.c(1371)
[External Code]

I am more inclined to believe the problem is in CthSuspend(), because this failure does not always occur and it only occurs on Windows.
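
For reference, here is a minimal sketch of the pattern the tcharm_impl.h comment describes; it is illustrative only, and the class and method names are hypothetical. Any code that suspends the user-level thread has to re-fetch its object pointer afterwards rather than trusting this:

// Illustrative sketch only, not code from the repository: re-do the get()
// after a migration-suspend, per the tcharm_impl.h comment quoted above.
void SomeTCharmClient::blockingStep()   // hypothetical class/method
{
  TCharm *tc = TCharm::get();   // the calling thread's TCharm object
  tc->suspend();                // may block; the thread can migrate here
  tc = TCharm::get();           // refresh; the old 'tc' (and 'this') may be stale
  // Only the refreshed pointer is safe to use from this point on.
}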

evan-charmworks commented 5 years ago

Original date: 2019-03-15 19:32:07


I tried running megampi on Linux with ThreadSanitizer and the list of data races was substantial. Some of them look like candidates for this issue, including AMPI implementation details relevant to the failure seen on Windows.

./build AMPI multicore-linux-x86_64 tsan -j8 -g3 -fsanitize=thread && cd multicore-linux-x86_64-tsan/tests/ampi/megampi/ && make -j8 OPTS="-g3 -fsanitize=thread" && TSAN_OPTIONS='log_path=tsan.log' ./pgm +p4 +vp2 +tcharm_nomig +noisomalloc
stwhite91 commented 5 years ago

Original date: 2019-03-15 20:00:16


Can you post the tsan output here?

evan-charmworks commented 5 years ago

Original date: 2019-03-15 22:58:30


I ran megampi on Windows with a Microsoft tool called "Application Verifier" (https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/application-verifier), and it pointed out the two additional problems below, but I'm not sure either can be blamed for this issue.

  1. "Invalid TLS index used for current stack trace."
        <avrf:logEntry Time="2019-03-15 : 17:46:57" LayerName="Handles" StopCode="0x301" Severity="Error">
            <avrf:message>Invalid TLS index used for current stack trace.</avrf:message>
            <avrf:parameter1>ffffffff - Invalid TLS index.</avrf:parameter1>
            <avrf:parameter2>abba - Expected lower part of the index.</avrf:parameter2>
            <avrf:parameter3>0 - Not used.</avrf:parameter3>
            <avrf:parameter4>0 - Not used.</avrf:parameter4>
            <avrf:stackTrace>
                <avrf:trace>vfbasics!+7ffe620caef9 ( ` 0)</avrf:trace>
                <avrf:trace>vfbasics!+7ffe620cb12f ( ` 0)</avrf:trace>
                <avrf:trace>pgm!CmiGetState+10 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\machine-smp.c ` 115)</avrf:trace>
                <avrf:trace>pgm!CmiMyPe+9 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\machine-common-core.c ` 399)</avrf:trace>
                <avrf:trace>pgm!CmiAddCLA+18 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\convcore.c ` 325)</avrf:trace>
                <avrf:trace>pgm!CmiGetArgFlagDesc+29 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\convcore.c ` 579)</avrf:trace>
                <avrf:trace>pgm!CmiGetArgFlag+24 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\convcore.c ` 589)</avrf:trace>
                <avrf:trace>pgm!ConverseInit+2e (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\machine-common-core.c ` 1197)</avrf:trace>
                <avrf:trace>pgm!charm_main+41 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\init.c ` 1713)</avrf:trace>
                <avrf:trace>pgm!main+1b (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\main.c ` 6)</avrf:trace>
                <avrf:trace>pgm!invoke_main+34 (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl ` 79)</avrf:trace>
                <avrf:trace>pgm!__scrt_common_main_seh+12e (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl ` 288)</avrf:trace>
                <avrf:trace>pgm!__scrt_common_main+e (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl ` 331)</avrf:trace>
                <avrf:trace>pgm!mainCRTStartup+9 (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_main.cpp ` 17)</avrf:trace>
                <avrf:trace>KERNEL32!BaseThreadInitThunk+14 ( ` 0)</avrf:trace>
                <avrf:trace>ntdll!RtlUserThreadStart+21 ( ` 0)</avrf:trace>
            </avrf:stackTrace>
        </avrf:logEntry>
CmiState CmiGetState(void)
{
  CmiState result;
  result = (CmiState)TlsGetValue(Cmi_state_key); // Cmi_state_key is 0xFFFFFFFF here
  1. "NULL handle passed as parameter. A valid handle must be used."
        <avrf:logEntry Time="2019-03-15 : 17:48:34" LayerName="Handles" StopCode="0x303" Severity="Error">
            <avrf:message>NULL handle passed as parameter. A valid handle must be used.</avrf:message>
            <avrf:parameter1>0 - Not used.</avrf:parameter1>
            <avrf:parameter2>0 - Not used.</avrf:parameter2>
            <avrf:parameter3>0 - Not used.</avrf:parameter3>
            <avrf:parameter4>0 - Not used.</avrf:parameter4>
            <avrf:stackTrace>
                <avrf:trace>vfbasics!+7ffe620b3138 ( ` 0)</avrf:trace>
                <avrf:trace>vfbasics!+7ffe620c5847 ( ` 0)</avrf:trace>
                <avrf:trace>KERNELBASE!WaitForSingleObjectEx+a2 ( ` 0)</avrf:trace>
                <avrf:trace>vfbasics!+7ffe620c5342 ( ` 0)</avrf:trace>
                <avrf:trace>vfbasics!+7ffe620c53c8 ( ` 0)</avrf:trace>
                <avrf:trace>vfbasics!+7ffe620c5342 ( ` 0)</avrf:trace>
                <avrf:trace>vfbasics!+7ffe620c53a5 ( ` 0)</avrf:trace>
                <avrf:trace>pgm!LrtsLock+19 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\machine-common-core.c ` 1975)</avrf:trace>
                <avrf:trace>pgm!CmiArgInit+15 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\convcore.c ` 372)</avrf:trace>
                <avrf:trace>pgm!ConverseCommonInit+34d (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\convcore.c ` 3816)</avrf:trace>
                <avrf:trace>pgm!ConverseRunPE+39c (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\machine-common-core.c ` 1578)</avrf:trace>
                <avrf:trace>pgm!ConverseInit+66f (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\machine-common-core.c ` 1500)</avrf:trace>
                <avrf:trace>pgm!charm_main+41 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\init.c ` 1713)</avrf:trace>
                <avrf:trace>pgm!main+1b (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\main.c ` 6)</avrf:trace>
                <avrf:trace>pgm!invoke_main+34 (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl ` 79)</avrf:trace>
                <avrf:trace>pgm!__scrt_common_main_seh+12e (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl ` 288)</avrf:trace>
                <avrf:trace>pgm!__scrt_common_main+e (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl ` 331)</avrf:trace>
                <avrf:trace>pgm!mainCRTStartup+9 (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_main.cpp ` 17)</avrf:trace>
                <avrf:trace>KERNEL32!BaseThreadInitThunk+14 ( ` 0)</avrf:trace>
                <avrf:trace>ntdll!RtlUserThreadStart+21 ( ` 0)</avrf:trace>
            </avrf:stackTrace>
        </avrf:logEntry>
void CmiArgInit(char **argv) {
    int i;
    CmiLock(_smp_mutex); // _smp_mutex is null here
evan-charmworks commented 5 years ago

Original date: 2019-03-20 19:37:36


These global variables in TCharm and AMPI are potential candidates for causing this issue due to data races:

static mpi_comm_worlds mpi_worlds;
int _mpi_nworlds;

static CProxy_ampiWorlds ampiWorldsGroup;

CtvExtern(TCharm *,_curTCharm);
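
As a hedged illustration of the concern (not a proposed patch), an unguarded update to a process-wide global such as _mpi_nworlds can race between SMP worker threads; serializing it with a Converse node lock is one conventional remedy. The lock and helper names below are hypothetical:

// Hypothetical illustration only: guard writes to a shared global with a
// Converse node lock so that SMP worker threads cannot race on it.
static CmiNodeLock mpi_worlds_lock;        // hypothetical; created once via CmiCreateLock()

static void registerWorldGuarded(void) {   // hypothetical helper
  CmiLock(mpi_worlds_lock);
  _mpi_nworlds++;                          // updated by one thread at a time
  CmiUnlock(mpi_worlds_lock);
}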
evan-charmworks commented 5 years ago

Original date: 2019-04-19 18:37:54


It looks like the original failure on mpi-win-x86_64-smp and multicore-win-x86_64 is due to two compounding problems. One is that AMPI's MPI_Comm_free does not properly refresh its ampi * pointer after calling a barrier, during which migration might take place. This patch fixes this simple oversight and cleans up pointer refreshing after migration across all of AMPI: https://charm.cs.illinois.edu/gerrit/c/charm/+/5095
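
As a minimal sketch of the refresh-after-barrier pattern described above (not the exact Gerrit change), the cached ampi * has to be looked up again after any call that may block and let migration occur:

// Sketch of the pattern the patch applies; details may differ from the
// actual change under review.
if (*comm != MPI_COMM_WORLD && *comm != MPI_COMM_SELF) {
  ampi* ptr = getAmpiInstance(*comm);
  ptr->barrier();                  // may suspend; the rank can migrate here
  ptr = getAmpiInstance(*comm);    // refresh the pointer after resuming
  if (ptr->getRank() == 0) {
    CProxy_CkArray(ptr->ckGetArrayID()).ckDestroy();
  }
}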

With this patch in place, a second issue is exposed: After migration, sometimes CProxyElement_ArrayBase::ckLocalBranch() returns null during the call to getAmpiInstance. Fortunately this issue is easily reproducible on Linux and macOS in addition to Windows.

A one-liner to do this is:

./build AMPI netlrts-linux-x86_64-smp -j8 -g3 && pushd netlrts-linux-x86_64-smp/tests/ampi/megampi && make -j8 OPTS="-g3" && ./charmrun ./pgm +p2 +vp4 ++local ++debug-no-pause +isomalloc_sync +CmiSleepOnIdle

Swap in +vp2 for +vp4 to crash in a different location in megampi, and swap linux for darwin to reproduce the same crash on macOS.

+p2 +vp2:

Thread 1 "pgm" received signal SIGSEGV, Segmentation fault.
0x000055555583085c in CkArray::lookup (this=0x0, idx=...) at ckarray.h:595
595         if (locMgr->lookupID(idx,id)) {
(gdb) bt
#0  0x000055555583085c in CkArray::lookup (this=0x0, idx=...) at ckarray.h:595
#1  0x000055555588a2a0 in CProxyElement_ArrayBase::ckLocal (this=0x2aaaa0903880) at ckarray.C:742
#2  0x0000555555775646 in CProxyElement_ArrayElement::ckLocal (this=0x2aaaa0903880) at ../../../../bin/../include/CkArray.decl.h:1031
#3  0x00005555557c2766 in CProxyElement_ampi::ckLocal (this=0x2aaaa0903880) at ampi.decl.h:1650
#4  0x00005555557c58a4 in ampiParent::comm2ampi (this=0x555555d46900, comm=1000003) at ampiimpl.h:2162
#5  0x000055555578e377 in getAmpiInstance (comm=1000003) at ampi.C:3787
#6  0x00005555557907f0 in ampi::suspend (this=0x555555dcd4d0) at ampi.C:4569
#7  0x0000555555790781 in ampi::barrier (this=0x555555dcd4d0) at ampi.C:4557
#8  0x00005555557a20e8 in MPI_Comm_free (comm=0x2aaaa0903e34) at ampi.C:9014
#9  0x000055555576e42f in AMPI_Main_cpp (argc=1, argv=0x555555d49320) at test.C:490
#10 0x0000555555784994 in AMPI_Fallback_Main (argc=1, argv=0x555555d49320) at ampi.C:829
#11 0x00005555557c711d in MPI_threadstart_t::start (this=0x2aaaa0903f68) at ampi.C:1055
#12 0x000055555578518e in AMPI_threadstart (data=0x555555d45e00) at ampi.C:1075
#13 0x000055555576efad in startTCharmThread (msg=0x555555d45de0) at tcharm.C:163
#14 0x0000555555921d4f in CthStartThread (arg=...) at threads.c:1784
#15 0x000055555592220f in make_fcontext ()
#16 0x0000000000000000 in ?? ()

+p2 +vp4:

Thread 1 "pgm" received signal SIGSEGV, Segmentation fault.
0x000055555583085c in CkArray::lookup (this=0x0, idx=...) at ckarray.h:595
595         if (locMgr->lookupID(idx,id)) {
(gdb) bt
#0  0x000055555583085c in CkArray::lookup (this=0x0, idx=...) at ckarray.h:595
#1  0x000055555588a2a0 in CProxyElement_ArrayBase::ckLocal (this=0x2aaaa0903880) at ckarray.C:742
#2  0x0000555555775646 in CProxyElement_ArrayElement::ckLocal (this=0x2aaaa0903880) at ../../../../bin/../include/CkArray.decl.h:1031
#3  0x00005555557c2766 in CProxyElement_ampi::ckLocal (this=0x2aaaa0903880) at ampi.decl.h:1650
#4  0x00005555557c58a4 in ampiParent::comm2ampi (this=0x555555d47740, comm=1000002) at ampiimpl.h:2162
#5  0x000055555578e377 in getAmpiInstance (comm=1000002) at ampi.C:3787
#6  0x00005555557907f0 in ampi::suspend (this=0x555555d50f20) at ampi.C:4569
#7  0x0000555555790781 in ampi::barrier (this=0x555555d50f20) at ampi.C:4557
#8  0x00005555557a20e8 in MPI_Comm_free (comm=0x2aaaa0903e80) at ampi.C:9014
#9  0x000055555576dd7c in AMPI_Main_cpp (argc=1, argv=0x555555d4c4e0) at test.C:356
#10 0x0000555555784994 in AMPI_Fallback_Main (argc=1, argv=0x555555d4c4e0) at ampi.C:829
#11 0x00005555557c711d in MPI_threadstart_t::start (this=0x2aaaa0903f68) at ampi.C:1055
#12 0x000055555578518e in AMPI_threadstart (data=0x555555d46690) at ampi.C:1075
#13 0x000055555576efad in startTCharmThread (msg=0x555555d46670) at tcharm.C:163
#14 0x0000555555921d4f in CthStartThread (arg=...) at threads.c:1784
#15 0x000055555592220f in make_fcontext ()
#16 0x0000000000000000 in ?? ()
evan-charmworks commented 5 years ago

This patch addresses a mistake in the implementation of MPI_Comm_free: https://charm.cs.illinois.edu/gerrit/c/charm/+/5164

While it does not fix the issue completely, the failures are now clearer. On Linux, rather than dereferencing a null pointer deep in the ckLocal call, the error now propagates back up to AMPI, where it fails an explicit assertion. On Windows, it now fails a debug-mode std::vector bounds check, and I made a separate patch to introduce an assertion there for all platforms: https://charm.cs.illinois.edu/gerrit/c/charm/+/5165
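
Roughly, the explicit check looks like the following sketch, inferred from the abort message in the Linux backtrace below; the actual patch may differ in detail:

// Sketch inferred from the CmiAbort message in the backtrace below,
// not the exact patch.
static ampi *getAmpiInstance(MPI_Comm comm) {
  ampi *ptr = getAmpiParent()->comm2ampi(comm);
  if (ptr == NULL)
    CmiAbort("AMPI's getAmpiInstance> null pointer\n");
  return ptr;
}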

#0  0x000055555592a9ac in LrtsAbort (message=0x5555559bd180 "AMPI's getAmpiInstance> null pointer\n") at machine.C:545
#1  0x000055555592a204 in CmiAbortHelper (source=0x5555559da9f9 "Called CmiAbort", message=0x5555559bd180 "AMPI's getAmpiInstance> null pointer\n", suggestion=0x0, tellDebugger=1, framesToSkip=0) at machine-common-core.C:1755
#2  0x000055555592a233 in CmiAbort (message=0x5555559bd180 "AMPI's getAmpiInstance> null pointer\n") at machine-common-core.C:1759
#3  0x000055555578f1ca in getAmpiInstance (comm=1000002) at ampi.C:3783
#4  0x0000555555791630 in ampi::block (this=0x555555d54580) at ampi.C:4563
#5  0x00005555557915bd in ampi::barrier (this=0x555555d54580) at ampi.C:4551
#6  0x00005555557a4105 in MPI_Comm_free (comm=0x2aaaa0903e80) at ampi.C:9385
#7  0x000055555576ea90 in AMPI_Main_cpp (argc=1, argv=0x555555d4f560) at test.C:368
#8  0x00005555557857c4 in AMPI_Fallback_Main (argc=1, argv=0x555555d4f560) at ampi.C:829
#9  0x00005555557c981b in MPI_threadstart_t::start (this=0x2aaaa0903f68) at ampi.C:1055
#10 0x0000555555785fbe in AMPI_threadstart (data=0x555555d496b0) at ampi.C:1075
#11 0x000055555576fd07 in startTCharmThread (msg=0x555555d49690) at tcharm.C:163
#12 0x0000555555925d5c in CthStartThread (arg=...) at threads.C:1783
#13 0x000055555592624f in make_fcontext ()
#14 0x0000000000000000 in ?? ()
pgm.exe!std::vector<CkMigratable *,std::allocator<CkMigratable *> >::operator[](const unsigned __int64 _Pos) Line 1363
    at C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.20.27508\include\vector(1363)
pgm.exe!CkArray::recvBroadcast(CkMessage * m) Line 1604
    at charm\multicore-win-x86_64\tmp\ckarray.C(1604)
pgm.exe!CkIndex_CkArray::_call_recvBroadcast_CkMessage(void * impl_msg, void * impl_obj_void) Line 1114
    at charm\multicore-win-x86_64\tmp\CkArray.def.h(1114)
pgm.exe!CkDeliverMessageFree(int epIdx, void * msg, void * obj) Line 571
    at charm\multicore-win-x86_64\tmp\ck.C(571)
pgm.exe!_invokeEntryNoTrace(int epIdx, envelope * env, void * obj) Line 621
    at charm\multicore-win-x86_64\tmp\ck.C(621)
pgm.exe!_invokeEntry(int epIdx, envelope * env, void * obj) Line 640
    at charm\multicore-win-x86_64\tmp\ck.C(640)
pgm.exe!_deliverForBocMsg(CkCoreState * ck, int epIdx, envelope * env, IrrGroup * obj) Line 1092
    at charm\multicore-win-x86_64\tmp\ck.C(1092)
pgm.exe!_processForBocMsg(CkCoreState * ck, envelope * env) Line 1113
    at charm\multicore-win-x86_64\tmp\ck.C(1113)
pgm.exe!_processHandler(void * converseMsg, CkCoreState * ck) Line 1286
    at charm\multicore-win-x86_64\tmp\ck.C(1286)
pgm.exe!CmiHandleMessage(void * msg) Line 1661
    at charm\multicore-win-x86_64\tmp\convcore.C(1661)
pgm.exe!CsdScheduleForever() Line 1914
    at charm\multicore-win-x86_64\tmp\convcore.C(1914)
pgm.exe!CsdScheduler(int maxmsgs) Line 1842
    at charm\multicore-win-x86_64\tmp\convcore.C(1842)
pgm.exe!ConverseRunPE(int everReturn) Line 1597
    at charm\multicore-win-x86_64\tmp\machine-common-core.C(1597)
pgm.exe!call_startfn(void * vindex) Line 172
    at charm\multicore-win-x86_64\tmp\machine-smp.C(172)
[External Code]
evan-charmworks commented 5 years ago

I updated https://charm.cs.illinois.edu/gerrit/c/charm/+/5164 with an additional change, and now Linux is fixed enough to pass verification. Windows, however, presents another error that is new to me.

Unhandled exception thrown: read access violation.
**UsrToEnv**(...) returned 0x19D69670400.

pgm.exe!CkArray::recvBroadcast(CkMessage * m) Line 1609
    at charm\multicore-win-x86_64\tmp\ckarray.C(1609)
pgm.exe!CkIndex_CkArray::_call_recvBroadcast_CkMessage(void * impl_msg, void * impl_obj_void) Line 1114
    at charm\multicore-win-x86_64\tmp\CkArray.def.h(1114)
pgm.exe!CkDeliverMessageFree(int epIdx, void * msg, void * obj) Line 571
    at charm\multicore-win-x86_64\tmp\ck.C(571)
pgm.exe!_invokeEntryNoTrace(int epIdx, envelope * env, void * obj) Line 621
    at charm\multicore-win-x86_64\tmp\ck.C(621)
pgm.exe!_invokeEntry(int epIdx, envelope * env, void * obj) Line 640
    at charm\multicore-win-x86_64\tmp\ck.C(640)
pgm.exe!_deliverForBocMsg(CkCoreState * ck, int epIdx, envelope * env, IrrGroup * obj) Line 1092
    at charm\multicore-win-x86_64\tmp\ck.C(1092)
pgm.exe!_processForBocMsg(CkCoreState * ck, envelope * env) Line 1113
    at charm\multicore-win-x86_64\tmp\ck.C(1113)
pgm.exe!_processHandler(void * converseMsg, CkCoreState * ck) Line 1286
    at charm\multicore-win-x86_64\tmp\ck.C(1286)
pgm.exe!CmiHandleMessage(void * msg) Line 1661
    at charm\multicore-win-x86_64\tmp\convcore.C(1661)
pgm.exe!CsdScheduleForever() Line 1914
    at charm\multicore-win-x86_64\tmp\convcore.C(1914)
pgm.exe!CsdScheduler(int maxmsgs) Line 1842
    at charm\multicore-win-x86_64\tmp\convcore.C(1842)
pgm.exe!ConverseRunPE(int everReturn) Line 1597
    at charm\multicore-win-x86_64\tmp\machine-common-core.C(1597)
pgm.exe!ConverseInit(int argc, char * * argv, void(*)(int, char * *) fn, int usched, int initret) Line 1492
    at charm\multicore-win-x86_64\tmp\machine-common-core.C(1492)
pgm.exe!charm_main(int argc, char * * argv) Line 1845
    at charm\multicore-win-x86_64\tmp\init.C(1845)
pgm.exe!main(int argc, char * * argv) Line 6
    at charm\multicore-win-x86_64\tmp\main.C(6)
[External Code]
evan-charmworks commented 5 years ago

When I build with --enable-tracing --enable-tracing-commthread, I can no longer reproduce the issue.

evan-charmworks commented 5 years ago

I can reproduce this latest issue on Linux with ASan, using the following command:

./build AMPI multicore-linux-x86_64 --suffix=asan -j8 -g3 -fsanitize=address && pushd multicore-linux-x86_64-asan/tests/ampi/megampi && make OPTS="-g3 -fsanitize=address" -j8 && ./pgm +p2 +vp4 +noisomalloc +tcharm_nomig

==20220==ERROR: AddressSanitizer: heap-use-after-free on address 0x60800000114c at pc 0x555555ad8021 bp 0x7fffffffda60 sp 0x7fffffffda50
READ of size 1 at 0x60800000114c thread T0
    #0 0x555555ad8020 in CkArray::recvBroadcast(CkMessage*) charm/multicore-linux-x86_64-asan/tmp/ckarray.C:1613
    #1 0x555555aea008 in CkArray::recvExpeditedBroadcast(CkMessage*) charm/multicore-linux-x86_64-asan/tmp/ckarray.h:686
    #2 0x555555ae0614 in CkIndex_CkArray::_call_recvExpeditedBroadcast_CkMessage(void*, void*) charm/multicore-linux-x86_64-asan/tmp/CkArray.def.h:1212
    #3 0x5555559a9086 in CkDeliverMessageFree charm/multicore-linux-x86_64-asan/tmp/ck.C:569
    #4 0x5555559a95e2 in _invokeEntryNoTrace charm/multicore-linux-x86_64-asan/tmp/ck.C:620
    #5 0x5555559a98a5 in _invokeEntry charm/multicore-linux-x86_64-asan/tmp/ck.C:631
    #6 0x5555559ad9f4 in _deliverForBocMsg charm/multicore-linux-x86_64-asan/tmp/ck.C:1089
    #7 0x5555559adc76 in _processForBocMsg charm/multicore-linux-x86_64-asan/tmp/ck.C:1111
    #8 0x5555559af040 in _processHandler(void*, CkCoreState*) charm/multicore-linux-x86_64-asan/tmp/ck.C:1284
    #9 0x555555c57916 in CmiHandleMessage charm/multicore-linux-x86_64-asan/tmp/convcore.C:1656
    #10 0x555555c58593 in CsdScheduleForever charm/multicore-linux-x86_64-asan/tmp/convcore.C:1914
    #11 0x555555c58397 in CsdScheduler charm/multicore-linux-x86_64-asan/tmp/convcore.C:1842
    #12 0x555555c43db1 in ConverseRunPE charm/multicore-linux-x86_64-asan/tmp/machine-common-core.C:1596
    #13 0x555555c435d5 in ConverseInit charm/multicore-linux-x86_64-asan/tmp/machine-common-core.C:1491
    #14 0x555555988d41 in charm_main charm/multicore-linux-x86_64-asan/tmp/init.C:1845
    #15 0x555555973bcd in main charm/multicore-linux-x86_64-asan/tmp/main.C:5
    #16 0x7ffff6aa609a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a)
    #17 0x55555580e489 in _start (charm/multicore-linux-x86_64-asan/tests/ampi/megampi/pgm+0x2ba489)

0x60800000114c is located 44 bytes inside of 96-byte region [0x608000001120,0x608000001180)
freed by thread T0 here:
    #0 0x7ffff72cab70 in free (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xedb70)
    #1 0x555555c36635 in free_nomigrate charm/multicore-linux-x86_64-asan/tmp/memory.C:967
    #2 0x555555c629a2 in CmiFree charm/multicore-linux-x86_64-asan/tmp/convcore.C:3178
    #3 0x5555559d98ca in CkFreeMsg charm/multicore-linux-x86_64-asan/tmp/msgalloc.C:66
    #4 0x5555559da568 in SafePool<void*>::put(void*) charm/multicore-linux-x86_64-asan/tmp/cklists.h:552
    #5 0x5555559d9250 in CkFreeSysMsg charm/multicore-linux-x86_64-asan/tmp/msgalloc.C:32
    #6 0x5555558c2434 in CkIndex_ampi::_call_setInitDoneFlag_void(void*, void*) charm/multicore-linux-x86_64-asan/tmp/libs/ck-libs/ampi/ampi.def.h:3572
    #7 0x5555559a9086 in CkDeliverMessageFree charm/multicore-linux-x86_64-asan/tmp/ck.C:569
    #8 0x555555a18f0c in CkLocRec::invokeEntry(CkMigratable*, void*, int, bool) charm/multicore-linux-x86_64-asan/tmp/cklocation.C:1978
    #9 0x5555559b8e18 in CkMigratable::ckInvokeEntry(int, void*, bool) charm/multicore-linux-x86_64-asan/tmp/ckmigratable.h:79
    #10 0x555555ad64a1 in CkArrayBroadcaster::deliver(CkArrayMessage*, ArrayElement*, bool) charm/multicore-linux-x86_64-asan/tmp/ckarray.C:1348
    #11 0x555555ad7fcd in CkArray::recvBroadcast(CkMessage*) charm/multicore-linux-x86_64-asan/tmp/ckarray.C:1609
    #12 0x555555aea008 in CkArray::recvExpeditedBroadcast(CkMessage*) charm/multicore-linux-x86_64-asan/tmp/ckarray.h:686
    #13 0x555555ae0614 in CkIndex_CkArray::_call_recvExpeditedBroadcast_CkMessage(void*, void*) charm/multicore-linux-x86_64-asan/tmp/CkArray.def.h:1212
    #14 0x5555559a9086 in CkDeliverMessageFree charm/multicore-linux-x86_64-asan/tmp/ck.C:569
    #15 0x5555559a95e2 in _invokeEntryNoTrace charm/multicore-linux-x86_64-asan/tmp/ck.C:620
    #16 0x5555559a98a5 in _invokeEntry charm/multicore-linux-x86_64-asan/tmp/ck.C:631
    #17 0x5555559ad9f4 in _deliverForBocMsg charm/multicore-linux-x86_64-asan/tmp/ck.C:1089
    #18 0x5555559adc76 in _processForBocMsg charm/multicore-linux-x86_64-asan/tmp/ck.C:1111
    #19 0x5555559af040 in _processHandler(void*, CkCoreState*) charm/multicore-linux-x86_64-asan/tmp/ck.C:1284
    #20 0x555555c57916 in CmiHandleMessage charm/multicore-linux-x86_64-asan/tmp/convcore.C:1656
    #21 0x555555c58593 in CsdScheduleForever charm/multicore-linux-x86_64-asan/tmp/convcore.C:1914
    #22 0x555555c58397 in CsdScheduler charm/multicore-linux-x86_64-asan/tmp/convcore.C:1842
    #23 0x555555c43db1 in ConverseRunPE charm/multicore-linux-x86_64-asan/tmp/machine-common-core.C:1596
    #24 0x555555c435d5 in ConverseInit charm/multicore-linux-x86_64-asan/tmp/machine-common-core.C:1491
    #25 0x555555988d41 in charm_main charm/multicore-linux-x86_64-asan/tmp/init.C:1845
    #26 0x555555973bcd in main charm/multicore-linux-x86_64-asan/tmp/main.C:5
    #27 0x7ffff6aa609a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a)

previously allocated by thread T0 here:
    #0 0x7ffff72caf30 in __interceptor_malloc (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xedf30)
    #1 0x555555c3661b in malloc_nomigrate charm/multicore-linux-x86_64-asan/tmp/memory.C:966
    #2 0x555555c62410 in CmiAlloc charm/multicore-linux-x86_64-asan/tmp/convcore.C:3041
    #3 0x55555598a9d7 in envelope::alloc(unsigned char, unsigned int, unsigned short, GroupDepNum) charm/multicore-linux-x86_64-asan/tmp/envelope.h:340
    #4 0x55555598b362 in _allocEnv(int, int, int, GroupDepNum) charm/multicore-linux-x86_64-asan/tmp/envelope.h:513
    #5 0x55555598b5bb in MsgPool::_alloc() (charm/multicore-linux-x86_64-asan/tests/ampi/megampi/pgm+0x4375bb)
    #6 0x55555598d764 in SafePool<void*>::SafePool(void* (*)(), void (*)(void*), void (*)(void*)) (charm/multicore-linux-x86_64-asan/tests/ampi/megampi/pgm+0x439764)
    #7 0x55555598b6ba in MsgPool::MsgPool() (charm/multicore-linux-x86_64-asan/tests/ampi/megampi/pgm+0x4376ba)
    #8 0x555555987362 in _initCharm(int, char**) charm/multicore-linux-x86_64-asan/tmp/init.C:1573
    #9 0x555555c43d9d in ConverseRunPE charm/multicore-linux-x86_64-asan/tmp/machine-common-core.C:1595
    #10 0x555555c435d5 in ConverseInit charm/multicore-linux-x86_64-asan/tmp/machine-common-core.C:1491
    #11 0x555555988d41 in charm_main charm/multicore-linux-x86_64-asan/tmp/init.C:1845
    #12 0x555555973bcd in main charm/multicore-linux-x86_64-asan/tmp/main.C:5
    #13 0x7ffff6aa609a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a)
evan-charmworks commented 5 years ago

Fixed