charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0
206 stars 50 forks source link

Charm pingpong hangs intermittently for mpi-win-x86_64-smp build #1947

Closed nitbhat closed 6 years ago

nitbhat commented 6 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/1947


Link to the failed build: http://charm.cs.illinois.edu/autobuild/old.2018_07_16__01_01/mpi-win-x86_64-smp.txt

In addition to the 3 process run, I also found that it hangs for the 4 process run.

Running on 4 processors:  ./pgm
charmrun> /cygdrive/c/Program Files/Microsoft MPI/Bin/mpiexec -n 4  ./pgm

Charm++> Running on MPI version: 2.0
Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired: MPI_THREAD_FUNNELED)
Charm++> Running in SMP mode: 4 processes, 1 worker threads (PEs) + 1 comm threads per process, 0 PEs total
Charm++> The comm. thread both sends and receives messages
Charm++ warning> fences and atomic operations not available in native assembly
Converse/Charm++ Commit ID: b1333fa
Charm++> Disabling isomalloc because mmap() does not work.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (1 sockets x 4 cores x 2 PUs = 8-way SMP)
Charm++> cpu topology info is gathered in 0.046 seconds.
Megatest is running on 4 nodes 4 processors.
test 0: initiated [groupring (milind)]
evan-charmworks commented 5 years ago

Original date: 2018-07-26 17:15:05


I can't manually reproduce this issue on Dexterity, even when copying the build lines from the autobuild log verbatim.

evan-charmworks commented 5 years ago

Original date: 2018-07-26 20:35:28


mpi-win-x86_64-smp issues seem to be nondeterministic. In today's autobuild http://charm.cs.illinois.edu/autobuild/old.2018_07_26__01_01/mpi-win-x86_64-smp.txt megatest passes, but charm++/pingpong hangs with -n 2. I tried testing manually and instead of hanging, charm++/pingpong aborts with exit code 127 during/after testing Groups.

(gdb) t 1
[Switching to thread 1 (Thread 4492.0xe08)]
(gdb) bt
#0  0x00000000776b32c4 in ntdll!EtwEventWriteStartScenario () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1  0x00000000776079fc in ntdll!RtlAllocateHeap () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#2  0x00000000776b26ed in ntdll!EtwEventWriteStartScenario () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#3  0x00000000776b33c4 in ntdll!EtwEventWriteStartScenario () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#4  0x00000000776079fc in ntdll!RtlAllocateHeap () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#5  0x000000014011dc3a in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 2
[Switching to thread 2 (Thread 4492.0x156c)]
(gdb) bt
#0  0x0000000077649e5a in ntdll!ZwWaitForMultipleObjects () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1  0x00000000775f323d in ntdll!RtlIsCriticalSectionLockedByThread () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#2  0x00000000773d59cd in KERNEL32!BaseThreadInitThunk () from /cygdrive/c/Windows/system32/kernel32.dll
#3  0x000000007763383d in ntdll!RtlUserThreadStart () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#4  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 3
[Switching to thread 3 (Thread 4492.0x1978)]
(gdb) bt
#0  0x000000007764993a in ntdll!ZwRemoveIoCompletion () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1  0x000007fefd5816ad in KERNELBASE!GetQueuedCompletionStatus () from /cygdrive/c/Windows/system32/KERNELBASE.dll
#2  0x00000000773c9991 in KERNEL32!GetQueuedCompletionStatus () from /cygdrive/c/Windows/system32/kernel32.dll
#3  0x000007fef05d4edc in ?? () from /cygdrive/c/Windows/system32/msmpi.dll
#4  0x000007fef05c82ac in ?? () from /cygdrive/c/Windows/system32/msmpi.dll
#5  0x000007fef0662805 in pmpi_conversion_fn_null__ () from /cygdrive/c/Windows/system32/msmpi.dll
#6  0x000007fef06191f5 in msmpi!PMPI_Iprobe () from /cygdrive/c/Windows/system32/msmpi.dll
#7  0x000000013ffb535b in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 4
Thread ID 4 not known.
evan-charmworks commented 5 years ago

Original date: 2018-07-30 20:20:02


Using my local machine, I managed to compile mpi-win-x86_64-smp-gcc with Microsoft MPI 9.1 and MinGW-w64 GCC 7.3.0 after fighting with things for a while.

https://www.symscape.com/configure-msmpi-v8-1-for-mingw-w64 was helpful for this, but I needed to patch to Charm++, and also handle use of MSVC-specific __int64 in mpi.h.

In this configuration I could get real backtraces by running "C:\Program Files\Microsoft MPI\Bin\mpiexec.exe" -n 2 cmd.exe /c start C:\MinGW-w64\mingw64\bin\gdb.exe --args pgm.exe.

(gdb) p _Cmi_numnodes
$8 = 2
(gdb) p _Cmi_mynode
$9 = 0
(gdb) t 1
[Switching to thread 1 (Thread 12652.0x2ecc)]
(gdb) bt
#0  0x00007fff085a9f34 in ntdll!ZwWaitForSingleObject () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007fff04b79252 in WaitForSingleObjectEx () from C:\WINDOWS\System32\KernelBase.dll
#2  0x00000000004982fc in CmiIdleLock_sleep (l=0x9795838, msTimeout=10) at machine-smp.c:588
#3  0x000000000049b1a4 in CmiNotifyStillIdle (s=0x986af30) at machine-common-core.C:1900
#4  0x00000000004a5401 in call_cblist_keep (l=0x97ae0b0, curWallTime=1.4869999885559082) at conv-conds.c:160
#5  0x00000000004a647f in CcdRaiseCondition (condnum=2) at conv-conds.c:524
#6  0x000000000049f380 in CsdStillIdle () at convcore.c:1624
#7  0x000000000049f7f9 in CsdScheduleForever () at convcore.c:1945
#8  0x000000000049f70b in CsdScheduler (maxmsgs=-1) at convcore.c:1861
#9  0x000000000049acc6 in ConverseRunPE (everReturn=0) at machine-common-core.C:1601
#10 0x000000000049a92a in ConverseInit (argc=1, argv=0xb1ab0, fn=0x411189 <_initCharm(int, char**)>, usched=0,
    initret=0) at machine-common-core.C:1494
#11 0x000000000041391b in charm_main (argc=1, argv=0xb1ab0) at init.C:1701
#12 0x000000000040f15f in main (argc=1, argv=0xb1ab0) at main.C:5
(gdb) bt full
#0  0x00007fff085a9f34 in ntdll!ZwWaitForSingleObject () from C:\WINDOWS\SYSTEM32\ntdll.dll
No symbol table info available.
#1  0x00007fff04b79252 in WaitForSingleObjectEx () from C:\WINDOWS\System32\KernelBase.dll
No symbol table info available.
#2  0x00000000004982fc in CmiIdleLock_sleep (l=0x9795838, msTimeout=10) at machine-smp.c:588
No locals.
#3  0x000000000049b1a4 in CmiNotifyStillIdle (s=0x986af30) at machine-common-core.C:1900
No locals.
#4  0x00000000004a5401 in call_cblist_keep (l=0x97ae0b0, curWallTime=1.4869999885559082) at conv-conds.c:160
        old = -2
        unused = 0
        i = 0
        len = 1
        idx = 0
#5  0x00000000004a647f in CcdRaiseCondition (condnum=2) at conv-conds.c:524
        curWallTime = 1.4869999885559082
#6  0x000000000049f380 in CsdStillIdle () at convcore.c:1624
No locals.
#7  0x000000000049f7f9 in CsdScheduleForever () at convcore.c:1945
        isIdle = 1
        msg = 0x0
        CsdStopFlag_ptr = 0x985c0c0
        cycle = 0
        state = {localQ = 0x9796210, nodeQ = 0x9860920, schedQ = 0x985c1c0, localCounter = 0x985c140,
          nodeLock = 0x26c}
#8  0x000000000049f70b in CsdScheduler (maxmsgs=-1) at convcore.c:1861
No locals.
#9  0x000000000049acc6 in ConverseRunPE (everReturn=0) at machine-common-core.C:1601
        cs = 0x9795820
        CmiMyArgv = 0xb1ab0
#10 0x000000000049a92a in ConverseInit (argc=1, argv=0xb1ab0, fn=0x411189 <_initCharm(int, char**)>, usched=0,
    initret=0) at machine-common-core.C:1494
        _ii = 0
        tmp = <optimized out>
        stdoutbase = 0x0
        stdoutpath = 0xb0000 ""
        ppnSet = 0
        npes = 1
        plusPSet = 0
        auto_provision = 0
        onewth_per_host = 0
        onewth_per_socket = 0
        onewth_per_core = 0
        onewth_per_pu = 0
        onewth_active = 0
#11 0x000000000041391b in charm_main (argc=1, argv=0xb1ab0) at init.C:1701
        stack_top = 0
#12 0x000000000040f15f in main (argc=1, argv=0xb1ab0) at main.C:5
No locals.
(gdb) t 2
[Switching to thread 2 (Thread 12652.0x330c)]
(gdb) bt
#0  0x00007fff085ad804 in ntdll!ZwWaitForWorkViaWorkerFactory () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007fff0852f856 in ntdll!RtlReleaseSRWLockExclusive () from C:\WINDOWS\SYSTEM32\ntdll.dll
#2  0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#3  0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#4  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 3
[Switching to thread 3 (Thread 12652.0x2bcc)]
(gdb) bt
#0  0x00007fff085ad804 in ntdll!ZwWaitForWorkViaWorkerFactory () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007fff0852f856 in ntdll!RtlReleaseSRWLockExclusive () from C:\WINDOWS\SYSTEM32\ntdll.dll
#2  0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#3  0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#4  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 4
[Switching to thread 4 (Thread 12652.0x18dc)]
(gdb) bt
#0  0x00007fff085ad804 in ntdll!ZwWaitForWorkViaWorkerFactory () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007fff0852f856 in ntdll!RtlReleaseSRWLockExclusive () from C:\WINDOWS\SYSTEM32\ntdll.dll
#2  0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#3  0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#4  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 5
[Switching to thread 5 (Thread 12652.0x24bc)]
(gdb) bt
#0  0x00007fff085ad804 in ntdll!ZwWaitForWorkViaWorkerFactory () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007fff0852f856 in ntdll!RtlReleaseSRWLockExclusive () from C:\WINDOWS\SYSTEM32\ntdll.dll
#2  0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#3  0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#4  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 6
Unknown thread 6.
(gdb) p _Cmi_numnodes
$5 = 2
(gdb) p _Cmi_mynode
$6 = 1
(gdb) t 1
[Switching to thread 1 (Thread 5040.0x152c)]
(gdb) bt
#0  0x00007fff085a9f34 in ntdll!ZwWaitForSingleObject () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007fff04b79252 in WaitForSingleObjectEx () from C:\WINDOWS\System32\KernelBase.dll
#2  0x00000000004982fc in CmiIdleLock_sleep (l=0x98a44f8, msTimeout=10) at machine-smp.c:588
#3  0x000000000049b1a4 in CmiNotifyStillIdle (s=0x997a2f0) at machine-common-core.C:1900
#4  0x00000000004a5401 in call_cblist_keep (l=0x98bed10, curWallTime=13.652999877929688) at conv-conds.c:160
#5  0x00000000004a647f in CcdRaiseCondition (condnum=2) at conv-conds.c:524
#6  0x000000000049f380 in CsdStillIdle () at convcore.c:1624
#7  0x000000000049f7f9 in CsdScheduleForever () at convcore.c:1945
#8  0x000000000049f70b in CsdScheduler (maxmsgs=-1) at convcore.c:1861
#9  0x000000000049acc6 in ConverseRunPE (everReturn=0) at machine-common-core.C:1601
#10 0x000000000049a92a in ConverseInit (argc=1, argv=0x1d1ab0, fn=0x411189 <_initCharm(int, char**)>, usched=0,
    initret=0) at machine-common-core.C:1494
#11 0x000000000041391b in charm_main (argc=1, argv=0x1d1ab0) at init.C:1701
#12 0x000000000040f15f in main (argc=1, argv=0x1d1ab0) at main.C:5
(gdb) bt full
#0  0x00007fff085a9f34 in ntdll!ZwWaitForSingleObject () from C:\WINDOWS\SYSTEM32\ntdll.dll
No symbol table info available.
#1  0x00007fff04b79252 in WaitForSingleObjectEx () from C:\WINDOWS\System32\KernelBase.dll
No symbol table info available.
#2  0x00000000004982fc in CmiIdleLock_sleep (l=0x98a44f8, msTimeout=10) at machine-smp.c:588
No locals.
#3  0x000000000049b1a4 in CmiNotifyStillIdle (s=0x997a2f0) at machine-common-core.C:1900
No locals.
#4  0x00000000004a5401 in call_cblist_keep (l=0x98bed10, curWallTime=13.652999877929688) at conv-conds.c:160
        old = -2
        unused = 0
        i = 0
        len = 1
        idx = 0
#5  0x00000000004a647f in CcdRaiseCondition (condnum=2) at conv-conds.c:524
        curWallTime = 13.652999877929688
#6  0x000000000049f380 in CsdStillIdle () at convcore.c:1624
No locals.
#7  0x000000000049f7f9 in CsdScheduleForever () at convcore.c:1945
        isIdle = 1
        msg = 0x0
        CsdStopFlag_ptr = 0x9966ae0
        cycle = 0
        state = {localQ = 0x98a4ed0, nodeQ = 0x996fb60, schedQ = 0x996b400, localCounter = 0x996b380,
          nodeLock = 0x26c}
#8  0x000000000049f70b in CsdScheduler (maxmsgs=-1) at convcore.c:1861
No locals.
#9  0x000000000049acc6 in ConverseRunPE (everReturn=0) at machine-common-core.C:1601
        cs = 0x98a44e0
        CmiMyArgv = 0x1d1ab0
#10 0x000000000049a92a in ConverseInit (argc=1, argv=0x1d1ab0, fn=0x411189 <_initCharm(int, char**)>, usched=0,
    initret=0) at machine-common-core.C:1494
        _ii = 0
        tmp = <optimized out>
        stdoutbase = 0x0
        stdoutpath = 0x1d0000 ""
        ppnSet = 0
        npes = 1
        plusPSet = 0
        auto_provision = 0
        onewth_per_host = 0
        onewth_per_socket = 0
        onewth_per_core = 0
        onewth_per_pu = 0
        onewth_active = 0
#11 0x000000000041391b in charm_main (argc=1, argv=0x1d1ab0) at init.C:1701
        stack_top = 0
#12 0x000000000040f15f in main (argc=1, argv=0x1d1ab0) at main.C:5
No locals.
(gdb) t 2
[Switching to thread 2 (Thread 5040.0x15a4)]
(gdb) bt
#0  0x00007fff085ad804 in ntdll!ZwWaitForWorkViaWorkerFactory () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007fff0852f856 in ntdll!RtlReleaseSRWLockExclusive () from C:\WINDOWS\SYSTEM32\ntdll.dll
#2  0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#3  0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#4  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 3
[Switching to thread 3 (Thread 5040.0x1554)]
(gdb) bt
#0  0x00007fff085ad804 in ntdll!ZwWaitForWorkViaWorkerFactory () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007fff0852f856 in ntdll!RtlReleaseSRWLockExclusive () from C:\WINDOWS\SYSTEM32\ntdll.dll
#2  0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#3  0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#4  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 4
[Switching to thread 4 (Thread 5040.0xc8)]
(gdb) bt
#0  0x00007fff085ad804 in ntdll!ZwWaitForWorkViaWorkerFactory () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007fff0852f856 in ntdll!RtlReleaseSRWLockExclusive () from C:\WINDOWS\SYSTEM32\ntdll.dll
#2  0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#3  0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#4  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 5
[Switching to thread 5 (Thread 5040.0x954)]
(gdb) bt
#0  0x00007fff085ad804 in ntdll!ZwWaitForWorkViaWorkerFactory () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007fff0852f856 in ntdll!RtlReleaseSRWLockExclusive () from C:\WINDOWS\SYSTEM32\ntdll.dll
#2  0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#3  0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#4  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 6
[Switching to thread 6 (Thread 5040.0x17b4)]
(gdb) bt
#0  0x0000000064943d68 in pthread_spin_lock () from C:\MinGW-w64\mingw64\opt\bin\libwinpthread-1.dll
#1  0x00000000649450e6 in pthread_getspecific () from C:\MinGW-w64\mingw64\opt\bin\libwinpthread-1.dll
#2  0x000000006144d801 in libgcc_s_seh-1!.emutls_get_address () from C:\MinGW-w64\mingw64\opt\bin\libgcc_s_seh-1.dll
#3  0x000000000049bd83 in ReleasePostedMessages () at machine.C:606
#4  0x000000000049c60a in LrtsAdvanceCommunication (whenidle=1) at machine.C:1188
#5  0x000000000049ad00 in AdvanceCommunication (whenidle=1) at machine-common-core.C:1621
#6  0x000000000049ad1c in CommunicationServer (sleepTime=5) at machine-common-core.C:1646
#7  0x000000000049ad74 in CommunicationServerThread (sleepTime=5) at machine-common-core.C:1665
#8  0x000000000049ac8c in ConverseRunPE (everReturn=0) at machine-common-core.C:1596
#9  0x0000000000497edf in call_startfn (vindex=0x1) at machine-smp.c:159
#10 0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#11 0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#12 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) bt full
#0  0x0000000064943d68 in pthread_spin_lock () from C:\MinGW-w64\mingw64\opt\bin\libwinpthread-1.dll
No symbol table info available.
#1  0x00000000649450e6 in pthread_getspecific () from C:\MinGW-w64\mingw64\opt\bin\libwinpthread-1.dll
No symbol table info available.
#2  0x000000006144d801 in libgcc_s_seh-1!.emutls_get_address () from C:\MinGW-w64\mingw64\opt\bin\libgcc_s_seh-1.dll
No symbol table info available.
#3  0x000000000049bd83 in ReleasePostedMessages () at machine.C:606
        msg_tmp = 0x0
        prev = 0x0
        temp = 0x0
        done = 1
        sts = {internal = {168623584, 0}, MPI_SOURCE = 4834491, MPI_TAG = 0, MPI_ERROR = 0}
#4  0x000000000049c60a in LrtsAdvanceCommunication (whenidle=1) at machine.C:1188
No locals.
#5  0x000000000049ad00 in AdvanceCommunication (whenidle=1) at machine-common-core.C:1621
        doProcessBcast = 1
#6  0x000000000049ad1c in CommunicationServer (sleepTime=5) at machine-common-core.C:1646
No locals.
#7  0x000000000049ad74 in CommunicationServerThread (sleepTime=5) at machine-common-core.C:1665
No locals.
#8  0x000000000049ac8c in ConverseRunPE (everReturn=0) at machine-common-core.C:1596
        cs = 0x98a4508
        CmiMyArgv = 0x98a5cd0
#9  0x0000000000497edf in call_startfn (vindex=0x1) at machine-smp.c:159
        index = 1
        state = 0x98a4508
#10 0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
No symbol table info available.
#11 0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
No symbol table info available.
#12 0x0000000000000000 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 7
[Switching to thread 7 (Thread 5040.0x4e4)]
#0  0x00007fff04bd49bf in TlsGetValue () from C:\WINDOWS\System32\KernelBase.dll
(gdb) bt
#0  0x00007fff04bd49bf in TlsGetValue () from C:\WINDOWS\System32\KernelBase.dll
#1  0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#2  0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#3  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 8
Unknown thread 8.

Does anyone see anything clearly wrong here? The spinlock in node 1, thread 6 stands out to me.

evan-charmworks commented 5 years ago

Original date: 2018-07-30 20:40:29


False alarm on the spinlock. It looks like that thread is the comm thread, and it is trying to pump messages:

(gdb) t 6
[Switching to thread 6 (Thread 5040.0x17b4)]
(gdb) bt
#0  0x00007fff085a9fd4 in ntdll!ZwRemoveIoCompletion () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007fff04b92e92 in KERNELBASE!GetQueuedCompletionStatus () from C:\WINDOWS\System32\KernelBase.dll
#2  0x00007ffed7e0039c in pmpi_type_create_f90_integer__ () from C:\WINDOWS\SYSTEM32\msmpi.dll
#3  0x00007ffed7deb933 in pmpi_type_create_f90_integer__ () from C:\WINDOWS\SYSTEM32\msmpi.dll
#4  0x00007ffed7e69245 in msmpi!PMPI_Cancel () from C:\WINDOWS\SYSTEM32\msmpi.dll
#5  0x00007ffed7e4b1c6 in msmpi!PMPI_Iprobe () from C:\WINDOWS\SYSTEM32\msmpi.dll
#6  0x000000000049c110 in PumpMsgs () at machine.C:841
#7  0x000000000049c605 in LrtsAdvanceCommunication (whenidle=1) at machine.C:1182
#8  0x000000000049ad00 in AdvanceCommunication (whenidle=1) at machine-common-core.C:1621
#9  0x000000000049ad1c in CommunicationServer (sleepTime=5) at machine-common-core.C:1646
#10 0x000000000049ad74 in CommunicationServerThread (sleepTime=5) at machine-common-core.C:1665
#11 0x000000000049ac8c in ConverseRunPE (everReturn=0) at machine-common-core.C:1596
#12 0x0000000000497edf in call_startfn (vindex=0x1) at machine-smp.c:159
#13 0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#14 0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#15 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) bt full
#0  0x00007fff085a9fd4 in ntdll!ZwRemoveIoCompletion () from C:\WINDOWS\SYSTEM32\ntdll.dll
No symbol table info available.
#1  0x00007fff04b92e92 in KERNELBASE!GetQueuedCompletionStatus () from C:\WINDOWS\System32\KernelBase.dll
No symbol table info available.
#2  0x00007ffed7e0039c in pmpi_type_create_f90_integer__ () from C:\WINDOWS\SYSTEM32\msmpi.dll
No symbol table info available.
#3  0x00007ffed7deb933 in pmpi_type_create_f90_integer__ () from C:\WINDOWS\SYSTEM32\msmpi.dll
No symbol table info available.
#4  0x00007ffed7e69245 in msmpi!PMPI_Cancel () from C:\WINDOWS\SYSTEM32\msmpi.dll
No symbol table info available.
#5  0x00007ffed7e4b1c6 in msmpi!PMPI_Iprobe () from C:\WINDOWS\SYSTEM32\msmpi.dll
No symbol table info available.
#6  0x000000000049c110 in PumpMsgs () at machine.C:841
        doSyncRecv = 1
        nbytes = 144
        flg = 0
        res = 160053872
        msg = 0x49c4bb <SendMsgBuf()+30> "H%EoHƒ}o"
        sts = {internal = {144, 0}, MPI_SOURCE = 0, MPI_TAG = 1375, MPI_ERROR = 169680320}
        recd = 0
        recvBufferTmp = 0x0
        prev = 0x0
        temp = 0x7fff04b79252 <WaitForSingleObjectEx+162>
        opDone = 0
        allOpsDone = 61693
        i = 1300022429
#7  0x000000000049c605 in LrtsAdvanceCommunication (whenidle=1) at machine.C:1182
No locals.
#8  0x000000000049ad00 in AdvanceCommunication (whenidle=1) at machine-common-core.C:1621
        doProcessBcast = 1
#9  0x000000000049ad1c in CommunicationServer (sleepTime=5) at machine-common-core.C:1646
No locals.
#10 0x000000000049ad74 in CommunicationServerThread (sleepTime=5) at machine-common-core.C:1665
No locals.
#11 0x000000000049ac8c in ConverseRunPE (everReturn=0) at machine-common-core.C:1596
        cs = 0x98a4508
        CmiMyArgv = 0x98a5cd0
#12 0x0000000000497edf in call_startfn (vindex=0x1) at machine-smp.c:159
        index = 1
        state = 0x98a4508
#13 0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
No symbol table info available.
#14 0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
No symbol table info available.
#15 0x0000000000000000 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
evan-charmworks commented 5 years ago

Original date: 2018-07-30 20:58:05


Some verbose tracing messages from mpiexec:

"C:\Program Files\Microsoft MPI\Bin\mpiexec.exe" -debug 3 -n 2 cmd.exe /c start C:\MinGW-w64\mingw64\bin\gdb.exe --args pgm.exe
[00:13156] host tree:
[00:13156]  host: DESKTOP-760DEQ3, parent: 0, id: 1
[00:13156] mpiexec started smpd manager listening on port 78ef4187-9072-4460-a4a1-8b12ea139fe3
[00:13156] create manager process (using mpiexec credentials)
[00:13156] Lauching smpd as 'C:\Program Files\Microsoft MPI\Bin\smpd.exe "C:\Program Files\Microsoft MPI\Bin\smpd.exe" -p 8677 -d 11 -mgr 272 "job" -localonly'
[00:13156] smpd reading the port string from the manager
[-1:11980] Launching smpd manager instance.
[-1:11980] created set for manager listener 128
[-1:11980] smpd manager listening on port 949baea4-a466-4bb7-88ba-14f3dfab8d36
[00:13156] closing the pipe to the manager
[00:13156] DESKTOP-760DEQ3 posting a re-connect to DESKTOP-760DEQ3:949baea4-a466-4bb7-88ba-14f3dfab8d36 in left child context.
[-1:11980] version check complete, using PMP version 4.
[-1:11980] Received session header from parent id=1, parent=0, level=0
[01:11980] Connecting back to parent using host DESKTOP-760DEQ3 and endpoint 78ef4187-9072-4460-a4a1-8b12ea139fe3
[00:13156] version check complete, using PMP version 4.
[00:13156] posting command SMPD_COLLECT to left child, src=0, dest=1.
[01:11980] handling command SMPD_COLLECT src=0
[00:13156] Handling cmd=SMPD_COLLECT result
[00:13156] cmd=SMPD_COLLECT result will be handled locally
[00:13156] Finished collecting hardware summary.
[00:13156] posting command SMPD_STARTDBS to left child, src=0, dest=1.
[01:11980] handling command SMPD_STARTDBS src=0
[01:11980] sending start_dbs result command kvs = e7b71c4a-4faf-456c-b31e-cc093f6ec6ae.
[00:13156] Handling cmd=SMPD_STARTDBS result
[00:13156] cmd=SMPD_STARTDBS result will be handled locally
[00:13156] start_dbs succeeded, kvs_name: 'e7b71c4a-4faf-456c-b31e-cc093f6ec6ae', domain_name: '78246d54-4e0c-4897-a2b3-b28534f4602a'
[00:13156] creating a process group of size 2 on node 0 called e7b71c4a-4faf-456c-b31e-cc093f6ec6ae
[00:13156] launching the processes.
[00:13156] posting command SMPD_LAUNCH to left child, src=0, dest=1.
[01:11980] handling command SMPD_LAUNCH src=0
[01:11980] Successfully handled bcast nodeids command.
[01:11980] setting environment variable: <MPIEXEC_HOSTNAME> = <DESKTOP-760DEQ3>
[01:11980] env: PMI_SIZE=2
[01:11980] env: PMI_KVS=e7b71c4a-4faf-456c-b31e-cc093f6ec6ae
[01:11980] env: PMI_DOMAIN=78246d54-4e0c-4897-a2b3-b28534f4602a
[01:11980] env: PMI_HOST=localhost
[01:11980] env: PMI_PORT=949baea4-a466-4bb7-88ba-14f3dfab8d36
[01:11980] env: PMI_SMPD_ID=1
[01:11980] env: PMI_APPNUM=0
[01:11980] env: PMI_SPAWN=0
[01:11980] env: PMI_NODE_IDS=s
[01:11980] env: PMI_RANK_AFFINITIES=affinity_region_11980
[01:11980] searching for 'cmd.exe' in workdir 'C:\msys64\home\Evan\charm\mpi-win-x86_64-smp-gcc\tests\charm++\pingpong'
[01:11980] searching for 'cmd.exe' in path ''
[01:11980] searching for 'cmd.exe' in system path
[01:11980] C:\msys64\home\Evan\charm\mpi-win-x86_64-smp-gcc\tests\charm++\pingpong>CreateProcess(C:\WINDOWS\SYSTEM32\cmd.exe cmd.exe /c start C:\MinGW-w64\mingw64\bin\gdb.exe --args pgm.exe)
[01:11980] env: PMI_RANK=1
[01:11980] env: PMI_SMPD_KEY=0
[01:11980] C:\msys64\home\Evan\charm\mpi-win-x86_64-smp-gcc\tests\charm++\pingpong>CreateProcess(C:\WINDOWS\SYSTEM32\cmd.exe cmd.exe /c start C:\MinGW-w64\mingw64\bin\gdb.exe --args pgm.exe)
[01:11980] env: PMI_RANK=0
[01:11980] env: PMI_SMPD_KEY=1
[00:13156] Handling cmd=SMPD_LAUNCH result
[00:13156] cmd=SMPD_LAUNCH result will be handled locally
[00:13156] successfully launched process 1
[00:13156] successfully launched process 0
[00:13156] root process launched, starting stdin redirection.
[01:11980] version check complete, using PMP version 4.
[01:11980] forwarding command SMPD_INIT src=1 ctx_key=1
[01:11980] 1 -> 0 : returning parent_context: 0 < 1
[01:11980] posting command SMPD_INIT to parent, src=1, ctx_key=1, dest=0.
[00:13156] handling command SMPD_INIT src=1 ctx_key=1
[00:13156] init: 0:2:e7b71c4a-4faf-456c-b31e-cc093f6ec6ae
[01:11980] Handling cmd=SMPD_INIT result
[01:11980] forward SMPD_INIT result to dest=1 ctx_key=1
[01:11980] handling command SMPD_BCPUT src=1 ctx_key=1
[01:11980] Handling SMPD_BCPUT command from smpd 1
        ctx_key=1
        rank=0
        value=port=0 description="[redacted IPs] DESKTOP-760DEQ3 " shm_host=DESKTOP-760DEQ3 shm_queue=9452:312
        result=success
[01:11980] handling command SMPD_BARRIER src=1 ctx_key=1
[01:11980] Handling SMPD_BARRIER src=1 ctx_key=1
[01:11980] initializing barrier(e7b71c4a-4faf-456c-b31e-cc093f6ec6ae): in=1 size=2
[01:11980] incrementing barrier(e7b71c4a-4faf-456c-b31e-cc093f6ec6ae) incount from 0 to 1 out of 2
[01:11980] version check complete, using PMP version 4.
[01:11980] forwarding command SMPD_INIT src=1 ctx_key=0
[01:11980] 1 -> 0 : returning parent_context: 0 < 1
[01:11980] posting command SMPD_INIT to parent, src=1, ctx_key=0, dest=0.
[00:13156] handling command SMPD_INIT src=1 ctx_key=0
[00:13156] init: 1:2:e7b71c4a-4faf-456c-b31e-cc093f6ec6ae
[01:11980] Handling cmd=SMPD_INIT result
[01:11980] forward SMPD_INIT result to dest=1 ctx_key=0
[01:11980] handling command SMPD_BCPUT src=1 ctx_key=0
[01:11980] Handling SMPD_BCPUT command from smpd 1
        ctx_key=0
        rank=1
        value=port=0 description="[redacted IPs] DESKTOP-760DEQ3 " shm_host=DESKTOP-760DEQ3 shm_queue=9376:312
        result=success
[01:11980] handling command SMPD_BARRIER src=1 ctx_key=0
[01:11980] Handling SMPD_BARRIER src=1 ctx_key=0
[01:11980] incrementing barrier(e7b71c4a-4faf-456c-b31e-cc093f6ec6ae) incount from 1 to 2 out of 2
[01:11980] all in barrier, release the barrier.
[01:11980] sending reply to barrier command 'e7b71c4a-4faf-456c-b31e-cc093f6ec6ae'.
[01:11980] sending reply to barrier command 'e7b71c4a-4faf-456c-b31e-cc093f6ec6ae'.
[01:11980] handling command SMPD_BCGET src=1 ctx_key=1
[01:11980] Handling SMPD_BCGET command from smpd 1
        ctx_key=1
        rank=1
        value=port=0 description="[redacted IPs] DESKTOP-760DEQ3 " shm_host=DESKTOP-760DEQ3 shm_queue=9376:312
        result=success
[01:11980] handling command SMPD_BCGET src=1 ctx_key=0
[01:11980] Handling SMPD_BCGET command from smpd 1
        ctx_key=0
        rank=0
        value=port=0 description="[redacted IPs] DESKTOP-760DEQ3 " shm_host=DESKTOP-760DEQ3 shm_queue=9452:312
        result=success
[01:11980] handling command SMPD_BCGET src=1 ctx_key=0
[01:11980] Handling SMPD_BCGET command from smpd 1
        ctx_key=0
        rank=0
        value=port=0 description="[redacted IPs] DESKTOP-760DEQ3 " shm_host=DESKTOP-760DEQ3 shm_queue=9452:312
        result=success
evan-charmworks commented 5 years ago

Original date: 2018-07-30 22:35:42


If I change warmupRun = true; to warmupRun = false;, then pingpong prints that 1D Arrays and 1D Threaded Arrays tests complete, but execution gets no further. The next test is 1D Arrays with the zero copy message send API. If I set a breakpoint on Ping1::recv_zerocopy, it is never reached. LrtsIssueRgets is reached on the receiver but this does not lead to the chare method running.

The MPI machine layer does not compile if I change CMK_ONESIDED_IMPL and CMK_ONESIDED_DIRECT_IMPL to 0 for testing.

evan-charmworks commented 5 years ago

Original date: 2018-07-31 20:44:10


I am now comparing mpi-win-x86_64-smp to mpi-linux-x86_64-smp. My comparable launch line on Linux is mpiexec -n 2 xterm -e gdb ./pgm, or more specifically, mpiexec -n 2 xterm -e gdb -ex "b Ping1::recv_zerocopy" -ex "b CProxyElement_Ping1::recv_zerocopy" -ex "b LrtsIssueRgets" -ex r ./pgm.

evan-charmworks commented 5 years ago

Original date: 2018-07-31 21:04:28


Handing it off to Nitin.

nitbhat commented 5 years ago

Original date: 2018-08-06 18:56:13


Fix: https://charm.cs.illinois.edu/gerrit/#/c/charm/+/4450/ https://github.com/UIUC-PPL/charm/commit/bf8f98b377ef95456a43682ee5d4ad749fe00cbb

The hang was because MPI was initialized in MPI_THREAD_SINGLE mode, but instead of the comm. thread, the worker thread was posting the recv buffer. The fix was to have the comm thread post the buffer instead of the worker thread.

This commit also includes another fix related to the zerocopy api used for nodegroups. For nodegroups, the pingpong example crashed for mpi-win-x86_64-smp when run on 1 PE because of a stack overflow. This was due to inlining of the send method causing recursive calls which caused stack overflow when run with a large number of iterations (~900). The fix was to not use inlining in this code path.