Closed nitbhat closed 6 years ago
Original date: 2018-07-26 17:15:05
I can't manually reproduce this issue on Dexterity, even when copying the build lines from the autobuild log verbatim.
Original date: 2018-07-26 20:35:28
mpi-win-x86_64-smp
issues seem to be nondeterministic. In today's autobuild http://charm.cs.illinois.edu/autobuild/old.2018_07_26__01_01/mpi-win-x86_64-smp.txt megatest passes, but charm++/pingpong hangs with -n 2
. I tried testing manually and instead of hanging, charm++/pingpong aborts with exit code 127 during/after testing Groups.
(gdb) t 1
[Switching to thread 1 (Thread 4492.0xe08)]
(gdb) bt
#0 0x00000000776b32c4 in ntdll!EtwEventWriteStartScenario () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1 0x00000000776079fc in ntdll!RtlAllocateHeap () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#2 0x00000000776b26ed in ntdll!EtwEventWriteStartScenario () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#3 0x00000000776b33c4 in ntdll!EtwEventWriteStartScenario () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#4 0x00000000776079fc in ntdll!RtlAllocateHeap () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#5 0x000000014011dc3a in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 2
[Switching to thread 2 (Thread 4492.0x156c)]
(gdb) bt
#0 0x0000000077649e5a in ntdll!ZwWaitForMultipleObjects () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1 0x00000000775f323d in ntdll!RtlIsCriticalSectionLockedByThread () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#2 0x00000000773d59cd in KERNEL32!BaseThreadInitThunk () from /cygdrive/c/Windows/system32/kernel32.dll
#3 0x000000007763383d in ntdll!RtlUserThreadStart () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#4 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 3
[Switching to thread 3 (Thread 4492.0x1978)]
(gdb) bt
#0 0x000000007764993a in ntdll!ZwRemoveIoCompletion () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1 0x000007fefd5816ad in KERNELBASE!GetQueuedCompletionStatus () from /cygdrive/c/Windows/system32/KERNELBASE.dll
#2 0x00000000773c9991 in KERNEL32!GetQueuedCompletionStatus () from /cygdrive/c/Windows/system32/kernel32.dll
#3 0x000007fef05d4edc in ?? () from /cygdrive/c/Windows/system32/msmpi.dll
#4 0x000007fef05c82ac in ?? () from /cygdrive/c/Windows/system32/msmpi.dll
#5 0x000007fef0662805 in pmpi_conversion_fn_null__ () from /cygdrive/c/Windows/system32/msmpi.dll
#6 0x000007fef06191f5 in msmpi!PMPI_Iprobe () from /cygdrive/c/Windows/system32/msmpi.dll
#7 0x000000013ffb535b in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 4
Thread ID 4 not known.
Original date: 2018-07-30 20:20:02
Using my local machine, I managed to compile mpi-win-x86_64-smp-gcc
with Microsoft MPI 9.1 and MinGW-w64 GCC 7.3.0 after fighting with things for a while.
https://www.symscape.com/configure-msmpi-v8-1-for-mingw-w64 was helpful for this, but I needed to patch to Charm++, and also handle use of MSVC-specific __int64
in mpi.h.
In this configuration I could get real backtraces by running "C:\Program Files\Microsoft MPI\Bin\mpiexec.exe" -n 2 cmd.exe /c start C:\MinGW-w64\mingw64\bin\gdb.exe --args pgm.exe
.
(gdb) p _Cmi_numnodes
$8 = 2
(gdb) p _Cmi_mynode
$9 = 0
(gdb) t 1
[Switching to thread 1 (Thread 12652.0x2ecc)]
(gdb) bt
#0 0x00007fff085a9f34 in ntdll!ZwWaitForSingleObject () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1 0x00007fff04b79252 in WaitForSingleObjectEx () from C:\WINDOWS\System32\KernelBase.dll
#2 0x00000000004982fc in CmiIdleLock_sleep (l=0x9795838, msTimeout=10) at machine-smp.c:588
#3 0x000000000049b1a4 in CmiNotifyStillIdle (s=0x986af30) at machine-common-core.C:1900
#4 0x00000000004a5401 in call_cblist_keep (l=0x97ae0b0, curWallTime=1.4869999885559082) at conv-conds.c:160
#5 0x00000000004a647f in CcdRaiseCondition (condnum=2) at conv-conds.c:524
#6 0x000000000049f380 in CsdStillIdle () at convcore.c:1624
#7 0x000000000049f7f9 in CsdScheduleForever () at convcore.c:1945
#8 0x000000000049f70b in CsdScheduler (maxmsgs=-1) at convcore.c:1861
#9 0x000000000049acc6 in ConverseRunPE (everReturn=0) at machine-common-core.C:1601
#10 0x000000000049a92a in ConverseInit (argc=1, argv=0xb1ab0, fn=0x411189 <_initCharm(int, char**)>, usched=0,
initret=0) at machine-common-core.C:1494
#11 0x000000000041391b in charm_main (argc=1, argv=0xb1ab0) at init.C:1701
#12 0x000000000040f15f in main (argc=1, argv=0xb1ab0) at main.C:5
(gdb) bt full
#0 0x00007fff085a9f34 in ntdll!ZwWaitForSingleObject () from C:\WINDOWS\SYSTEM32\ntdll.dll
No symbol table info available.
#1 0x00007fff04b79252 in WaitForSingleObjectEx () from C:\WINDOWS\System32\KernelBase.dll
No symbol table info available.
#2 0x00000000004982fc in CmiIdleLock_sleep (l=0x9795838, msTimeout=10) at machine-smp.c:588
No locals.
#3 0x000000000049b1a4 in CmiNotifyStillIdle (s=0x986af30) at machine-common-core.C:1900
No locals.
#4 0x00000000004a5401 in call_cblist_keep (l=0x97ae0b0, curWallTime=1.4869999885559082) at conv-conds.c:160
old = -2
unused = 0
i = 0
len = 1
idx = 0
#5 0x00000000004a647f in CcdRaiseCondition (condnum=2) at conv-conds.c:524
curWallTime = 1.4869999885559082
#6 0x000000000049f380 in CsdStillIdle () at convcore.c:1624
No locals.
#7 0x000000000049f7f9 in CsdScheduleForever () at convcore.c:1945
isIdle = 1
msg = 0x0
CsdStopFlag_ptr = 0x985c0c0
cycle = 0
state = {localQ = 0x9796210, nodeQ = 0x9860920, schedQ = 0x985c1c0, localCounter = 0x985c140,
nodeLock = 0x26c}
#8 0x000000000049f70b in CsdScheduler (maxmsgs=-1) at convcore.c:1861
No locals.
#9 0x000000000049acc6 in ConverseRunPE (everReturn=0) at machine-common-core.C:1601
cs = 0x9795820
CmiMyArgv = 0xb1ab0
#10 0x000000000049a92a in ConverseInit (argc=1, argv=0xb1ab0, fn=0x411189 <_initCharm(int, char**)>, usched=0,
initret=0) at machine-common-core.C:1494
_ii = 0
tmp = <optimized out>
stdoutbase = 0x0
stdoutpath = 0xb0000 ""
ppnSet = 0
npes = 1
plusPSet = 0
auto_provision = 0
onewth_per_host = 0
onewth_per_socket = 0
onewth_per_core = 0
onewth_per_pu = 0
onewth_active = 0
#11 0x000000000041391b in charm_main (argc=1, argv=0xb1ab0) at init.C:1701
stack_top = 0
#12 0x000000000040f15f in main (argc=1, argv=0xb1ab0) at main.C:5
No locals.
(gdb) t 2
[Switching to thread 2 (Thread 12652.0x330c)]
(gdb) bt
#0 0x00007fff085ad804 in ntdll!ZwWaitForWorkViaWorkerFactory () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1 0x00007fff0852f856 in ntdll!RtlReleaseSRWLockExclusive () from C:\WINDOWS\SYSTEM32\ntdll.dll
#2 0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#3 0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#4 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 3
[Switching to thread 3 (Thread 12652.0x2bcc)]
(gdb) bt
#0 0x00007fff085ad804 in ntdll!ZwWaitForWorkViaWorkerFactory () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1 0x00007fff0852f856 in ntdll!RtlReleaseSRWLockExclusive () from C:\WINDOWS\SYSTEM32\ntdll.dll
#2 0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#3 0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#4 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 4
[Switching to thread 4 (Thread 12652.0x18dc)]
(gdb) bt
#0 0x00007fff085ad804 in ntdll!ZwWaitForWorkViaWorkerFactory () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1 0x00007fff0852f856 in ntdll!RtlReleaseSRWLockExclusive () from C:\WINDOWS\SYSTEM32\ntdll.dll
#2 0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#3 0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#4 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 5
[Switching to thread 5 (Thread 12652.0x24bc)]
(gdb) bt
#0 0x00007fff085ad804 in ntdll!ZwWaitForWorkViaWorkerFactory () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1 0x00007fff0852f856 in ntdll!RtlReleaseSRWLockExclusive () from C:\WINDOWS\SYSTEM32\ntdll.dll
#2 0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#3 0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#4 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 6
Unknown thread 6.
(gdb) p _Cmi_numnodes
$5 = 2
(gdb) p _Cmi_mynode
$6 = 1
(gdb) t 1
[Switching to thread 1 (Thread 5040.0x152c)]
(gdb) bt
#0 0x00007fff085a9f34 in ntdll!ZwWaitForSingleObject () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1 0x00007fff04b79252 in WaitForSingleObjectEx () from C:\WINDOWS\System32\KernelBase.dll
#2 0x00000000004982fc in CmiIdleLock_sleep (l=0x98a44f8, msTimeout=10) at machine-smp.c:588
#3 0x000000000049b1a4 in CmiNotifyStillIdle (s=0x997a2f0) at machine-common-core.C:1900
#4 0x00000000004a5401 in call_cblist_keep (l=0x98bed10, curWallTime=13.652999877929688) at conv-conds.c:160
#5 0x00000000004a647f in CcdRaiseCondition (condnum=2) at conv-conds.c:524
#6 0x000000000049f380 in CsdStillIdle () at convcore.c:1624
#7 0x000000000049f7f9 in CsdScheduleForever () at convcore.c:1945
#8 0x000000000049f70b in CsdScheduler (maxmsgs=-1) at convcore.c:1861
#9 0x000000000049acc6 in ConverseRunPE (everReturn=0) at machine-common-core.C:1601
#10 0x000000000049a92a in ConverseInit (argc=1, argv=0x1d1ab0, fn=0x411189 <_initCharm(int, char**)>, usched=0,
initret=0) at machine-common-core.C:1494
#11 0x000000000041391b in charm_main (argc=1, argv=0x1d1ab0) at init.C:1701
#12 0x000000000040f15f in main (argc=1, argv=0x1d1ab0) at main.C:5
(gdb) bt full
#0 0x00007fff085a9f34 in ntdll!ZwWaitForSingleObject () from C:\WINDOWS\SYSTEM32\ntdll.dll
No symbol table info available.
#1 0x00007fff04b79252 in WaitForSingleObjectEx () from C:\WINDOWS\System32\KernelBase.dll
No symbol table info available.
#2 0x00000000004982fc in CmiIdleLock_sleep (l=0x98a44f8, msTimeout=10) at machine-smp.c:588
No locals.
#3 0x000000000049b1a4 in CmiNotifyStillIdle (s=0x997a2f0) at machine-common-core.C:1900
No locals.
#4 0x00000000004a5401 in call_cblist_keep (l=0x98bed10, curWallTime=13.652999877929688) at conv-conds.c:160
old = -2
unused = 0
i = 0
len = 1
idx = 0
#5 0x00000000004a647f in CcdRaiseCondition (condnum=2) at conv-conds.c:524
curWallTime = 13.652999877929688
#6 0x000000000049f380 in CsdStillIdle () at convcore.c:1624
No locals.
#7 0x000000000049f7f9 in CsdScheduleForever () at convcore.c:1945
isIdle = 1
msg = 0x0
CsdStopFlag_ptr = 0x9966ae0
cycle = 0
state = {localQ = 0x98a4ed0, nodeQ = 0x996fb60, schedQ = 0x996b400, localCounter = 0x996b380,
nodeLock = 0x26c}
#8 0x000000000049f70b in CsdScheduler (maxmsgs=-1) at convcore.c:1861
No locals.
#9 0x000000000049acc6 in ConverseRunPE (everReturn=0) at machine-common-core.C:1601
cs = 0x98a44e0
CmiMyArgv = 0x1d1ab0
#10 0x000000000049a92a in ConverseInit (argc=1, argv=0x1d1ab0, fn=0x411189 <_initCharm(int, char**)>, usched=0,
initret=0) at machine-common-core.C:1494
_ii = 0
tmp = <optimized out>
stdoutbase = 0x0
stdoutpath = 0x1d0000 ""
ppnSet = 0
npes = 1
plusPSet = 0
auto_provision = 0
onewth_per_host = 0
onewth_per_socket = 0
onewth_per_core = 0
onewth_per_pu = 0
onewth_active = 0
#11 0x000000000041391b in charm_main (argc=1, argv=0x1d1ab0) at init.C:1701
stack_top = 0
#12 0x000000000040f15f in main (argc=1, argv=0x1d1ab0) at main.C:5
No locals.
(gdb) t 2
[Switching to thread 2 (Thread 5040.0x15a4)]
(gdb) bt
#0 0x00007fff085ad804 in ntdll!ZwWaitForWorkViaWorkerFactory () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1 0x00007fff0852f856 in ntdll!RtlReleaseSRWLockExclusive () from C:\WINDOWS\SYSTEM32\ntdll.dll
#2 0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#3 0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#4 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 3
[Switching to thread 3 (Thread 5040.0x1554)]
(gdb) bt
#0 0x00007fff085ad804 in ntdll!ZwWaitForWorkViaWorkerFactory () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1 0x00007fff0852f856 in ntdll!RtlReleaseSRWLockExclusive () from C:\WINDOWS\SYSTEM32\ntdll.dll
#2 0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#3 0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#4 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 4
[Switching to thread 4 (Thread 5040.0xc8)]
(gdb) bt
#0 0x00007fff085ad804 in ntdll!ZwWaitForWorkViaWorkerFactory () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1 0x00007fff0852f856 in ntdll!RtlReleaseSRWLockExclusive () from C:\WINDOWS\SYSTEM32\ntdll.dll
#2 0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#3 0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#4 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 5
[Switching to thread 5 (Thread 5040.0x954)]
(gdb) bt
#0 0x00007fff085ad804 in ntdll!ZwWaitForWorkViaWorkerFactory () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1 0x00007fff0852f856 in ntdll!RtlReleaseSRWLockExclusive () from C:\WINDOWS\SYSTEM32\ntdll.dll
#2 0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#3 0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#4 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 6
[Switching to thread 6 (Thread 5040.0x17b4)]
(gdb) bt
#0 0x0000000064943d68 in pthread_spin_lock () from C:\MinGW-w64\mingw64\opt\bin\libwinpthread-1.dll
#1 0x00000000649450e6 in pthread_getspecific () from C:\MinGW-w64\mingw64\opt\bin\libwinpthread-1.dll
#2 0x000000006144d801 in libgcc_s_seh-1!.emutls_get_address () from C:\MinGW-w64\mingw64\opt\bin\libgcc_s_seh-1.dll
#3 0x000000000049bd83 in ReleasePostedMessages () at machine.C:606
#4 0x000000000049c60a in LrtsAdvanceCommunication (whenidle=1) at machine.C:1188
#5 0x000000000049ad00 in AdvanceCommunication (whenidle=1) at machine-common-core.C:1621
#6 0x000000000049ad1c in CommunicationServer (sleepTime=5) at machine-common-core.C:1646
#7 0x000000000049ad74 in CommunicationServerThread (sleepTime=5) at machine-common-core.C:1665
#8 0x000000000049ac8c in ConverseRunPE (everReturn=0) at machine-common-core.C:1596
#9 0x0000000000497edf in call_startfn (vindex=0x1) at machine-smp.c:159
#10 0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#11 0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#12 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) bt full
#0 0x0000000064943d68 in pthread_spin_lock () from C:\MinGW-w64\mingw64\opt\bin\libwinpthread-1.dll
No symbol table info available.
#1 0x00000000649450e6 in pthread_getspecific () from C:\MinGW-w64\mingw64\opt\bin\libwinpthread-1.dll
No symbol table info available.
#2 0x000000006144d801 in libgcc_s_seh-1!.emutls_get_address () from C:\MinGW-w64\mingw64\opt\bin\libgcc_s_seh-1.dll
No symbol table info available.
#3 0x000000000049bd83 in ReleasePostedMessages () at machine.C:606
msg_tmp = 0x0
prev = 0x0
temp = 0x0
done = 1
sts = {internal = {168623584, 0}, MPI_SOURCE = 4834491, MPI_TAG = 0, MPI_ERROR = 0}
#4 0x000000000049c60a in LrtsAdvanceCommunication (whenidle=1) at machine.C:1188
No locals.
#5 0x000000000049ad00 in AdvanceCommunication (whenidle=1) at machine-common-core.C:1621
doProcessBcast = 1
#6 0x000000000049ad1c in CommunicationServer (sleepTime=5) at machine-common-core.C:1646
No locals.
#7 0x000000000049ad74 in CommunicationServerThread (sleepTime=5) at machine-common-core.C:1665
No locals.
#8 0x000000000049ac8c in ConverseRunPE (everReturn=0) at machine-common-core.C:1596
cs = 0x98a4508
CmiMyArgv = 0x98a5cd0
#9 0x0000000000497edf in call_startfn (vindex=0x1) at machine-smp.c:159
index = 1
state = 0x98a4508
#10 0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
No symbol table info available.
#11 0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
No symbol table info available.
#12 0x0000000000000000 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 7
[Switching to thread 7 (Thread 5040.0x4e4)]
#0 0x00007fff04bd49bf in TlsGetValue () from C:\WINDOWS\System32\KernelBase.dll
(gdb) bt
#0 0x00007fff04bd49bf in TlsGetValue () from C:\WINDOWS\System32\KernelBase.dll
#1 0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#2 0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#3 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) t 8
Unknown thread 8.
Does anyone see anything clearly wrong here? The spinlock in node 1, thread 6 stands out to me.
Original date: 2018-07-30 20:40:29
False alarm on the spinlock. It looks like that thread is the comm thread, and it is trying to pump messages:
(gdb) t 6
[Switching to thread 6 (Thread 5040.0x17b4)]
(gdb) bt
#0 0x00007fff085a9fd4 in ntdll!ZwRemoveIoCompletion () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1 0x00007fff04b92e92 in KERNELBASE!GetQueuedCompletionStatus () from C:\WINDOWS\System32\KernelBase.dll
#2 0x00007ffed7e0039c in pmpi_type_create_f90_integer__ () from C:\WINDOWS\SYSTEM32\msmpi.dll
#3 0x00007ffed7deb933 in pmpi_type_create_f90_integer__ () from C:\WINDOWS\SYSTEM32\msmpi.dll
#4 0x00007ffed7e69245 in msmpi!PMPI_Cancel () from C:\WINDOWS\SYSTEM32\msmpi.dll
#5 0x00007ffed7e4b1c6 in msmpi!PMPI_Iprobe () from C:\WINDOWS\SYSTEM32\msmpi.dll
#6 0x000000000049c110 in PumpMsgs () at machine.C:841
#7 0x000000000049c605 in LrtsAdvanceCommunication (whenidle=1) at machine.C:1182
#8 0x000000000049ad00 in AdvanceCommunication (whenidle=1) at machine-common-core.C:1621
#9 0x000000000049ad1c in CommunicationServer (sleepTime=5) at machine-common-core.C:1646
#10 0x000000000049ad74 in CommunicationServerThread (sleepTime=5) at machine-common-core.C:1665
#11 0x000000000049ac8c in ConverseRunPE (everReturn=0) at machine-common-core.C:1596
#12 0x0000000000497edf in call_startfn (vindex=0x1) at machine-smp.c:159
#13 0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#14 0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#15 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) bt full
#0 0x00007fff085a9fd4 in ntdll!ZwRemoveIoCompletion () from C:\WINDOWS\SYSTEM32\ntdll.dll
No symbol table info available.
#1 0x00007fff04b92e92 in KERNELBASE!GetQueuedCompletionStatus () from C:\WINDOWS\System32\KernelBase.dll
No symbol table info available.
#2 0x00007ffed7e0039c in pmpi_type_create_f90_integer__ () from C:\WINDOWS\SYSTEM32\msmpi.dll
No symbol table info available.
#3 0x00007ffed7deb933 in pmpi_type_create_f90_integer__ () from C:\WINDOWS\SYSTEM32\msmpi.dll
No symbol table info available.
#4 0x00007ffed7e69245 in msmpi!PMPI_Cancel () from C:\WINDOWS\SYSTEM32\msmpi.dll
No symbol table info available.
#5 0x00007ffed7e4b1c6 in msmpi!PMPI_Iprobe () from C:\WINDOWS\SYSTEM32\msmpi.dll
No symbol table info available.
#6 0x000000000049c110 in PumpMsgs () at machine.C:841
doSyncRecv = 1
nbytes = 144
flg = 0
res = 160053872
msg = 0x49c4bb <SendMsgBuf()+30> "H%EoHƒ}o"
sts = {internal = {144, 0}, MPI_SOURCE = 0, MPI_TAG = 1375, MPI_ERROR = 169680320}
recd = 0
recvBufferTmp = 0x0
prev = 0x0
temp = 0x7fff04b79252 <WaitForSingleObjectEx+162>
opDone = 0
allOpsDone = 61693
i = 1300022429
#7 0x000000000049c605 in LrtsAdvanceCommunication (whenidle=1) at machine.C:1182
No locals.
#8 0x000000000049ad00 in AdvanceCommunication (whenidle=1) at machine-common-core.C:1621
doProcessBcast = 1
#9 0x000000000049ad1c in CommunicationServer (sleepTime=5) at machine-common-core.C:1646
No locals.
#10 0x000000000049ad74 in CommunicationServerThread (sleepTime=5) at machine-common-core.C:1665
No locals.
#11 0x000000000049ac8c in ConverseRunPE (everReturn=0) at machine-common-core.C:1596
cs = 0x98a4508
CmiMyArgv = 0x98a5cd0
#12 0x0000000000497edf in call_startfn (vindex=0x1) at machine-smp.c:159
index = 1
state = 0x98a4508
#13 0x00007fff08123034 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
No symbol table info available.
#14 0x00007fff08581431 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
No symbol table info available.
#15 0x0000000000000000 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Original date: 2018-07-30 20:58:05
Some verbose tracing messages from mpiexec:
"C:\Program Files\Microsoft MPI\Bin\mpiexec.exe" -debug 3 -n 2 cmd.exe /c start C:\MinGW-w64\mingw64\bin\gdb.exe --args pgm.exe
[00:13156] host tree:
[00:13156] host: DESKTOP-760DEQ3, parent: 0, id: 1
[00:13156] mpiexec started smpd manager listening on port 78ef4187-9072-4460-a4a1-8b12ea139fe3
[00:13156] create manager process (using mpiexec credentials)
[00:13156] Lauching smpd as 'C:\Program Files\Microsoft MPI\Bin\smpd.exe "C:\Program Files\Microsoft MPI\Bin\smpd.exe" -p 8677 -d 11 -mgr 272 "job" -localonly'
[00:13156] smpd reading the port string from the manager
[-1:11980] Launching smpd manager instance.
[-1:11980] created set for manager listener 128
[-1:11980] smpd manager listening on port 949baea4-a466-4bb7-88ba-14f3dfab8d36
[00:13156] closing the pipe to the manager
[00:13156] DESKTOP-760DEQ3 posting a re-connect to DESKTOP-760DEQ3:949baea4-a466-4bb7-88ba-14f3dfab8d36 in left child context.
[-1:11980] version check complete, using PMP version 4.
[-1:11980] Received session header from parent id=1, parent=0, level=0
[01:11980] Connecting back to parent using host DESKTOP-760DEQ3 and endpoint 78ef4187-9072-4460-a4a1-8b12ea139fe3
[00:13156] version check complete, using PMP version 4.
[00:13156] posting command SMPD_COLLECT to left child, src=0, dest=1.
[01:11980] handling command SMPD_COLLECT src=0
[00:13156] Handling cmd=SMPD_COLLECT result
[00:13156] cmd=SMPD_COLLECT result will be handled locally
[00:13156] Finished collecting hardware summary.
[00:13156] posting command SMPD_STARTDBS to left child, src=0, dest=1.
[01:11980] handling command SMPD_STARTDBS src=0
[01:11980] sending start_dbs result command kvs = e7b71c4a-4faf-456c-b31e-cc093f6ec6ae.
[00:13156] Handling cmd=SMPD_STARTDBS result
[00:13156] cmd=SMPD_STARTDBS result will be handled locally
[00:13156] start_dbs succeeded, kvs_name: 'e7b71c4a-4faf-456c-b31e-cc093f6ec6ae', domain_name: '78246d54-4e0c-4897-a2b3-b28534f4602a'
[00:13156] creating a process group of size 2 on node 0 called e7b71c4a-4faf-456c-b31e-cc093f6ec6ae
[00:13156] launching the processes.
[00:13156] posting command SMPD_LAUNCH to left child, src=0, dest=1.
[01:11980] handling command SMPD_LAUNCH src=0
[01:11980] Successfully handled bcast nodeids command.
[01:11980] setting environment variable: <MPIEXEC_HOSTNAME> = <DESKTOP-760DEQ3>
[01:11980] env: PMI_SIZE=2
[01:11980] env: PMI_KVS=e7b71c4a-4faf-456c-b31e-cc093f6ec6ae
[01:11980] env: PMI_DOMAIN=78246d54-4e0c-4897-a2b3-b28534f4602a
[01:11980] env: PMI_HOST=localhost
[01:11980] env: PMI_PORT=949baea4-a466-4bb7-88ba-14f3dfab8d36
[01:11980] env: PMI_SMPD_ID=1
[01:11980] env: PMI_APPNUM=0
[01:11980] env: PMI_SPAWN=0
[01:11980] env: PMI_NODE_IDS=s
[01:11980] env: PMI_RANK_AFFINITIES=affinity_region_11980
[01:11980] searching for 'cmd.exe' in workdir 'C:\msys64\home\Evan\charm\mpi-win-x86_64-smp-gcc\tests\charm++\pingpong'
[01:11980] searching for 'cmd.exe' in path ''
[01:11980] searching for 'cmd.exe' in system path
[01:11980] C:\msys64\home\Evan\charm\mpi-win-x86_64-smp-gcc\tests\charm++\pingpong>CreateProcess(C:\WINDOWS\SYSTEM32\cmd.exe cmd.exe /c start C:\MinGW-w64\mingw64\bin\gdb.exe --args pgm.exe)
[01:11980] env: PMI_RANK=1
[01:11980] env: PMI_SMPD_KEY=0
[01:11980] C:\msys64\home\Evan\charm\mpi-win-x86_64-smp-gcc\tests\charm++\pingpong>CreateProcess(C:\WINDOWS\SYSTEM32\cmd.exe cmd.exe /c start C:\MinGW-w64\mingw64\bin\gdb.exe --args pgm.exe)
[01:11980] env: PMI_RANK=0
[01:11980] env: PMI_SMPD_KEY=1
[00:13156] Handling cmd=SMPD_LAUNCH result
[00:13156] cmd=SMPD_LAUNCH result will be handled locally
[00:13156] successfully launched process 1
[00:13156] successfully launched process 0
[00:13156] root process launched, starting stdin redirection.
[01:11980] version check complete, using PMP version 4.
[01:11980] forwarding command SMPD_INIT src=1 ctx_key=1
[01:11980] 1 -> 0 : returning parent_context: 0 < 1
[01:11980] posting command SMPD_INIT to parent, src=1, ctx_key=1, dest=0.
[00:13156] handling command SMPD_INIT src=1 ctx_key=1
[00:13156] init: 0:2:e7b71c4a-4faf-456c-b31e-cc093f6ec6ae
[01:11980] Handling cmd=SMPD_INIT result
[01:11980] forward SMPD_INIT result to dest=1 ctx_key=1
[01:11980] handling command SMPD_BCPUT src=1 ctx_key=1
[01:11980] Handling SMPD_BCPUT command from smpd 1
ctx_key=1
rank=0
value=port=0 description="[redacted IPs] DESKTOP-760DEQ3 " shm_host=DESKTOP-760DEQ3 shm_queue=9452:312
result=success
[01:11980] handling command SMPD_BARRIER src=1 ctx_key=1
[01:11980] Handling SMPD_BARRIER src=1 ctx_key=1
[01:11980] initializing barrier(e7b71c4a-4faf-456c-b31e-cc093f6ec6ae): in=1 size=2
[01:11980] incrementing barrier(e7b71c4a-4faf-456c-b31e-cc093f6ec6ae) incount from 0 to 1 out of 2
[01:11980] version check complete, using PMP version 4.
[01:11980] forwarding command SMPD_INIT src=1 ctx_key=0
[01:11980] 1 -> 0 : returning parent_context: 0 < 1
[01:11980] posting command SMPD_INIT to parent, src=1, ctx_key=0, dest=0.
[00:13156] handling command SMPD_INIT src=1 ctx_key=0
[00:13156] init: 1:2:e7b71c4a-4faf-456c-b31e-cc093f6ec6ae
[01:11980] Handling cmd=SMPD_INIT result
[01:11980] forward SMPD_INIT result to dest=1 ctx_key=0
[01:11980] handling command SMPD_BCPUT src=1 ctx_key=0
[01:11980] Handling SMPD_BCPUT command from smpd 1
ctx_key=0
rank=1
value=port=0 description="[redacted IPs] DESKTOP-760DEQ3 " shm_host=DESKTOP-760DEQ3 shm_queue=9376:312
result=success
[01:11980] handling command SMPD_BARRIER src=1 ctx_key=0
[01:11980] Handling SMPD_BARRIER src=1 ctx_key=0
[01:11980] incrementing barrier(e7b71c4a-4faf-456c-b31e-cc093f6ec6ae) incount from 1 to 2 out of 2
[01:11980] all in barrier, release the barrier.
[01:11980] sending reply to barrier command 'e7b71c4a-4faf-456c-b31e-cc093f6ec6ae'.
[01:11980] sending reply to barrier command 'e7b71c4a-4faf-456c-b31e-cc093f6ec6ae'.
[01:11980] handling command SMPD_BCGET src=1 ctx_key=1
[01:11980] Handling SMPD_BCGET command from smpd 1
ctx_key=1
rank=1
value=port=0 description="[redacted IPs] DESKTOP-760DEQ3 " shm_host=DESKTOP-760DEQ3 shm_queue=9376:312
result=success
[01:11980] handling command SMPD_BCGET src=1 ctx_key=0
[01:11980] Handling SMPD_BCGET command from smpd 1
ctx_key=0
rank=0
value=port=0 description="[redacted IPs] DESKTOP-760DEQ3 " shm_host=DESKTOP-760DEQ3 shm_queue=9452:312
result=success
[01:11980] handling command SMPD_BCGET src=1 ctx_key=0
[01:11980] Handling SMPD_BCGET command from smpd 1
ctx_key=0
rank=0
value=port=0 description="[redacted IPs] DESKTOP-760DEQ3 " shm_host=DESKTOP-760DEQ3 shm_queue=9452:312
result=success
Original date: 2018-07-30 22:35:42
If I change warmupRun = true;
to warmupRun = false;
, then pingpong prints that 1D Arrays and 1D Threaded Arrays tests complete, but execution gets no further. The next test is 1D Arrays with the zero copy message send API. If I set a breakpoint on Ping1::recv_zerocopy
, it is never reached. LrtsIssueRgets
is reached on the receiver but this does not lead to the chare method running.
The MPI machine layer does not compile if I change CMK_ONESIDED_IMPL
and CMK_ONESIDED_DIRECT_IMPL
to 0 for testing.
Original date: 2018-07-31 20:44:10
I am now comparing mpi-win-x86_64-smp
to mpi-linux-x86_64-smp
. My comparable launch line on Linux is mpiexec -n 2 xterm -e gdb ./pgm
, or more specifically, mpiexec -n 2 xterm -e gdb -ex "b Ping1::recv_zerocopy" -ex "b CProxyElement_Ping1::recv_zerocopy" -ex "b LrtsIssueRgets" -ex r ./pgm
.
Original date: 2018-07-31 21:04:28
Handing it off to Nitin.
Original date: 2018-08-06 18:56:13
Fix: https://charm.cs.illinois.edu/gerrit/#/c/charm/+/4450/ https://github.com/UIUC-PPL/charm/commit/bf8f98b377ef95456a43682ee5d4ad749fe00cbb
The hang was because MPI was initialized in MPI_THREAD_SINGLE mode, but instead of the comm. thread, the worker thread was posting the recv buffer. The fix was to have the comm thread post the buffer instead of the worker thread.
This commit also includes another fix related to the zerocopy api used for nodegroups. For nodegroups, the pingpong example crashed for mpi-win-x86_64-smp when run on 1 PE because of a stack overflow. This was due to inlining of the send method causing recursive calls which caused stack overflow when run with a large number of iterations (~900). The fix was to not use inlining in this code path.
Original issue: https://charm.cs.illinois.edu/redmine/issues/1947
Link to the failed build: http://charm.cs.illinois.edu/autobuild/old.2018_07_16__01_01/mpi-win-x86_64-smp.txt
In addition to the 3 process run, I also found that it hangs for the 4 process run.