Closed by jcphill 7 years ago
Original date: 2017-09-21 00:46:34
On 8 nodes of Stampede 2 with OFI layer (non-smp), the following command results in a hang at the end of simulation:
<code class="text">
ibrun -n 512 -o 0 ./namd2 apoa1/apoa1.namd +replicas 2
</code>
Original date: 2017-09-21 21:07:31
I was able to replicate the hang for a smaller case with the following command on Stampede 2:
<code class="text">
ibrun -n 4 -o 0 ./namd2 apoa1/apoa1.namd +replicas 2
</code>
On inspection with gdb I got the following stack trace:
<code class="text">
Program received signal SIGINT, Interrupt.
0x00002aaaab4c023b in pthread_spin_trylock () from /usr/lib64/libpthread.so.0
(gdb) bt
#0 0x00002aaaab4c023b in pthread_spin_trylock ()
from /usr/lib64/libpthread.so.0
#1 0x00002aaaaad350b3 in psmx_cq_poll_mq () from /usr/lib64/libfabric.so.1
#2 0x00002aaaaad35756 in psmx_cq_readfrom () from /usr/lib64/libfabric.so.1
#3 0x0000000000eeda48 in fi_cq_read (cq=0x14cef70, buf=0x7fffffff8920,
count=8) at /usr/include/rdma/fi_eq.h:375
#4 0x0000000000ef2bb2 in process_completion_queue () at machine.c:1162
#5 0x0000000000ef2d54 in LrtsAdvanceCommunication (whileidle=0)
at machine.c:1298
#6 0x0000000000eed5b6 in AdvanceCommunication (whenidle=0)
at machine-common-core.c:1317
#7 0x0000000000eed826 in CmiGetNonLocal () at machine-common-core.c:1487
#8 0x0000000000ef5290 in CsdNextMessage (s=0x7fffffff8bf0) at convcore.c:1779
#9 0x0000000000ef55dc in CsdSchedulePoll () at convcore.c:1970
#10 0x0000000000c49815 in replica_barrier ()
#11 0x0000000000bd6621 in ScriptTcl::run() ()
#12 0x000000000073d44d in after_backend_init(int, char**) ()
#13 0x00000000006c409b in main ()
</code>
This points to the while loop at namd/src/DataExchanger.C:172 inside the replica_barrier function; from there, control is handed to the machine layer. It looks like a pthread_spin_trylock that keeps failing, meaning that whatever acquired the lock is not releasing it, so this process hangs trying to acquire the lock.
Original date: 2017-09-22 01:00:26
This is a non-smp run, right? Are there other threads created by libfabric? Does pthreads require initialization?
Original date: 2017-09-22 14:39:45
Yes. It is a non-smp run. I'm guessing these are pthreads created by libfabric. I've asked the folks from Intel about it.
Original date: 2017-09-22 18:27:29
But are there actually multiple threads launched, or is there just a single thread making pthread calls?
Original date: 2017-09-22 22:01:34
In a single instance, one thread is created. But I'm not sure exactly how replicas work. Will 2 replicas (2 Charm++ instances) cause two user threads to contend for the same resource? If so, the two user threads might be causing the deadlock.
Original date: 2017-09-25 03:48:09
Replicas never share a process; they simply partition the processes among the replicas.
Original date: 2017-10-08 05:14:56
Gerrit patch: https://charm.cs.illinois.edu/gerrit/#/c/3115/ (commit: https://github.com/UIUC-PPL/charm/commit/6d47d435c79b95fc447ab0aec5e61a842c845b45)
Original issue: https://charm.cs.illinois.edu/redmine/issues/1675
Testing NAMD replicas (8 nodes, 2 replicas) on Bridges with the non-smp OFI layer hangs or crashes most of the time in startup phase 7, right after "Info: PME USING 54 GRID NODES AND 54 TRANS NODES".