Closed by jcphill 7 years ago
Original date: 2017-09-21 00:46:34
On 8 nodes of Stampede 2 with OFI layer (non-smp), the following command results in a hang at the end of simulation:
<code class="text">
ibrun -n 512 -o 0 ./namd2 apoa1/apoa1.namd +replicas 2
</code>
Original date: 2017-09-21 21:07:31
I was able to replicate the hang for a smaller case with the following command on Stampede 2:
<code class="text">
ibrun -n 4 -o 0 ./namd2 apoa1/apoa1.namd +replicas 2
</code>
On inspection with gdb I got the following stack trace:
<code class="text">
Program received signal SIGINT, Interrupt.
0x00002aaaab4c023b in pthread_spin_trylock () from /usr/lib64/libpthread.so.0
(gdb) bt
#0 0x00002aaaab4c023b in pthread_spin_trylock ()
from /usr/lib64/libpthread.so.0
#1 0x00002aaaaad350b3 in psmx_cq_poll_mq () from /usr/lib64/libfabric.so.1
#2 0x00002aaaaad35756 in psmx_cq_readfrom () from /usr/lib64/libfabric.so.1
#3 0x0000000000eeda48 in fi_cq_read (cq=0x14cef70, buf=0x7fffffff8920,
count=8) at /usr/include/rdma/fi_eq.h:375
#4 0x0000000000ef2bb2 in process_completion_queue () at machine.c:1162
#5 0x0000000000ef2d54 in LrtsAdvanceCommunication (whileidle=0)
at machine.c:1298
#6 0x0000000000eed5b6 in AdvanceCommunication (whenidle=0)
at machine-common-core.c:1317
#7 0x0000000000eed826 in CmiGetNonLocal () at machine-common-core.c:1487
#8 0x0000000000ef5290 in CsdNextMessage (s=0x7fffffff8bf0) at convcore.c:1779
#9 0x0000000000ef55dc in CsdSchedulePoll () at convcore.c:1970
#10 0x0000000000c49815 in replica_barrier ()
#11 0x0000000000bd6621 in ScriptTcl::run() ()
#12 0x000000000073d44d in after_backend_init(int, char**) ()
#13 0x00000000006c409b in main ()
</code>
This points to the while loop at namd/src/DataExchanger.C:172 inside the replica_barrier function; from there, control is handed to the machine layer. It looks like a pthread_spin_trylock that keeps failing, meaning that whatever acquired the lock is not releasing it, so this process hangs trying to acquire the lock.
Original date: 2017-09-22 01:00:26
This is a non-smp run, right? Are there other threads created by libfabric? Does pthreads require initialization?
Original date: 2017-09-22 14:39:45
Yes. It is a non-smp run. I'm guessing these are pthreads created by libfabric. I've asked the folks from Intel about it.
Original date: 2017-09-22 18:27:29
But are there actually multiple threads launched, or is there just a single thread making pthread calls?
Original date: 2017-09-22 22:01:34
In a single instance, one thread is created. But I'm not sure exactly how replicas work. Will 2 replicas (2 Charm++ instances) cause two user threads to contend for the same resource? If so, the two user threads might be causing the deadlock.
Original date: 2017-09-25 03:48:09
Replicas never share a process; they simply partition the processes among the replicas.
Original date: 2017-10-08 05:14:56
Gerrit patch: https://charm.cs.illinois.edu/gerrit/#/c/3115/ (commit: https://github.com/UIUC-PPL/charm/commit/6d47d435c79b95fc447ab0aec5e61a842c845b45)
Original issue: https://charm.cs.illinois.edu/redmine/issues/1675
Testing NAMD replicas (8 nodes, 2 replicas) on Bridges with the non-smp OFI layer hangs or crashes most of the time in startup phase 7, right after "Info: PME USING 54 GRID NODES AND 54 TRANS NODES".