barchart / barchart-udt

Java wrapper for native C++ UDT protocol.
https://github.com/barchart/barchart-udt/wiki

Deadlock (?) between garbage collection and RcvQueue worker thread termination #83

Open jrudolph opened 8 years ago

jrudolph commented 8 years ago

We observe a situation where UDT completely hangs with many threads stuck waiting for the m_ControlLock.

At this point the lock is held by the garbage collection thread (in checkBrokenSockets), which is waiting for a receive queue worker thread to terminate:

(gdb) bt
#0  0x00007f5b9f593ef7 in pthread_join (threadid=140028744247040, thread_return=0x0) at pthread_join.c:92
#1  0x00007f5b5c3b6221 in CRcvQueue::~CRcvQueue() () from /tmp/udt_jndi_lib/lib/amd64-Linux-gpp/jni/libbarchart-udt-core-2.3.0-SNAPSHOT.so
#2  0x00007f5b5c39b0bd in CUDTUnited::removeSocket(int) () from /tmp/udt_jndi_lib/lib/amd64-Linux-gpp/jni/libbarchart-udt-core-2.3.0-SNAPSHOT.so
#3  0x00007f5b5c39baa2 in CUDTUnited::checkBrokenSockets() () from /tmp/udt_jndi_lib/lib/amd64-Linux-gpp/jni/libbarchart-udt-core-2.3.0-SNAPSHOT.so
#4  0x00007f5b5c39bc64 in CUDTUnited::garbageCollect(void*) () from /tmp/udt_jndi_lib/lib/amd64-Linux-gpp/jni/libbarchart-udt-core-2.3.0-SNAPSHOT.so
#5  0x00007f5b9f592dc5 in start_thread (arg=0x7f5b17fff700) at pthread_create.c:308
#6  0x00007f5b9eea628d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) frame 0
#0  0x00007f5b9f593ef7 in pthread_join (threadid=140028744247040, thread_return=0x0) at pthread_join.c:92
92      lll_wait_tid (pd->tid);
(gdb) print pd->tid
$3 = 17122

The worker thread (LWP 17122, matching the pd->tid printed above) seems to be stuck in recvmsg:

Thread 7 (Thread 0x7f5afb8f2700 (LWP 17122)):
#0  0x00007f5b9f59967d in recvmsg () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f5b5c3a0b2b in CChannel::recvfrom(sockaddr*, CPacket&) const () from /tmp/udt_jndi_lib/lib/amd64-Linux-gpp/jni/libbarchart-udt-core-2.3.0-SNAPSHOT.so
#2  0x00007f5b5c3b6fee in CRcvQueue::worker(void*) () from /tmp/udt_jndi_lib/lib/amd64-Linux-gpp/jni/libbarchart-udt-core-2.3.0-SNAPSHOT.so
#3  0x00007f5b9f592dc5 in start_thread (arg=0x7f5afb8f2700) at pthread_create.c:308
#4  0x00007f5b9eea628d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

This doesn't seem to be a classic deadlock; it may be more of a problem with the blocking recvmsg call.
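To illustrate why the blocking behaviour matters: a worker loop can only notice a shutdown request if its receive call returns from time to time. One common way to arrange that is a receive timeout on the socket. The sketch below is a generic POSIX example, not UDT code; the helper name `recv_returns_after_timeout` is made up, and whether UDT's channel is configured this way is not stated in this report.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <unistd.h>

// Hypothetical helper (not part of UDT): waits for a datagram on a fresh,
// unbound UDP socket that has a receive timeout, and reports whether the
// call came back because the timeout expired.
// timeout_ms must be > 0; a zero timeval means "block forever".
bool recv_returns_after_timeout(int timeout_ms) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return false;

    timeval tv{};
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) != 0) {
        close(fd);
        return false;
    }

    // Nothing is ever sent to this socket, so without the timeout this
    // call would never return.
    char buf[64];
    ssize_t n = recv(fd, buf, sizeof(buf), 0);
    int saved_errno = errno;
    close(fd);

    // On timeout, recv fails with EAGAIN (or EWOULDBLOCK).
    return n < 0 && (saved_errno == EAGAIN || saved_errno == EWOULDBLOCK);
}
```

A socket created elsewhere with default options has no such timeout, so a receive call on it blocks until data arrives, and a thread waiting in pthread_join for the receiving thread waits just as long.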

Does anyone have an idea how this could happen?

jrudolph commented 8 years ago

I suspect that the problem is related to the one described here: https://sourceforge.net/p/udt/discussion/393036/thread/d95e119f/?limit=25#1c43

"By performing the close() before deleting the queues, doesn't this allow for the possibility that between the close() and the queue deletion, a new socket using the old file descriptor could be created in another thread and one or both of the queues could improperly use that new file descriptor? I did not see any synchronization which would prevent this problem. Would moving the channel close() to happen after the queues have been deleted introduce other problems?"

In my case, however, the file descriptor is reused not by UDT but by another part of the application, which opens a completely unrelated TCP socket that is assigned the same file descriptor number. This new socket is perfectly valid, so the recvmsg call on it will happily block, bringing UDT to a complete halt.
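The reuse itself is ordinary POSIX behaviour: a new descriptor gets the lowest number that is currently free, so a number released by close() is typically handed out again right away. A minimal sketch follows; the helper name `descriptor_is_reused` is made up for illustration, and the result assumes no other thread opens or closes descriptors in between.

```cpp
#include <cassert>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

// Hypothetical helper: opens a UDP socket, closes it, then opens an
// unrelated TCP socket. Returns true if the TCP socket was assigned the
// same descriptor number the UDP socket had.
bool descriptor_is_reused() {
    int udp_fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (udp_fd < 0) return false;
    close(udp_fd);  // the number udp_fd is now free again

    int tcp_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (tcp_fd < 0) return false;

    bool same = (tcp_fd == udp_fd);
    close(tcp_fd);
    return same;
}
```

Any code that still holds the old number after the close(), such as a worker thread about to call recvmsg, would now be operating on the TCP socket without noticing.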