Open tvaknin opened 4 months ago
From the provided stack traces it is not clear who holds the `m_GlobControlLock`. All the threads are trying to lock it, but there should be a thread that holds it and is probably trying to lock some other mutex that has already been locked. My gut feeling is that the other lock is `m_LSLock`. Maybe there is one more thread trying to close a socket, locking this mutex.
`CUDTUnited::checkBrokenSockets()` locks `m_GlobControlLock` and eventually calls `CUDTUnited::removeSocket(..)` -> `CUDT::closeInternal()` -> `m_pRcvQueue->removeListener(this)` -> lock `m_LSLock`. For example, here the order would be the opposite:
```
#2 0x00007f954b9ebfb8 in srt::CUDTUnited::newConnection(..)                    -> lock m_GlobControlLock
#3 0x00007f954ba1948d in srt::CUDT::processConnectRequest(srt::sockaddr_any const&, srt::CPacket&)
#4 0x00007f954ba60156 in srt::CRcvQueue::worker_ProcessConnectionRequest(..)   -> lock m_LSLock
#5 0x00007f954ba60ec4 in srt::CRcvQueue::worker(void*) () from ../libs/libsrt.so.1.5
```
But judging from the state of the GC thread, the case described above is not the one happening here.
```
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1 0x00007f6ad3751411 in __GI___pthread_mutex_lock (mutex=0x7f6ad4c1e480 <srt::CUDT::uglobal()::instance+96>)
#2 0x00007f6ad4b53b33 in srt::CUDTUnited::locateSocket(int, srt::CUDTUnited::ErrorHandling)   -> lock m_GlobControlLock
#3 0x00007f6ad4b5bc58 in srt::CUDTUnited::close(int) () from ../libs/libsrt.so.1.5
#4 0x00007f6ad4b5bcec in srt::CUDT::close(int) () from ../libs/libsrt.so.1.5
#5 0x00000000004c1e78 in SrtConnectionManager::CheckForConnection (this=0x10ecea0 <SrtConnectionManager::GetInstance()::inst>)
```

```
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1 0x00007f954a5e0411 in __GI___pthread_mutex_lock (mutex=0x7f954baad480 <srt::CUDT::uglobal()::instance+96>)
#2 0x00007f954b9e2b33 in srt::CUDTUnited::locateSocket(int, srt::CUDTUnited::ErrorHandling)   -> lock m_GlobControlLock
#3 0x00007f954b9e623c in srt::CUDT::sendmsg2(int, char const*, int, SRT_MsgCtrl_&) ()
#4 0x00007f954b9e629f in srt::CUDT::send(int, char const*, int, int)
#5 0x00000000004dad3d in OutSrtSocket::Send (successCnt=0x2f8d484, this=0x7f950194eeb0)
```

```
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1 0x00007f954a5e0411 in __GI___pthread_mutex_lock (mutex=0x7f954baad480 <srt::CUDT::uglobal()::instance+96>)
#2 0x00007f954b9ed00e in srt::CUDTUnited::checkBrokenSockets() () from ../libs/libsrt.so.1.5   -> lock m_GlobControlLock
#3 0x00007f954b9ed6a8 in srt::CUDTUnited::garbageCollect(void*) ()                             -> lock m_GCStopLock
#4 0x00007f954a5ddefc in start_thread (arg=<optimized out>) at pthread_create.c:479
#5 0x00007f9549ec122f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```

```
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1 0x00007f954a5e0411 in __GI___pthread_mutex_lock (mutex=0x7f954baad480 <srt::CUDT::uglobal()::instance+96>)
#2 0x00007f954b9ebfb8 in srt::CUDTUnited::newConnection(..)                    -> lock m_GlobControlLock
#3 0x00007f954ba1948d in srt::CUDT::processConnectRequest(srt::sockaddr_any const&, srt::CPacket&)
#4 0x00007f954ba60156 in srt::CRcvQueue::worker_ProcessConnectionRequest(..)   -> lock m_LSLock
#5 0x00007f954ba60ec4 in srt::CRcvQueue::worker(void*) () from ../libs/libsrt.so.1.5
#6 0x00007f954a5ddefc in start_thread (arg=<optimized out>) at pthread_create.c:479
#7 0x00007f9549ec122f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```
Could this problem be reproduced with having #2893 added?
Thread sanitizer shows a potential deadlock around those functions: `m_GlobControlLock` (`checkBrokenSockets()`) first, and then `m_ConnectionLock` (`CUDT::closeInternal()`).

The `CRcvQueue::worker` thread locks `m_LSLock` first, then `m_GlobControlLock`. The order also does not look correct.

The `CUDTUnited::listen` function locks `m_ConnectionLock`, then `m_LSLock`.

There was some PR where `m_ConnectionLock` was taken out for the call to `closeInternal()`, but I don't remember the details.

The inversion between `m_LSLock` and `m_ConnectionLock` is known, but it was later proven that these two cases can never happen simultaneously - the thread sanitizer merely detects that they may happen in theory. It would be nice to dissolve them, but no sensible way to fix it was found.
> Could this problem be reproduced with having #2893 added?

I can try this next week.
I tried the fix using the branch `dev-add-socket-busy-counter`. I can reproduce the same problem. Let me know if you need more info.
We need somehow to identify who is holding the mutex (see my comment above). Can you check other threads? There should be some thread trying to lock another mutex while `m_GlobControlLock` is locked.

You may try the fix in #2032. I can't find any PR with that fix which does unlocking for the time of `closeInternal()`.
> We need somehow to identify who is holding the mutex (see my comment above). Can you check other threads? There should be some thread trying to lock another mutex while `m_GlobControlLock` is locked.
Attached full BT of all threads.
I see 7 threads calling `srt_sendmsg` and trying to lock `m_GlobControlLock` to locate the socket, one GC thread, and one thread calling `getsockstate`. One `srt_sendmsg` thread (Thread 26) is releasing the mutex, so it does not look like a deadlock, just too many threads waiting for `m_GlobControlLock`.
It feels like `m_GlobControlLock` should at least become a shared mutex, so that access that does not modify anything is cheaper.
See also #2393.
This issue is likely similar to #2252.
Just took a quick look - `m_GlobControlLock` occurs in this log 10 times; there are many threads trying to lock this mutex, and one - OutWorker_4 - is at this moment unlocking it. This mutex is definitely not the reason for a deadlock. It must be something else, likely one of the CVs.
And the only other place where any of the SRT threads is standing is the `CSndUList::waitNonEmpty()` call, which in general waits until there is anything scheduled for sending over the channel for which this worker is running. Should that be a point of deadlock for some reason, we might add the possibility of waiting for a limited time, and on timeout check whether the condition is somehow satisfied, in order to detect waiting failures. (That waiting part is likely unchanged legacy code, and I have doubts whether it should be done this way: the first condition check should be done AFTER locking, and only then, if it is not satisfied, should the waiting function be executed.)
`CSndUList::waitNonEmpty()` gets interrupted either when a new socket is added to the heap, or when the `CSndQueue` is being destroyed.
@ethouris you suggest it may be causing a deadlock in the situation when `m_bClosing` is set to `true` right after the check

```cpp
if (!self->m_bClosing)
{
    self->m_pSndUList->waitNonEmpty();
    IF_DEBUG_HIGHRATE(self->m_WorkerStats.lCondWait++);
}
```
while this notification is also being missed:

```cpp
srt::CSndQueue::~CSndQueue()
{
    m_bClosing = true;

    if (m_pTimer != NULL)
    {
        m_pTimer->interrupt();
    }

    // Unblock CSndQueue worker thread if it is waiting.
    m_pSndUList->signalInterrupt();
```
Hi, let me know if I can help from my side or if some info is needed.
@tvaknin We still only speculate on the possible issue here. Backtraces do not show a deadlock: there is a thread releasing the mutex. Is it a deadlock or #2252?
@maxsharabayko Can you confirm `srt_sendmsg` uses the same mutex that is used for configuration? - because my case is that the neptonManager thread is stuck forever.
What do you mean by "for configuration"?
For example, look at this backtrace. A call to the SRT API `srt_getsockstate` is used (I called it configuration - ignore this name):

```
Thread 39 (Thread 0x7f3876ffd700 (LWP 46389) "neptonManager"):
```

Versus the `srt_sendmsg` calls.
@tvaknin

> @maxsharabayko Can you confirm srt_sendmsg uses the same mutex that is used for configuration? - because my case is that the neptonManager thread is stuck forever.

From your last backtrace, the `neptonManager` thread 39 is waiting to acquire the `m_GlobControlLock`:

```
#1 0x00007f38c168f411 in __GI___pthread_mutex_lock (mutex=0x7f38c2b5c480 <srt::CUDT::uglobal()::instance+96>) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f38c2a9168e in srt::CUDTUnited::getStatus(int) () from /opt/erola/smu/Packagenepton/../libs/libsrt.so.1.5
#3 0x00007f38c2a95a0f in srt::CUDT::getsockstate(int) () from /opt/erola/smu/Packagenepton/../libs/libsrt.so.1.5
```
Thread 26 is just about to release it:

```
#0 __lll_unlock_wake () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:339
#1 0x00007f38c169095e in __pthread_mutex_unlock_usercnt (decr=1, mutex=0x7f38c2b5c480 <srt::CUDT::uglobal()::instance+96>) at pthread_mutex_unlock.c:56
#2 __GI___pthread_mutex_unlock (mutex=0x7f38c2b5c480 <srt::CUDT::uglobal()::instance+96>) at pthread_mutex_unlock.c:356
#3 0x00007f38c2b2e9bc in srt::sync::ScopedLock::~ScopedLock() () from /opt/erola/smu/Packagenepton/../libs/libsrt.so.1.5
#4 0x00007f38c2a77ff1 in srt::CUDTUnited::locateSocket(int, srt::CUDTUnited::ErrorHandling) [clone .cold] () from /opt/erola/smu/Packagenepton/../libs/libsrt.so.1.5
#5 0x00007f38c2a9523c in srt::CUDT::sendmsg2(int, char const*, int, SRT_MsgCtrl_&) () from /opt/erola/smu/Packagenepton/../libs/libsrt.so.1.5
```
But there are 7 other `sendmsg` threads waiting to acquire the `m_GlobControlLock` as well, so the "neptonManager" thread may not have a chance to acquire it in a reasonable time.

If that is the case, then the problem is `m_GlobControlLock` taking too much responsibility (#2393). The logic must be redesigned.
Understood. I see that the neptonManager thread 39 is stuck waiting to acquire `m_GlobControlLock` indefinitely, or at least for the several minutes I observed during debugging. Could this be due to a few threads sending packets (`sendmsg`) and holding `m_GlobControlLock` for too long?

By changing the logic, do you mean modifying how the SRT library uses `m_GlobControlLock` and making the changes described in https://github.com/Haivision/srt/issues/2393?
`m_GlobControlLock` guards all containers in `CUDTUnited`. One of the first steps could be changing its type from a regular mutex to a readers-writer lock. Then several threads could access the containers for reading simultaneously.

We can't use `std::shared_mutex` directly because we have to support C++03. A wrapper has to be added to the `srt::sync` module.
Hi everyone. Potential issue: a thread is stuck waiting to close an SRT socket due to an internal SRT mutex lock.

The backtrace of the blocked thread:
```
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1 0x00007f6ad3751411 in __GI___pthread_mutex_lock (mutex=0x7f6ad4c1e480 <srt::CUDT::uglobal()::instance+96>) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f6ad4b53b33 in srt::CUDTUnited::locateSocket(int, srt::CUDTUnited::ErrorHandling) () from ../libs/libsrt.so.1.5
#3 0x00007f6ad4b5bc58 in srt::CUDTUnited::close(int) () from ../libs/libsrt.so.1.5
#4 0x00007f6ad4b5bcec in srt::CUDT::close(int) () from ../libs/libsrt.so.1.5
#5 0x00000000004c1e78 in SrtConnectionManager::CheckForConnection (this=0x10ecea0 <SrtConnectionManager::GetInstance()::inst>) at muxer/SrtConnectionManager.cpp:244
```
I have a few more threads (~6) that are also stuck on this mutex, e.g.:
```
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1 0x00007f954a5e0411 in __GI___pthread_mutex_lock (mutex=0x7f954baad480 <srt::CUDT::uglobal()::instance+96>) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f954b9e2b33 in srt::CUDTUnited::locateSocket(int, srt::CUDTUnited::ErrorHandling) () from ../libs/libsrt.so.1.5
#3 0x00007f954b9e623c in srt::CUDT::sendmsg2(int, char const*, int, SRT_MsgCtrl_&) () from ../libs/libsrt.so.1.5
#4 0x00007f954b9e629f in srt::CUDT::send(int, char const*, int, int) () from ../libs/libsrt.so.1.5
#5 0x00000000004dad3d in OutSrtSocket::Send (successCnt=0x2f8d484, this=0x7f950194eeb0) at Revioly/OutSrtSocket.cpp:32
```
I also found that the SRT thread "SRT:GC" is waiting on it too:

```
41 Thread 0x7f94feffd700 (LWP 32188) "SRT:GC" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1 0x00007f954a5e0411 in __GI___pthread_mutex_lock (mutex=0x7f954baad480 <srt::CUDT::uglobal()::instance+96>) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f954b9ed00e in srt::CUDTUnited::checkBrokenSockets() () from ../libs/libsrt.so.1.5
#3 0x00007f954b9ed6a8 in srt::CUDTUnited::garbageCollect(void*) () from ../libs/libsrt.so.1.5
#4 0x00007f954a5ddefc in start_thread (arg=<optimized out>) at pthread_create.c:479
#5 0x00007f9549ec122f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```
SRT internal sockets thread example:

```
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1 0x00007f954a5e0411 in __GI___pthread_mutex_lock (mutex=0x7f954baad480 <srt::CUDT::uglobal()::instance+96>) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f954b9ebfb8 in srt::CUDTUnited::newConnection(int, srt::sockaddr_any const&, srt::CPacket const&, srt::CHandShake&, int&, srt::CUDT*&) () from ../libs/libsrt.so.1.5
#3 0x00007f954ba1948d in srt::CUDT::processConnectRequest(srt::sockaddr_any const&, srt::CPacket&) () from ../libs/libsrt.so.1.5
#4 0x00007f954ba60156 in srt::CRcvQueue::worker_ProcessConnectionRequest(srt::CUnit*, srt::sockaddr_any const&) () from ../libs/libsrt.so.1.5
#5 0x00007f954ba60ec4 in srt::CRcvQueue::worker(void*) () from ../libs/libsrt.so.1.5
#6 0x00007f954a5ddefc in start_thread (arg=<optimized out>) at pthread_create.c:479
#7 0x00007f9549ec122f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```
To Reproduce

No problem in SRT versions 1.5.1 or 1.5.2 (I didn't check 1.5.2), but I found the issue in version 1.5.3.

My setup has 70 SRT sockets (sender-listeners), with the SRT latency set to 6000 ms. The trouble starts on the sender-listener device right after the downstream device (receiver-caller) restarts or loses connection.