Haivision / srt

Secure, Reliable, Transport
https://www.srtalliance.org
Mozilla Public License 2.0
3.13k stars 856 forks source link

libsrt should not handle sigint/term and similar #787

Closed toots closed 4 years ago

toots commented 5 years ago

Hi,

While integrating the library with savonet/liquidsoap, I ran into issues where the library catches SIGINT and closes the opened sockets.

This causes concurrency issues in the application. In short terms, the socket is closed by the library, the streaming thread enters a new listening state, gets a new client and starts a blocking decoding loop before the main thread gets a chance to act upon the same signal and tell the streaming source to stop.

I believe that a library should not catch and act upon such signals but instead delegate to the main program operating it the order of operations when such signal is received..

maxsharabayko commented 5 years ago

SIGINT is only handled by sample applications like srt-live-transmit. The library itself does not catch those exceptions. @ethouris, you thoughts?

ethouris commented 5 years ago

No, SRT doesn't operate with signals at all. There's no call to signal or sigaction anywhere.

There is included <csignal> in channel.cpp, but it's probably a mistake - I removed it for test and the linux build passes.

If anyone has caught SIGINT, this definitely wasn't SRT library. The only places where we had any dealing with signals are applications.

Note also that when you get the signal delivered to an application, it's undefined which thread will be interrupted for handling it. You can control it a bit by using pthread_sigmask (on Linux, at least, not sure how it works on Windows). As well as that SIGINT has a default action to terminate the application after the signal.

toots commented 5 years ago

Hi again and thanks for looking at it. I should indeed have looked at the code before filling up this ticket. It looks like the library is not using signals.

However, I have double checked the application and it follows the usual POSIX pattern of:

But even with that, I still see logs such as:

^C2019/08/06 00:07:33 [input.srt_5171:3] Feeding failed: Error(Econnlost,Connection was broken.)

This happens systematically first, and before any other thread in the application get to do anything. I have also confirmed it with calling kill -KILL instead of issuing ctrl+c.

This still looks to me as the application receiving a CONNLOST error right after receiving SIGINT, which prompted my initial reaction.

ethouris commented 5 years ago

Somewhere deep in my memory there is some reported problem with signals messing up with internals of SRT, but I believe this was on Mac. Never found this on Linux. And we had some issue reported for that, but it was fixed.

What I know, which may have something to do with this, is that a signal also interrupts a system call, and make the system function return -1 with errno == EINTR. SRT uses system calls to send and receive data over UDP sockets, but I believe these cases are handled.

If you can reproduce the problem, would you be able to set up logging and grab SRT logs so that we know what is happening inside? Unfortunately the only level that would tell us anything is DEBUG, so you would have to compile SRT with --enable-heavy-logging and set the debug log level from the srt_setloglevel function.

toots commented 5 years ago

Your memory serves you right! I am indeed on OSX.. I did as suggested and here's the log:

2019/08/06 18:50:02 [srt:5] 18:50:02.758311 D:SRT.c: %625214727:RECEIVED: seq=1310140120 offset=3 ACCEPTED (EXPECTED/ORIGINAL) FLAGS: PB_SOLO ORD_RELAXED EK_NOENC SN_ORIGINAL
2019/08/06 18:50:02 [srt:5] 18:50:02.758416 D:SRT.d: crypter: data not encrypted, returning as plain
2019/08/06 18:50:02 [srt:5] 18:50:02.758449 D:SRT.c: updateCC: EVENT:checktimer
2019/08/06 18:50:02 [srt:5] 18:50:02.758458 D:SRT.c: getRcvReadyMsg: found packet, but not ready to play (only in 34.157ms)
2019/08/06 18:50:02 [srt:5] 18:50:02.758503 D:SRT.c: udpateCC: emitting signal for EVENT:checktimer
2019/08/06 18:50:02 [srt:5] 18:50:02.758566 D:SRT.t: %625214727:tsbpd: FUTURE PACKET seq=1310140105 T=18:40:21.706670 - waiting 34.049ms
2019/08/06 18:50:02 [srt:5] 18:50:02.758650 D:SRT.c: updateCC: updated values from congctl: interval=10879tk (10.88us) cgwindow=8.19e+03
2019/08/06 18:50:02 [srt:5] 18:50:02.759177 D:SRT.c: udpateCC: finished handling for EVENT:checktimer
2019/08/06 18:50:02 [srt:5] 18:50:02.759213 D:SRT.c: worker: result for the unit: INDUCED/CONCLUDING
2019/08/06 18:50:02 [srt:5] 18:50:02.759267 D:SRT.c: worker: RECEIVED PACKET --> updateConnStatus. cst=INDUCED/CONCLUDING id=625214727 pkt-payload-size=1316
2019/08/06 18:50:02 [srt:5] 18:50:02.779222 D:SRT.c: INCOMING PACKET: BOUND=0.0.0.0:8000 TARGET=625214727 CONTROL: shutdown size=4
2019/08/06 18:50:02 [srt:5] 18:50:02.779336 D:SRT.c: %625214727:incoming UMSG:5 (shutdown) socket=%625214727
2019/08/06 18:50:02 [srt:5] 18:50:02.779460 D:SRT.t: recvmsg: DATA COND: KICKED.
2019/08/06 18:50:02 [srt:5] 18:50:02.779547 D:SRT.c: getRcvReadyMsg: found packet, but not ready to play (only in 13.069ms)
2019/08/06 18:50:02 [srt:5] 18:50:02.779690 D:SRT.t: %625214727:tsbpd: EXITING
2019/08/06 18:50:02 [input.srt_5171:3] Feeding failed: Error(Econnlost,Connection was broken.)

It looks like the polling worker receives a shutdown packet just after sending SIGINT..

maxsharabayko commented 5 years ago

@ethouris, @toots. Is there any workaround for this?

toots commented 5 years ago

From my point of view, this is a minor bug that I reported for completeness. I believe that @ethouris is right in that it only occurs in OSX although I've never properly tried to reproduce on linux. For our target use, we are running linux background processes and do not expect to run into this scenario.

ethouris commented 5 years ago

I still don't get what kind of bug is here. In the above log I can see only a reaction on a SHUTDOWN control packet sent from the peer. How exactly this SHUTDOWN from the peer is related to receiving the SIGINT signal on the agent is completely unclear. So this is either a telepathy between two parties, or the agent is actually sending a SHUTDOWN signal first to the peer, so the SHUTDOWN received from the peer is only a "death cry" - but to confirm this I'd need a longer log (there's no such thing in this one visible) and possibly also a log from the peer in order to make sure what really was the reason for the peer to send the SHUTDOWN signal.

toots commented 5 years ago

I agree that the cause of the issue is confusing. However, I wouldn't dismiss it too quickly. We just had the following report, which seems related and only started happening with the 1.4 release: https://github.com/savonet/liquidsoap/issues/965

toots commented 5 years ago

Hi again! I can definitely confirm that https://github.com/savonet/liquidsoap/issues/965 relates to libsrt acting upon SIGINT concurrently to the application. The epoll loop returns an error after sending SIGINT and before the application has been able to do anything about it. It would be nice to get to the bottom of this since, this time it happened on linux and there isn't much that the application can do except unconditionally ignoring this error.

toots commented 5 years ago

Here's the log, from the application. Happy to work on a minimal reproduction case if needed:

2019/09/22 20:53:47 [clock.wallclock_input.srt_6444:3] Streaming loop starts, synchronized with wallclock.
^C2019/09/22 20:53:48 [threads:2] Queue generic queue #1 crashed with exception Error(Einvparam,Operation not supported: Bad parameters)
Raised at file "src/srt.ml", line 73, characters 11-37
Called from file "src/srt.ml", line 283, characters 10-156
rndi commented 5 years ago

Hi @toots ,

Yes, I think having a minimal reproduction case would be great. I just looked through SRT epoll API implementation and I don't see anything that would make it exit on SIGINT on any of the supported platforms. Either my eyes are deceiving me or the problem is elsewhere.

ethouris commented 5 years ago

Well, I took a look at the imp for srt_epoll_wait and the part where system functions are used. But also not, the system functions used there are called with error reported by -1 as the number of ready sockets, all values of 0 or less are completely ignored. This function exits exclusively in case when any sockets are ready or timeout passes.

Anyway, I'd need a log from both sides when this happens, which should embrace at least the last portion of a successful transmission - that's the only way to know why the SIGINT signal has caused the transmission to be interrupted.

maxsharabayko commented 5 years ago

Maybe catch (...) in a couple of places is to be blamed? For example in CUDT::close():

try
{
   for (set<int>::iterator i = m_sPollID.begin(); i != m_sPollID.end(); ++ i)
      s_UDTUnited.m_EPoll.remove_usock(*i, m_SocketID);
}
catch (...)
{
}

@toots An exception can be caught by a debugger, before it is handled. This way you could see the place where it is handled.

toots commented 5 years ago

Hi guys,

Thanks for looking into this. I've been busy with other tasks lately but I'm gonna try to get back at this issue asap.

ethouris commented 4 years ago

Ok, we still don't know what to do with it. @toots, if you are still concerned with this, please reopen the bug and provide the extra information as needed. Now it will be closed due to inactivity.

toots commented 4 years ago

Thank you and thanks for the time spent looking at this issue. I'll let you know if I have more info/can reproduce. So far, our setup is stable and we're happy with it.