Haivision / srt

Secure, Reliable, Transport
https://www.srtalliance.org
Mozilla Public License 2.0

srt_send() error SRT_ECONNLOST or SRT_EINVSOCK depending on peer disconnection time #2098


lelegard commented 3 years ago

Hi,

I have a question regarding the error which is returned by srt_send() after the peer disconnects (more precisely the error code which is returned by srt_getlasterror() after srt_send() fails).

This issue was originally raised by a TSDuck user using the srt output plugin (https://github.com/tsduck/tsduck/issues/850). This is just to give some context, the current issue is more focused on the SRT behaviour.

Consider the following scenario.

A receiver application works in listener mode and just consumes SRT data without doing anything.

A sender application works in caller mode, connects to the receiver and sends packets using srt_send(). At some point, the receiver disconnects (I interrupt it using Ctrl-C). On the sender side, the next srt_send() fails and srt_getlasterror() returns SRT_ECONNLOST. So far, so good, it seems normal.

Then, slightly modify the story. After the receiver terminates, the sender does not call srt_send() immediately. Instead, without doing anything on the SRT socket, it just waits "some time" (say 30 seconds) and then calls srt_send(). In that case, srt_getlasterror() returns SRT_EINVSOCK ("Invalid socket ID").

Is this difference normal? It does not look normal to me that the same conditions report different error codes. The distinction matters here. SRT_ECONNLOST is a normal operational error, a disconnection of the peer, a situation that the application is supposed to handle somehow. SRT_EINVSOCK, on the other hand, is a severe error. It suggests an internal error in the application: an invalid value in the int variable containing a socket id, a memory corruption, or an incorrect order of calls such as sending on a socket that was not opened. Either way, the application is supposed to take some harsher action than after a normal disconnection of the peer, maybe terminate on some "internal error" or "send a bug report".

Because of the difference in severity between the two error codes, for the same expected operational event, I would vote for a bug in libsrt. What do you think?

I suspect that some internal thread in libsrt cleans up a socket in error after a few seconds without user access. But, if my assumption is correct, I do not think this is an appropriate behaviour. The application opened the socket. As long as the application does not explicitly close it, the application has the right to consider it open. If there was an error on the socket, it expects to receive the appropriate error code (SRT_ECONNLOST in the case of a disconnection of the peer). But the socket id must remain valid as long as the application has not closed it.

Additionally, socket ids are just integer values. As long as a socket is in an error state but still open, its id cannot be reused. So, if a socket is arbitrarily closed by libsrt, without an explicit close from the application, I assume that the integer value that was used as its socket id can be reused by another socket created by some other, independent thread in the application. In the meantime, the original application thread that calls srt_send() on its "valid" socket id may now be using an id that was reused by accident, and send data somewhere else. That is much worse than getting an "invalid socket id" error. Maybe this scenario is impossible if some mechanism prevents the reuse of socket ids. But the idea of a socket being closed outside the control of the application opens the door to many scary scenarios.

Thanks for your feedback on this question.

maxsharabayko commented 3 years ago

Hi @lelegard. I would say you provide extremely valuable feedback with your question!

> Then, slightly modify the story. After the receiver terminates, the sender does not call srt_send() immediately. Instead, without doing anything on the SRT socket, it just waits "some time" (say 30 seconds) and then calls srt_send(). In that case, srt_getlasterror() returns SRT_EINVSOCK ("Invalid socket ID"). (...) I suspect that some internal thread in libsrt cleans up a socket in error after a few seconds without user access.

The current logic of SRT is that a broken socket is scheduled for automatic closure by SRT's garbage collector thread. Internally a socket may still exist in the "closed" state, but if so, it is no longer visible to the application, as its resources can be freed at any moment. So your assumption is correct.

This indeed provides a certain inconvenience, especially when an SRT socket is used in a blocking mode (like in your example). When an SRT socket is used in a non-blocking mode, an application can get a notification about an error on the socket through srt_epoll_wait(..) (if subscribed to SRT_EPOLL_ERR). In this case, an application is notified that an SRT socket is now broken, and thus the app can handle this event by, e.g. no longer using the socket ID, re-establishing a connection, etc.
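To illustrate, here is a minimal sketch (untested, and not from this thread) of the non-blocking pattern described above, assuming libsrt's public C API and an already-connected socket. The helper name `watch_for_break` is hypothetical; error handling is abbreviated.

```c
// Sketch: detect a broken SRT socket via the epoll facility instead of
// discovering it through a failing srt_send(). Assumes <srt/srt.h> and
// an already-connected SRTSOCKET in non-blocking mode.
#include <srt/srt.h>
#include <stddef.h>

void watch_for_break(SRTSOCKET sock)
{
    int eid = srt_epoll_create();
    int events = SRT_EPOLL_IN | SRT_EPOLL_ERR;   // subscribe to errors too
    srt_epoll_add_usock(eid, sock, &events);

    SRTSOCKET ready[1];
    int rnum = 1;
    // Poll with a 100 ms timeout; a broken socket is reported once
    // SRT_EPOLL_ERR is subscribed.
    if (srt_epoll_wait(eid, ready, &rnum, NULL, NULL, 100,
                       NULL, NULL, NULL, NULL) > 0) {
        if (srt_getsockstate(ready[0]) == SRTS_BROKEN) {
            // Connection lost: close promptly, before the GC thread
            // invalidates the ID behind the application's back.
            srt_epoll_remove_usock(eid, ready[0]);
            srt_close(ready[0]);
        }
    }
    srt_epoll_release(eid);
}
```

The point of the sketch is the ordering: the application learns about the broken state from the epoll event and closes the socket itself, rather than racing the garbage collector.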

Changing the existing behavior could cause a lot of pain both to the SRT library and to 3rd-party projects already using it. Still, if we think about how to provide the desired behavior (expect the application to close a broken socket), an SRT socket option defining the type of behavior could be added, allowing both the current and the new desired behavior for backward compatibility. But that would mean a lot of effort, including further maintenance of both types of behavior.

lelegard commented 3 years ago

Hi @maxsharabayko

In software engineering terms, an API which provides some "open / close" or "create / delete" semantics on a resource has no right to deallocate the resource or invalidate the "resource handle" outside application control. As long as the application has not closed the resource, the "resource handle", here the int socket id value, MUST remain valid. In case of severe error, the internal state of the resource may have been deallocated or cleaned up. But the resource handle shall remain valid, at least to report a proper error state. Only when the application explicitly decides to close / deallocate / whatever the resource may the handle be invalidated.

By invalidating a resource handle before the application can observe the error state, you prevent the application from realizing that there was an error. In French, we call that "sawing off the branch you sit on". In English, I think we could say "shooting yourself in the foot".

In my opinion, invalidating a resource handle outside application control is a severe design error. Deallocating the resource's internal data (including, with SRT, the underlying internal UDP socket) is OK because it can be part of the semantics you define. But the resource handle must still hold some minimal state, or error state, until an explicit close from the application.

I understand that libsrt was designed with a specific application architecture in mind. For SRT-centric applications, it may make sense to have some event-driven architecture based on the libsrt API. But the question is: do you want to restrict SRT to applications designed according to a specific architecture, or do you want to also allow SRT as a generic transport layer, as a replacement for other transports?

The same question applies to HTTP, for instance. Using libcurl, there are two levels of API. One level is event-based (using callbacks) and the other is based on traditional synchronous read/write operations. The first level is used by libcurl-centric applications (maybe the curl command itself). The second level is used by applications with interchangeable transport layers.

TSDuck, with its plugin architecture, makes very few assumptions about input and output plugins, only the availability of "start/read/stop" and "start/write/stop" mechanisms. The http plugin uses the synchronous read/write of libcurl. The srt plugin similarly uses synchronous read/write operations on SRT sockets. The buffered multi-threaded architecture of TSDuck is such that the input and output plugins are always ready to read or write on their respective devices. So, it usually works. But there are specific features such as "suspending a plugin", which was used by the user who reported the problem. Suspending an output plugin means that the TS pipeline is temporarily disconnected from the output (the packets are dropped). When the user resumed the output plugin, the SRT socket was not only disconnected but also deallocated.

I think we can live with that. The user submitted a PR to TSDuck treating SRT_EINVSOCK as a synonym for SRT_ECONNLOST and I merged it. So we have a workaround, but a rather dirty one.
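The workaround amounts to collapsing the two error codes into one condition after a failed send. A minimal sketch, assuming libsrt's C API; `send_chunk` is a hypothetical helper, not the actual TSDuck code.

```c
// Sketch of the workaround: treat SRT_EINVSOCK after a send failure as
// equivalent to SRT_ECONNLOST, since the GC thread may have already
// invalidated the ID of a broken socket. Assumes <srt/srt.h>.
#include <srt/srt.h>
#include <stddef.h>

int send_chunk(SRTSOCKET sock, const char *buf, int len)
{
    if (srt_send(sock, buf, len) == SRT_ERROR) {
        int err = srt_getlasterror(NULL);
        if (err == SRT_ECONNLOST || err == SRT_EINVSOCK) {
            return -1;  // peer gone (or socket already GC'ed): reconnect or stop
        }
        return -2;      // any other code: genuine usage or internal error
    }
    return 0;
}
```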

What about the potential reuse of integer values for SRT socket handles in the meantime?

maxsharabayko commented 3 years ago

@lelegard

> In software engineering terms, an API which provides some "open / close" or "create / delete" semantics on a resource has no right to deallocate the resource or invalidate the "resource handle" outside application control. As long as the application has not closed the resource, the "resource handle", here the int socket id value, MUST remain valid. In case of severe error, the internal state of the resource may have been deallocated or cleaned up. But the resource handle shall remain valid, at least to report a proper error state. Only when the application explicitly decides to close / deallocate / whatever the resource may the handle be invalidated.

Absolutely agree with you on this! In this case we are dealing with a legacy design error. Not having control over all the integrations out there, changing this behavior needs extremely delicate handling. C'est la vie: even the sun has its dark spots. I'm sure a man is walking somewhere, limping on his right leg.

> What about the potential reuse of integer values for SRT socket handles in the meantime?

It is another possible negative outcome of closing a socket internally. However, it can happen only under very specific conditions. When an SRT instance is initialized (srt_startup(..) or, in fact, the constructor of CUDTUnited), an initial random socket ID is determined. All consecutive sockets get an ID decremented by one (see CUDTUnited::generateSocketID(..)) until all IDs are occupied. In that case SRT tries to find an unoccupied value, or fails to create the socket. See also #1823.

Socket ID value range: the most significant bit 31 (the sign bit) is left unused, so that checking for a value <= 0 identifies an invalid socket ID.
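The allocation scheme just described can be modeled in a few lines. This is a toy illustration of the behavior (random positive 31-bit start, decrement per socket, sign bit never used), not the actual CUDTUnited::generateSocketID() code; all names here are invented for the demo.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

// Toy model of SRT's socket-ID allocation as described above:
// a random positive 31-bit starting ID, decremented for each new socket.
static int32_t next_id;

void model_startup(unsigned seed)
{
    srand(seed);
    // Random value in roughly the upper half of the positive 31-bit range,
    // so the demo never gets near zero.
    next_id = (int32_t)(rand() % 0x3FFFFFFF) + 0x40000000;
}

int32_t model_new_socket(void)
{
    return next_id--;   // each socket gets the previous ID minus one
}

int model_id_is_valid(int32_t id)
{
    return id > 0;      // sign bit unused: any value <= 0 is invalid
}
```

With a random start far from zero, a freshly closed ID is not handed out again until the counter wraps through the whole 31-bit space, which is why accidental reuse is so unlikely in practice.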

lelegard commented 3 years ago

> When an SRT instance is initialized (srt_startup(..) or, in fact, the constructor of CUDTUnited), an initial random socket ID is determined. All consecutive sockets get an ID decremented by one.

OK, thanks. Although reuse is theoretically possible, the probability of a reused id is effectively zero in practice. I was afraid of something like Unix file descriptors: if you close(1), the next open() almost always returns 1.
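The Unix behavior mentioned here is easy to demonstrate: POSIX requires open() to return the lowest-numbered available descriptor, so a just-closed fd number is immediately handed back. A small self-contained check (the helper name is invented for the demo):

```c
#include <fcntl.h>
#include <unistd.h>

// Demonstrates the POSIX rule discussed above: open() returns the lowest
// available file descriptor, so a freshly closed descriptor number is
// reused by the very next open(). This is exactly the kind of immediate
// handle reuse that SRT's random-start, decrementing IDs avoid.
int fd_is_reused(void)
{
    int fd1 = open("/dev/null", O_RDONLY);
    close(fd1);
    int fd2 = open("/dev/null", O_RDONLY);  // gets the same number back
    int reused = (fd2 == fd1);
    close(fd2);
    return reused;
}
```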

rationalsa commented 1 year ago

This is honestly such a broken design that it needs to be stated very explicitly and very visibly in the documentation that sockets are expected to just disappear on you. I've not seen any other API that works like this. It sounds like it's pretty much impossible to even write non-racy code using libsrt.

E.g. the issue I was debugging before finding this obscure GH issue was that I was getting some occasional:

```
*E:SRT.ea: remove_usock: @<insert_sock_id> not found as either socket or group. Removing only from epoll system.
```

followed by

```
GC*E:SRT.ei: epoll/update: IPE: update struck E1 which is NOT SUBSCRIBED to @<insert_sock_id>
```

when calling srt_epoll_remove_usock(<insert_sock_id>) and srt_close(<insert_sock_id>) after getting an SRT_EPOLL_ERR event on the very socket.

So you have three bad options:

1. Remove the socket from the epoll and close it on SRT_EPOLL_ERR, and expect to sometimes see the runtime warnings above.
2. Don't close the socket, let the SRT_EPOLL_ERR spam your epoll until the GC closes the socket, and hope you don't end up with more closed-but-not-garbage-collected sockets than your srt_epoll_wait() rnum/wnum.
3. Periodically poll all your sockets with a function that will return SRT_EINVSOCK to check whether they're still connected, rather than using epoll.

lelegard commented 1 year ago

@rationalsa, as original author of this issue, I support your view. Definitely.

@maxsharabayko, I understand that there was an original design error and you quite honestly acknowledged that at the time of this report. I also understand that the problem may be difficult to fix.

But, now, two years have passed since I originally reported this. It is time to do something at last, don't you think?

Additionally, since then, an SRT competitor has emerged with RIST. As the author of TSDuck, I don't want to express any judgment or preference. TSDuck now has srt and rist input and output plugins; both options are possible. Honestly, I don't use either of the two myself, so I could simply say that I don't care. Libsrt and librist each have their own interesting ideas and pitfalls. Users will judge the reliability of the two, in general, not only in a TSDuck context. Normally, only the qualities of the respective protocols should matter in that comparison. In the presence of the RIST competition, it would be a pity if SRT suffered from a bad reputation only because of some stupid internal socket state handling. This is why I recommend fixing this at last.