Haivision / srt

Secure, Reliable, Transport
https://www.srtalliance.org
Mozilla Public License 2.0
3.1k stars 846 forks source link

[BUG] srt-live-transmit sender stuck, repeatedly #1662

Closed gizahNL closed 3 years ago

gizahNL commented 3 years ago

Describe the bug We are running srt-live-transmit in PULL mode and have a customer connecting to a password protected stream. The past 2 weeks we have been plagued by disconnects and the inability of our customer to reconnect (or at least receive data).

This is then only resolved by us restarting the srt-live-transmit application

The following logmessage at the start of the loss period suggests this is triggered by a ack pkt with a sequence number of -1 17:36:41.963749/SRT:RcvQ:w1*E:SRT.gg: @352733066:ATTACK/IPE: incoming ack seq -1 exceeds current 1863400545 by 284083101! The code for that logmessage can be found from line 8145 in srtcore/core.cpp.

To Reproduce Steps to reproduce the behavior: -setup a password protected srt-live-transmit in listening mode that sends an mpegts stream -connect to it (customer srt: "haivision srt 1.4") -mix in invalid seq's in the ackstream (assuming this is the cause) -see the srt connection breakdown

Expected behavior Invalid sequence data should not destroy a srt session in such a way that a restart of the sending application is required.

maxsharabayko commented 3 years ago

Please specify SRT version.

-mix in invalid seq's in the ackstream (assuming this is the cause)

Could you please elaborate on this? Some weird "ack seq = -1" is received from the peer. Can it be reproduced? As I undersand, when this ACK is received, the connection is closed, and no reconnection follows, right?

Would it be possible for you to collect a network capture (with wireshark) on the receiving side?

gizahNL commented 3 years ago

Please specify SRT version.

sorry forgot to mention: this has been shown (sender) on 1.3 and 1.4.2, receiver: I don't know yet, will let you know when our customer reports back.

-mix in invalid seq's in the ackstream (assuming this is the cause)

Could you please elaborate on this? Some weird "ack seq = -1" is received from the peer. Can it be reproduced?

It happens periodically, after a few days of running srt-live-transmit as sender.

As I undersand, when this ACK is received, the connection is closed, and no reconnection follows, right?

Would it be possible for you to collect a network capture (with wireshark) on the receiving side?

I can setup tcpdump on our side (srt-live-transmit sending in pull mode), with a filter to only take in traffic transmitted to the srt port, that should hopefully limit filesize to a reasonable amount (since it might not be for a few days till it happens again).

gizahNL commented 3 years ago

Ok, it has happened again yesterday. My tcpdump was running so I've got a dump for you of the hour within which it happens. The dump also shows repeated handshake attempts by the other side after the seq -1 message, indicating that it is trying to reconnect but fails to do so. As you can see in the attached image the offending packet is packet no#91679 in the dump. 2020-11-23_09-00

dump-11-22-15_31_1606054891.pcap.gz

I also got confirmation on the version our customer is running, it's 1.4.1

gizahNL commented 3 years ago

Anyway; I guess there actually 2 issues at play here:

I'm unsure as to the cause of the first one, as we do not control the premises of our customer, however the second one should be easily reproducible and fixable.

ethouris commented 3 years ago

Pity that the pcap is only from one direction. I can see it's the receiver, and for me it looks kinda like that the receiver has stopped receiving any data at some point. Following handshakes show that this party was trying to contact the sender without success (all the time it's INDUCTION, which means that the other party doesn't respond at all). Hard to say what the other party was trying to do. I'll try to make a code analysis to see if there is any theoretical possibility to supply -1 as the ACK value.

gizahNL commented 3 years ago

Pity that the pcap is only from one direction. I can see it's the receiver, and for me it looks kinda like that the receiver has stopped receiving any data at some point. Following handshakes show that this party was trying to contact the sender without success (all the time it's INDUCTION, which means that the other party doesn't respond at all). Hard to say what the other party was trying to do. I'll try to make a code analysis to see if there is any theoretical possibility to supply -1 as the ACK value.

The stream is PSK encrypted, not sure if capturing both sides of the convo would help. If it does would it make sense to filter by packet size while dumping (I assume tcpdump is able to do this) filesize would explode by quite a lot when we capture the full stream (and since I cannot predict when it happens again it will become a chore to keep the system of running out of disk).

tbh for me the biggest issue is that our side seems to be locking up after receiving the -1 seq packet, as invalid data/invalid implementations of the SRT protocol should not be able to cause the entire system to deadlock an otherwise functional stream. So far the only resolution has been to restart our side (sending side, listening for connections)

ethouris commented 3 years ago

Ok, I found a potential problem. Looxlike the code in the old UDT didn't predict the situation that the loss list can be modified by withdrawing loss records, while this can happen in SRT due to TLPKTDROP, and the data in the pcap show that it indeed does happen.

gizahNL commented 3 years ago

Does the fix published also resolve the fact that the sender goes into a deadlock when it receives the ackseq -1 message?

ethouris commented 3 years ago

Well, it doesn't, and this requires also a separate fix. This should never happen that ackseq received is -1, and this should be simply rejected.

gizahNL commented 3 years ago

Well, it doesn't, and this requires also a separate fix. This should never happen that ackseq received is -1, and this should be simply rejected.

To me that seems like the higher priority fix ;) Since on our side we (obviously) cannot control what data gets send by the other party rejecting any invalid data seems logical if this data could otherwise deadlock the entire protocol.

ethouris commented 3 years ago

The problem is that conditions that make the connection unable to transmit any data are not always possible to be detected. UDT didn't care about this at all as it was a file transmission tool, so slowdown was allowed any any rate. In live mode SRT the only sensible way to handle it is to break the connection, but the UDT codebase wasn't really prepared for it. This here is an obvious problem of being "too trustful" for the incoming data, but there are more potential problems if we take all these things into account.

gizahNL commented 3 years ago

The problem is that conditions that make the connection unable to transmit any data are not always possible to be detected. UDT didn't care about this at all as it was a file transmission tool, so slowdown was allowed any any rate. In live mode SRT the only sensible way to handle it is to break the connection, but the UDT codebase wasn't really prepared for it. This here is an obvious problem of being "too trustful" for the incoming data, but there are more potential problems if we take all these things into account.

tbh I do not know anything in depth about the SRT protocol. Is SRT very lock-step in that it requires correct responses on both sides on control messages received? My, uninformed, solution would be to just ignore the ackseq -1 packet, and if a response is required respond with whatever would've been the most appropriate (last ackseq number or last ackseq +1 idk).

Anyway, if it's prohibitively difficult to provide a fix/workaround that does not deadlock the protocol I'll just add an assert(0); in the appropriate location since we run it under systemd that'll at least automatically restart the service, without the hassle of writing a log parser to detect a deadlocked situation.

ethouris commented 3 years ago

So, ackseq being -1 was definitely an error in one side, but then the other side shouldn't be also that trustful and interpret it blindly. OTOH it can only check if the ACK value is in the right range, and then later when doing operation on it, it can check if the value "makes sense" and also reject the message if not. But it can't handle any kind of garbage that comes in and looks like proper values.

If there is a situation when the connection is still maintained, but unable to transmit data, we are treating every case of it seriously and try to fix it if detected. Of course, the only thing that could be done when this happens is to forcefully close the connection, as - in distinction to file transmission - inability to process data in time makes the live transmission impossible anyway.

gizahNL commented 3 years ago

So, ackseq being -1 was definitely an error in one side, but then the other side shouldn't be also that trustful and interpret it blindly. OTOH it can only check if the ACK value is in the right range, and then later when doing operation on it, it can check if the value "makes sense" and also reject the message if not. But it can't handle any kind of garbage that comes in and looks like proper values.

If there is a situation when the connection is still maintained, but unable to transmit data, we are treating every case of it seriously and try to fix it if detected. Of course, the only thing that could be done when this happens is to forcefully close the connection, as - in distinction to file transmission - inability to process data in time makes the live transmission impossible anyway.

Yes forcibly closing might also make sense, assuming the other side (SRT 1.4.1) would then try to re-establish the connection again.

gizahNL commented 3 years ago

Thanks @ethouris & others for fixing!