Haivision / srt

Secure, Reliable, Transport
https://www.srtalliance.org
Mozilla Public License 2.0

[BUG] Haivision SRT Gateway out of spec behaviour #2839

Open kierank opened 6 months ago

kierank commented 6 months ago

Describe the bug: libsrt should follow the spec

To Reproduce: Send a stream to Haivision Gateway Version 3.7.6 running SRT lib version 1.5.2

Expected behavior: The spec to be followed and ACK and ACKACK to be sent. Instead srt-live-transmit stops transmitting for a bit, decides to drop some messages then remote sends keepalives and no ACK and ACKACKs for the rest of the session even whilst data flows.

Screenshots: image

Can you let me know what I have to implement in my implementation to match libsrt's buggy behaviour?

ethouris commented 6 months ago

Keepalive usually means that for some time the sender side didn't send any data. There can be in general two reasons for it:

  • Your data source has stopped delivering data
  • Sending was paused due to exceeding the sender buffer capacity or the flight window size

Exceeding the flight window size is the likely case here, given that the transmission stopped for some time and resumed after the second ACK signed off more packets. Whether that was the case can be determined from the contents of the ACK packet, so it would help if you attached the pcap and also shared the parameters of the transmission (not only the SRT configuration, but also the stream bitrate).

kierank commented 6 months ago

The sender was sending data, it's a continuous CBR MPEGTS. It never stopped and you can reproduce this issue with libsrt at 9.2 seconds every time.

The out of spec behaviour is the fact data continues to flow for the rest of the session but there are no longer ACK or ACKACK packets.

This is reproducible with my independent SRT implementation (albeit my implementation doesn't decide to randomly drop packets at source).

If I can find a way to remove identifying information from the pcap of mine (good with remote violating the spec) vs libsrt (loss of messages and remote violating spec) I will do so.

This is very clearly a bug on the Haivision Gateway end, removing the tag doesn't change that.

On Mon, 8 Jan 2024, 04:35 Sektor van Skijlen, @.***> wrote:

Keepalive usually means that for some time the sender side didn't send any data. There can be in general two reasons for it:

  • Your data source has stopped delivering data
  • Sending was paused due to exceeding the sender buffer capacity or the flight window size

This exceeded flight window size is a likely case here if we state that the transmission stopped for some time and has resumed after the second ACK has signed off more packets. Whether it was the case, it can be determined from the contents of the ACK packet, so it might help if you attach the pcap and also show parameters of the transmission (not only SRT configuration, but also stream bitrate).


ethouris commented 6 months ago

If you have specified the msgttl parameter for sending and the value is other than -1, then the packet will be dropped if it was not possible to send it in the given time.

There are tools for pcap editing and e.g. changing the IP addresses in the packets.

kierank commented 6 months ago

I am using default srt-live-transmit settings

kierank commented 6 months ago

Libsrt pcap: https://obe.tv/Downloads/srt/flood_libsrt_official_anon.pcapng

You can see that at 9.2 seconds it decides to stop sending and drops messages. Remote stops sending ACKs and instead sends keepalives. There is obviously data still being sent.

Upipe SRT implementation (mine): https://obe.tv/Downloads/srt/flood_upipe_anon.pcapng

The stream continues without dropping messages at 9.2 seconds. But remote no longer sends ACKs and sends keepalives.

This is very obviously a bug with a particular Haivision Gateway version. Haivision Gateway still shows green and no errors in this situation.

Are you able to explain what third-party SRT implementations are meant to do in this situation? Do we need to emulate the gateway's buggy behaviour and report green in spite of the fact no ACKs are being sent?

ethouris commented 6 months ago

Ok, so just as I thought - the stopped transmission was actually caused by the fact that the remote side (you say it's Haivision Gateway) either reads too slowly or has stopped reading. Packets are not being transmitted since that moment on, even if they appear in the pcap. See for yourself that the ACK control packet reports that the receiver buffer has only 2 packet cells left. In this case the sender stops sending and waits for any response from the other side. The other side cannot accept any new packets because it doesn't have where to put them.

If you take a look at all previous ACK packets and the value in ACKD_BUFFERLEFT, you should see that this value systematically decreases until it reaches 2, as shown in the screenshot. This should not normally happen: the "buffer left" value should be more or less stable and represent the "latency buffering fragment" size, possibly plus flight.

image

Yes, the reason why two packets have been dropped while the rest continue, is worth investigating, but then note that the situation you are facing is not an expected behavior. In the file transmission mode it can be recovered by simply waiting until the receiver extracts the packets from the buffer, but there's nothing sensible you can do in the live mode.

@maxsharabayko :

Reproduction procedure: make one app resend the UDP input and the other side establish an SRT connection, but freeze the reading.

To explain is:

kierank commented 6 months ago

Packets are not being transmitted since that moment on, even if they appear in the pcap.

What do you mean by this? The pcap is what the network card puts on the wire. There are clearly data packets in the pcap in the screenshot...

The other side cannot accept any new packets because it doesn't have where to put them.

The Haivision Gateway reports green, it reports an increasing number of received packets and it even handles retransmissions in this broken state.

I ask again, how do we interoperate with this behaviour of the gateway? Do we need to consider the case of no ACK/ACKACK and just keepalive with data flowing as a working state (which is what the gateway does)?

ethouris commented 6 months ago

Packets are not being transmitted since that moment on, even if they appear in the pcap.

What do you mean by this? The pcap is what the network card puts on the wire. There are clearly data packets in the pcap in the screenshot...

But it doesn't mean that anything happens to them on the receiver side.

The other side cannot accept any new packets because it doesn't have where to put them.

The Haivision Gateway reports green, it reports an increasing number of received packets and it even handles retransmissions in this broken state.

I can't speak for Haivision Gateway, maybe this state is kept for some time and then it changes. But what you described definitely isn't the case at least at the range of packets that precede the erroneous behavior. The pcap file doesn't even contain any loss reports, so it definitely doesn't handle retransmissions in this test case.

I'm just saying what I can see in the packets shown in the pcap - the received ACK packet contains information that the receiver buffer on the receiver side is full and simply since some moment the application at the receiver side stopped extracting packets from SRT. So they remain in the buffer until it gets full. As I said, see for yourself the contents of the ACKD_BUFFERLEFT field in the ACK packet's SRT data. You can see also that in the last few ACK packets the number by which the value in this field decreases is equal to the number of packets since the last ACK, or more precisely, to the difference between the ACK sequences in two consecutive ACK packets.
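As a rough illustration of the relationship described above, the following sketch compares two consecutive full ACKs taken from a capture. The names `AckSample`, `seqOffset` and `looksLikeStalledReader` are hypothetical and not part of libsrt; the only assumptions are that sequence numbers are 31-bit values that wrap around and that ACKD_BUFFERLEFT counts free packet cells.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of one full ACK as parsed from a capture.
struct AckSample {
    int32_t ackSeq;      // sequence number carried by the ACK (31-bit value)
    int32_t bufferLeft;  // free receiver-buffer cells reported (ACKD_BUFFERLEFT)
};

// Signed distance from `from` to `to` in 31-bit wrapping sequence space.
int32_t seqOffset(int32_t from, int32_t to) {
    const int64_t mod = int64_t(1) << 31;
    int64_t d = int64_t(to) - int64_t(from);
    if (d > mod / 2) d -= mod;
    else if (d < -mod / 2) d += mod;
    return static_cast<int32_t>(d);
}

// True when the free space shrank by exactly the number of newly acknowledged
// packets, which suggests the receiving application read nothing in between.
bool looksLikeStalledReader(const AckSample& prev, const AckSample& cur) {
    assert(prev.bufferLeft >= 0 && cur.bufferLeft >= 0);
    const int32_t acked = seqOffset(prev.ackSeq, cur.ackSeq);
    const int32_t shrunk = prev.bufferLeft - cur.bufferLeft;
    return acked > 0 && shrunk == acked;
}
```

A single matching pair proves little; the pattern is only meaningful when it holds across a run of consecutive ACKs, as in the capture discussed here.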

I ask again, how do we interoperate with this behaviour of the gateway? Do we need to consider the case of no ACK/ACKACK and just keepalive with data flowing as a working state (which is what the gateway does)?

Note first that the situation when the receiver doesn't read packets shall not happen, but still an application can potentially do it, leaving the sender in trouble. This is handled by the system in case of UDP, but SRT would require a special handling here and it isn't even designed as a concept.

The situation is much easier to recognize - just interpret the contents of the ACK packet. The ACKD_BUFFERLEFT field contains the number that you should treat as a counter of how many packets you can still send. When sending packets has made this value reach zero, you should not send packets anymore, until the situation is resolved. Breaking the connection in such a situation is an acceptable behavior in the live mode. You can also try to recover by waiting for the next ACK, but this is unlikely to do any miracles, and waiting indefinitely for a free space, with held up sending, will then lead to filling up the sender buffer and make the application unable to send.
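One possible way to act on this advice in an independent implementation is sketched below. It is not how libsrt behaves. `ReceiverCredit`, `SendDecision`, the initial credit and the `maxStall` limit are all assumptions made for illustration, and the sketch ignores packets that were already in flight when the ACK was built.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

enum class SendDecision { Send, Hold, BreakConnection };

// Hypothetical sender-side gate driven by the last reported ACKD_BUFFERLEFT.
class ReceiverCredit {
public:
    // initialCredit: assumed to come from the negotiated flow window.
    // maxStall: how long to hold before giving up (a policy choice).
    ReceiverCredit(int initialCredit, std::chrono::milliseconds maxStall)
        : maxStall_(maxStall), credit_(initialCredit) {}

    // Call for every full ACK with the reported free-cell count.
    void onAck(int bufferLeft) {
        assert(bufferLeft >= 0);
        credit_ = bufferLeft;
        if (credit_ > 0) stalled_ = false;
    }

    // Call before sending one new data packet.
    SendDecision beforeSend(Clock::time_point now) {
        if (credit_ > 0) {
            --credit_;  // one receiver cell is now spoken for
            return SendDecision::Send;
        }
        if (!stalled_) {
            stalled_ = true;
            stalledSince_ = now;
        }
        return (now - stalledSince_ >= maxStall_) ? SendDecision::BreakConnection
                                                  : SendDecision::Hold;
    }

private:
    std::chrono::milliseconds maxStall_;
    int credit_;
    bool stalled_ = false;
    Clock::time_point stalledSince_{};
};
```

Bounding the hold time reflects the point made above: in live mode, waiting indefinitely only moves the problem into the sender buffer.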

Normally, when the application is reading packets too slowly, it eventually comes to a "runaway train" situation, that is, the difference between the sequence number of the last buffered packet and that of the incoming packet exceeds the buffer size. In that case the transmission is impossible to recover and the connection breaks. Here the situation is a little different: the receiver doesn't read packets at all and the buffer eventually gets full, but the connection is still maintained, even though recovery isn't possible in this case either.
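The "runaway train" condition, taken literally from the description above, could be expressed as the following check. This is a sketch rather than libsrt code: `isRunaway` and `seqOffset` are hypothetical names, and the exact reference point libsrt uses inside its receiver buffer is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>

// Signed distance from `from` to `to` in 31-bit wrapping sequence space.
int32_t seqOffset(int32_t from, int32_t to) {
    const int64_t mod = int64_t(1) << 31;
    int64_t d = int64_t(to) - int64_t(from);
    if (d > mod / 2) d -= mod;
    else if (d < -mod / 2) d += mod;
    return static_cast<int32_t>(d);
}

// True when the incoming packet is further ahead of the last buffered packet
// than the receiver buffer has cells, so it can no longer be placed.
bool isRunaway(int32_t lastBufferedSeq, int32_t incomingSeq, int32_t bufferCells) {
    assert(bufferCells > 0);
    return seqOffset(lastBufferedSeq, incomingSeq) > bufferCells;
}
```

With a buffer of 8192 cells, for example, a packet arriving 9000 sequence numbers ahead of the last buffered one would trip this check, including across the sequence-number wrap.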

What you pointed out is definitely an unwanted behavior in SRT. I was able to reproduce the behavior you described and I'm still investigating it.

kierank commented 6 months ago

But it doesn't mean that anything happens to them on the receiver side.

Like I said, the Haivision Gateway is doing things with the packets and is in a seemingly functional state. End users are using this behaviour in production today.

I can't speak for Haivision Gateway, maybe this state is kept for some time and then it changes. But what you described definitely isn't the case at least at the range of packets that precede the erroneous behavior. The pcap file doesn't even contain any loss reports, so it definitely doesn't handle retransmissions in this test case.

This is just a sample capture of twenty seconds. Over longer running sessions (as you will see) the behaviour of the Haivision Gateway is functional.

I'm just saying what I can see in the packets shown in the pcap - the received ACK packet contains information that the receiver buffer on the receiver side is full and simply since some moment the application at the receiver side stopped extracting packets from SRT. So they remain in the buffer until it gets full. As I said, see for yourself the contents of the ACKD_BUFFERLEFT field in the ACK packet's SRT data. You can see also that in the last few ACK packets the number by which the value in this field decreases is equal to the number of packets since the last ACK, or more precisely, to the difference between the ACK sequences in two consecutive ACK packets.

BUFFERLEFT can say whatever it wants. Again, the Haivision Gateway is doing things with the packets and is in a functional state from an end-user point of view. We must therefore ignore BUFFERLEFT (as libsrt does currently).

What you pointed out is definitely an unwanted behavior in SRT. I was able to reproduce the behavior you described and I'm still investigating it.

I am glad you are able to reproduce this. From an end-user standpoint SRT is functional with libsrt in this state with a product Haivision has shipped. Therefore this is now de-facto SRT behaviour so a third-party implementation needs to implement this behaviour as a feature. So this needs to be documented somewhere.

This is literally what happened to me, a customer asked why our implementation does not work into Haivision Gateway, but libsrt does.

ethouris commented 6 months ago

But it doesn't mean that anything happens to them on the receiver side.

Like I said, the Haivision Gateway is doing things with the packets and is in a seemingly functional state. End users are using this behaviour in production today.

I'm just telling you what is happening on each side, based on what I can see in the control packets in the pcap file you supplied. It's not my speculation.

This is just a sample capture of twenty seconds. Over longer running sessions (as you will see) the behaviour of the Haivision Gateway is functional.

This pcap also starts with the handshake, so it represents the whole connection session. What I said holds true as long as we are talking about this very connection only.

I'm just saying what I can see in the packets shown in the pcap - the received ACK packet contains information that the receiver buffer on the receiver side is full and simply since some moment the application at the receiver side stopped extracting packets from SRT. So they remain in the buffer until it gets full. As I said, see for yourself the contents of the ACKD_BUFFERLEFT field in the ACK packet's SRT data. You can see also that in the last few ACK packets the number by which the value in this field decreases is equal to the number of packets since the last ACK, or more precisely, to the difference between the ACK sequences in two consecutive ACK packets.

BUFFERLEFT can say whatever it wants. Again, the Haivision Gateway is doing things with the packets and is in a functional state from an end-user point of view. We must therefore ignore BUFFERLEFT (as libsrt does currently).

It doesn't. Here's an excerpt from srt::CUDT::processCtrlAck:

        const int cwnd1   = std::min(int(m_iFlowWindowSize), int(m_dCongestionWindow));
        const bool bWasStuck = cwnd1<= getFlightSpan();
        // Update Flow Window Size, must update before and together with m_iSndLastAck
        m_iFlowWindowSize = ackdata[ACKD_BUFFERLEFT];

Then the too low m_iFlowWindowSize field value can prevent sending packets:

bool srt::CUDT::packUniqueData(CPacket& w_packet)
{
    int current_sequence_number; // reflexing variable
    int kflg;
    time_point tsOrigin;
    int pld_size;

    {
        ScopedLock lkrack (m_RecvAckLock);
        // Check the congestion/flow window limit
        const int cwnd    = std::min(int(m_iFlowWindowSize), int(m_dCongestionWindow));
        const int flightspan = getFlightSpan();
        if (cwnd <= flightspan)
        {
            HLOGC(qslog.Debug,
                    log << CONID() << "packUniqueData: CONGESTED: cwnd=min(" << m_iFlowWindowSize << "," << m_dCongestionWindow
                    << ")=" << cwnd << " seqlen=(" << m_iSndLastAck << "-" << m_iSndCurrSeqNo << ")=" << flightspan);
            return false;
        }

BUFFERLEFT field tells you how many free cells the receiver party has in the receiver buffer at the moment when the ACK packet is being constructed. This is theoretically not the state that should hold forever because the receiver application can extract some packets from the receiver buffer in the meantime as the ACK packet is sent and then received by the sender party. But it would be miraculous if an application cannot read 8192 packets from the receiver buffer for a long time and then suddenly reads the whole buffer in 10ms. So the best behavior would be if you interpret this value, stop sending if it reaches zero, and then wait for the next ACK, which of course may never come in. If you continue to send packets anyway in their normal sequence order, they'll evaporate. If the receiver application starts then reading packets at some point, you'll get into the "runaway train" situation and the receiver will close the connection.

One thing is certain: if you get a BUFFERLEFT value of 0, any packets you send will be dropped. Retransmission will then be requested for them, but only as long as the receiver application can acknowledge more packets, which it can't if it can't put them into the buffer, and that in turn is impossible until the application extracts some packets from the receiver buffer. What happens in this case on the receiver side is part of the design and should be documented. What we are discussing currently is what the sender party should do in such a case. To the best of my knowledge this isn't documented, and the behavior in such a case falls into the category of "undefined behavior", so it doesn't really matter what the current SRT library does. I'm glad you reported this, because what the SRT library currently does is highly unwanted, but it's still not a case of out-of-spec behavior.

I have recommended you, if you have an alternative implementation, what you can do to recognize this behavior properly (relying on KEEPALIVEs is not really reliable) and maintain any valuable resources and predictable behavior of the sender party, when it happens. But you are free to do what you want here - the specification doesn't define it, at least for now.

What you pointed out is definitely an unwanted behavior in SRT. I was able to reproduce the behavior you described and I'm still investigating it.

I am glad you are able to reproduce this. From an end-user standpoint SRT is functional with libsrt in this state with a product Haivision has shipped. Therefore this is now de-facto SRT behaviour so a third-party implementation needs to implement this behaviour as a feature. So this needs to be documented somewhere.

This is literally what happened to me, a customer asked why our implementation does not work into Haivision Gateway, but libsrt does.

I would gladly help you determine the root cause of this, but I frankly believe that in the case you have shown here libsrt got confused, and it can by no means be said that "it works". Of course, the main reason is that the receiver party did things that it shouldn't. Likely this isn't the exact case that your customer has reported. And if it's a case of libsrt working and your implementation not, this is unlikely to be due to the sender party having dropped some scheduled packets.

ethouris commented 6 months ago

Ok, after investigation I found the explanation why there are dropped packets.

This is the sender-side TLPKTDROP functionality (so you won't see this behavior when you set SRTO_TLPKTDROP to false). Dropping too-late packets is done separately on the sender and on the receiver.

On the receiver it's simply done by removing empty cells: when some packets are lost, their cells are empty, but you may have a received packet that follows the loss gap. If that packet's play time has already come, and it is the first valid packet in the buffer, preceded only by a run of lost packets, the loss is removed (the packets that should be there are forgotten as "dropped") and that valid packet is delivered.
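A minimal model of this receiver-side rule is sketched below, assuming a buffer represented as a queue of cells that are either empty or hold a packet with a play time. `Cell` and `dropLeadingLoss` are hypothetical names; the real libsrt receiver buffer is organized differently.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>

using Clock = std::chrono::steady_clock;

// Hypothetical receiver-buffer cell: empty (lost or not yet received)
// or holding a packet with a scheduled play time.
struct Cell {
    bool filled;
    Clock::time_point playTime;
};

// Drops the leading run of empty cells if the first filled cell behind it
// is already due for delivery. Returns the number of cells dropped.
std::size_t dropLeadingLoss(std::deque<Cell>& buffer, Clock::time_point now) {
    std::size_t gap = 0;
    while (gap < buffer.size() && !buffer[gap].filled) ++gap;
    if (gap == 0 || gap == buffer.size()) return 0;  // no leading loss, or nothing valid behind it
    if (buffer[gap].playTime > now) return 0;        // still time for a retransmission
    buffer.erase(buffer.begin(), buffer.begin() + static_cast<std::ptrdiff_t>(gap));
    assert(buffer.front().filled);                   // the due packet is now deliverable
    return gap;
}
```

Note that the rule only fires when a valid packet sits behind the gap, which is why it depends on packets still being inserted into the buffer.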

On the sender side it's done by checking the packet's timestamp recorded at scheduling time against the packet's expected delivery time on the receiver side, as estimated by the sender. All packets that are considered "in the past" by this criterion will be dropped from the sender buffer without trying to send them to the receiver even once. This functionality exists to avoid wasting network capacity on packets that the receiver side would definitely drop. It's hard to say how much it actually helps, because when you receive an ACK from a receiver that has already dropped some empty cells, those dropped packets will be "fake-acknowledged", that is, acknowledged for the sender. But it may gain significance at higher bitrates, especially ones so high that "lite ACK" packets start appearing, which only cause removal of packets from the buffer but do not update the receiver buffer information.
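The sender-side criterion can be modelled roughly as follows. This is a simplified sketch: `Scheduled`, `dropTooLate` and `dropThreshold` are hypothetical, and the way libsrt actually derives its threshold from the latency and related options is not reproduced here.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>

using Clock = std::chrono::steady_clock;

// Hypothetical entry in the sender buffer.
struct Scheduled {
    int seq;                        // packet sequence number
    Clock::time_point scheduledAt;  // when the application scheduled the packet
};

// Removes from the front of the queue every packet whose estimated delivery
// deadline (scheduledAt + dropThreshold) has already passed.
// Returns the number of packets dropped without being sent.
std::size_t dropTooLate(std::deque<Scheduled>& queue,
                        Clock::time_point now,
                        std::chrono::milliseconds dropThreshold) {
    assert(dropThreshold.count() >= 0);
    std::size_t dropped = 0;
    while (!queue.empty() && queue.front().scheduledAt + dropThreshold < now) {
        queue.pop_front();
        ++dropped;
    }
    return dropped;
}
```

In a stalled session like the one in this thread, packets held back by the flow window age in the queue until they cross the threshold, which would match the small number of sender-dropped packets seen in the capture.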

You can prevent this completely by setting SRTO_TLPKTDROP to false and on the sender side only by setting SRTO_SNDDROPDELAY to -1.

kierank commented 6 months ago

Can you confirm the Haivision gateway sends retransmissions requests in this state (of no ACK/ACKACK) and otherwise functions correctly?

ethouris commented 6 months ago

I can't speak for Haivision Gateway. I can speak for SRT that is in the use of Haivision Gateway.

If we are talking about the situation shown in the pcap you attached, there are no retransmission requests there.

Retransmission requests are sent by the receiving party, exactly like ACKs. So if the receiver party doesn't send retransmission requests (and we state no UDP packets carrying it have been lost), it doesn't need a retransmission. In this case we have a situation that all received packets are dropped, so retransmission necessity can't be even recognized.

kierank commented 6 months ago

The packet capture is a short capture from the beginning of a session to demonstrate the loss of ACK/ACKACKs. There were no losses during these ten seconds but they can happen over time.

My point is that Haivision Gateway (with whatever SRT it runs inside, I have no idea) does send retransmissions over a period of several hours as they happen on the network. Therefore to the end user this is functional SRT irrespective of the lack of ACK/ACKACKs.

Please understand that SRT implementors have to implement libsrt's bugs in order to have a matching end-user experience. This is behaviour that Haivision has shipped to customers and so it's de-facto SRT behaviour whether you agree with it or not.

Are you able to confirm on your end that in longer running sessions that exhibit the behaviour in this ticket SRT does send retransmissions but no ACK/ACKACKs?

ethouris commented 6 months ago

The packet capture is a short capture from the beginning of a session to demonstrate the loss of ACK/ACKACKs. There were no losses during these ten seconds but they can happen over time.

Here is the beginning: image ... and the end: image

Which means that this is the whole session since the handshake up to shutdown, not a fragment of any session.

No losses happened during the whole session. There were some sender-dropped packets, but they have happened due to a stuck reader.

My point is that Haivision Gateway (with whatever SRT it runs inside, I have no idea) does send retransmissions over a period of several hours as they happen on the network. Therefore to the end user this is functional SRT irrespective of the lack of ACK/ACKACKs.

ACKs are sent every 10ms, as long as there's anything to acknowledge.

Please understand that SRT implementors have to implement libsrt's bugs in order to have a matching end-user experience. This is behaviour that Haivision has shipped to customers and so it's de-facto SRT behaviour whether you agree with it or not.

I still can't see any buggy behavior here. If you say that the receiver in this session was Haivision Gateway, then the latter is responsible for what happened here, not SRT.

SRT doesn't specify what a particular party should do in response to a situation visible in the incoming data packets. That would be impossible to satisfy, given that SRT cannot guarantee that the application is using it the proper way. It can only specify how particular control commands should be understood. In SRT, for example, ACKs are sent when packets were successfully retrieved (or dropped) and LOSSREPORTs are sent when a loss is detected. But whether any of these things happen depends on what happened in the application, not on what happened on the protocol line.

Are you able to confirm on your end that in longer running sessions that exhibit the behaviour in this ticket SRT does send retransmissions but no ACK/ACKACKs?

A situation in which ACKs stop being transmitted means that the receiver buffer is full, and from that moment anything is possible and allowed. Retransmissions will be sent only when the receiver party requests them, and it will do so if it has a chance to receive a valid packet and place it into the receiver buffer.

A situation in which ACKs do not come in but LOSSREPORTs do sounds impossible to me, because if the buffer is full, loss detection can't be done, and if the buffer is not full (unless the "runaway train" situation happened) while packets still come in, ACKs should still be sent.

kierank commented 6 months ago

It's a sample capture to demonstrate the behaviour made specifically for the purpose of including the handshake and shutdown. I don't see why this is hard to understand.

If you let another session with this broken behaviour run for an hour or two you will see retransmission requests but no ACK/ACKACK. From an end user point of view this is functional SRT behaviour.

I am explaining that in general the Haivision Gateway functions as normal sending retransmission requests in this scenario. Haivision has shipped a product with this and by extension it is de-facto behaviour of SRT. Note how the owner of this repository is "haivision". It's not always possible to make the Haivision Gateway behave like this, some sessions have ACK/ACKACK and some do not (I have no idea why it does this).

A situation in which ACKs stop being transmitted means that the receiver buffer is full, and from that moment anything is possible and allowed. Retransmissions will be sent only when the receiver party requests them, and it will do so if it has a chance to receive a valid packet and place it into the receiver buffer.

Nothing of the sort is mentioned here: https://github.com/Haivision/srt-rfc/ Nor should it be as this is your implementation specific behaviour that you describe.

A situation in which ACKs do not come in but LOSSREPORTs do sounds impossible to me, because if the buffer is full, loss detection can't be done, and if the buffer is not full (unless the "runaway train" situation happened) while packets still come in, ACKs should still be sent.

This is exactly what the Haivision Gateway does. I can't reproduce the lack of ACK/ACKACK often, but if you say you can reproduce it, then you will be able to reproduce the LOSSREPORTs too.

ethouris commented 6 months ago

It's a sample capture to demonstrate the behaviour made specifically for the purpose of including the handshake and shutdown. I don't see why this is hard to understand.

Then you have provided me with a recorded pcap showing a different behavior than the hypothetical one that you are asking me to confirm or deny as conforming to the specification.

If you let another session with this broken behaviour run for an hour or two you will see retransmission requests but no ACK/ACKACK. From an end user point of view this is functional SRT behaviour. ... I am explaining that in general the Haivision Gateway functions as normal sending retransmission requests in this scenario.

I can at best take your word for it, because we have never seen such behavior, nor do I recall it ever being reported. Including this time.

It's not always possible to make the Haivision Gateway behave like this, some sessions have ACK/ACKACK and some do not (I have no idea why it does this).

The specification only says that ACK packets of normal (full) type are sent every 10ms. It doesn't mean simultaneously that the sender application should expect to receive them that timely. For example, if you simply stop sending data, ACKs will also stop coming in and KEEPALIVES will be exchanged.

A situation in which ACKs stop being transmitted means that the receiver buffer is full, and from that moment anything is possible and allowed. Retransmissions will be sent only when the receiver party requests them, and it will do so if it has a chance to receive a valid packet and place it into the receiver buffer.

Nothing of the sort is mentioned here: https://github.com/Haivision/srt-rfc/ Nor should it be as this is your implementation specific behaviour that you describe.

The protocol specification doesn't describe in which situation the receiver application should send ACK, LOSSREPORT or other packets. Only what should be understood from the fact of sending them and from the data they provide.

A situation in which ACKs do not come in but LOSSREPORTs do sounds impossible to me, because if the buffer is full, loss detection can't be done, and if the buffer is not full (unless the "runaway train" situation happened) while packets still come in, ACKs should still be sent.

This is exactly what the Haivision Gateway does. I can't reproduce the lack of ACK/ACKACK often, but if you say you can reproduce it, then you will be able to reproduce the LOSSREPORTs too.

The problem is that what I have reproduced was the behavior you provided a pcap for, and it's the behavior in which the receiver doesn't read packets. If you have UDP packets dropped on the link, then definitely you'll see LOSSREPORTs, but only as long as the packets are being received (inserted into the receiver buffer).

I can imagine a situation in which a loss was not recovered before the receiver buffer got full. ACKs then can't be sent back anymore, but LOSSREPORTs still come in: the loss was detected once, before the buffer was full (so a packet could be inserted into the buffer and the gap detected), the retransmission then did not arrive for some time, and the LOSSREPORT was sent again, possibly multiple times, according to the "periodic nakreport" behavior. This would be really unusual and would involve the really bad luck of having the retransmitted packet lost again every time. Normally, with TLPKTDROP on, it would result in this packet being dropped on the sender, and the LOSSREPORT would then be answered with DROPREQ. If the packet is then dropped on the receiver, the next ACK will acknowledge it, but dropping is only possible if packets are being read by the receiver application.

In other words, the receiver application is free to send or not send any kind of control packets with no reason, except that it is obliged to not lie about the situation that particular control packet claims to confirm (this also includes not sending a packet, as it's in the case of KEEPALIVE, for example). The SRT protocol implementation that reads these packets is only required to properly understand them in case they are received, but it can't expect any kind of packet to be received in any situation. This is UDP and every packet may be potentially dropped on the link and the application receiving them should be prepared to have any kind of packet missed.

kierank commented 6 months ago

I am so happy we wrote our own implementation. I have no words, really, no words.