Haivision / srt

Secure, Reliable, Transport
https://www.srtalliance.org
Mozilla Public License 2.0
3.06k stars 840 forks source link

[BUG] (Group) Sequence discrepancy with denied repair should result in closing the failing link #1575

Open ethouris opened 3 years ago

ethouris commented 3 years ago

The problem:

It may happen due to not exactly clean conditions to happen that the overriding of the sequence needs to be done. However, if the overriding would go in wrong direction, it is denied. This is checked in this part of code:

    // Both the scheduling and sending sequences should be fixed.
    // The new sequence normally should jump over several sequence numbers
    // towards what is currently in m_iSndCurrSeqNo.
    // Therefore it's not allowed that:
    // - the jump go backward: backward packets should be already there
    // - the jump go forward by a value larger than half the period: DISCREPANCY.
    const int diff = CSeqNo(seq) - CSeqNo(m_iSndCurrSeqNo);
    if (diff < 0 || diff > CSeqNo::m_iSeqNoTH)
    {
        LOGC(gslog.Error, log << CONID() << "IPE: Overridding with seq %" << seq << " DISCREPANCY against current %"
                << m_iSndCurrSeqNo << " and next sched %" << m_iSndNextSeqNo << " - diff=" << diff);
        return false;
    }

It has been observed sometimes that this override is attempted to happen. This denial must be recognized and reacted, as a link for which this override wasn't done, would be unusable and could cause a break for the whole group. Either we could fix this denial somehow (fix anyway, at least for some narrower range), or if not possible, then the link that was denied overriding must be immediately closed, at least for security reasons (this sequence discrepancy will be detected again and this whole procedure will be repeated).

ethouris commented 3 years ago

The likely cause of the problem:

In the BACKUP groups the IDLE links get updated with the sequence numbers so that when they become activated they send the same packet as the other link with the same sequence number. In general this synchronization happens in three places:

  1. At the connection time, it gets the sequence number of the currently send packet. Some time will surely pass since the moment when this very link will get ready for transmission, so a later override will be still required.
  2. At the time of activation of the link - the sequence number is shifted, likely forwards.
  3. When the link is IDLE, it is periodically (periodic-keepalive) updated with the about-to-send sequence number so that it is "on track" when it happens that it needs to be activated (BACKUP groups only - in BROADCAST and BALANCING groups an activated link is active till its death).

Suspected is the procedure that does the synchronization in a situation when the sender buffer contains more packets than the distance between the last synchronization at connection or periodic-keepalive time, which means that the next packet that is going to be sent over the activated link has a sequence number BACKWARDS towards the current scheduling sequence number that resulted from synchronization.

There are two possibilities of how to fix it:

  1. Allow getting backwards in overrideSeqNo function, just limit the value by which the sequence numbers may differ.
  2. In case of backup group sender, synchronize the sequence number to the value of the last ACK-ed packet (or, more precisely, to the sequence number of the first packet in the group's sender buffer), NOT to the sequence number of the packet about to be sent, as it is now.