Glimesh / janus-ftl-plugin

A plugin for the Janus WebRTC gateway to enable relaying of audio/video streams utilizing Mixer's FTL (Faster-Than-Light) protocol.
https://hayden.fyi/posts/2020-08-03-Faster-Than-Light-protocol-engineering-notes.html
GNU Affero General Public License v3.0

Improved NACKs #135

Closed danstiner closed 2 years ago

danstiner commented 2 years ago


Background

Shoutout to Hayden for writing the original NACK (negative acknowledgement of missing packets) feature. It mostly worked but had some issues with sequence number wrap-around (to be fair, I introduced a number of bugs when I touched the NACK code recently). If I remember correctly, streams could break if there was significant packet loss (and would not recover when the loss went away). This left the feature unusable, and it was turned off.
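For context, wrap-around bugs usually stem from comparing 16-bit sequence numbers directly; the classic wrap-safe comparison uses serial arithmetic. A minimal illustration (not taken from this PR's code):

```cpp
#include <cstdint>

// Wrap-around-safe comparison of 16-bit RTP sequence numbers
// (serial arithmetic in the style of RFC 3550); shown for
// illustration only, not taken from this PR's code.
static bool SeqGreaterThan(uint16_t a, uint16_t b)
{
    // The signed 16-bit difference is positive when `a` is ahead of
    // `b`, even across the wrap from 65535 back to 0.
    return static_cast<int16_t>(static_cast<uint16_t>(a - b)) > 0;
}
// e.g. SeqGreaterThan(2, 65534) is true: 2 comes "after" 65534 post-wrap.
```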

The NACK feature has always been difficult to test and is tied to the keyframe packet collection. This PR is my second attempt at fixing the NACK feature, keeping the core ideas of the old code while starting to pull the logic into separate classes for clarity and testability.

Goals

I think meeting those goals will yield a "good enough" solution, but if I had unlimited time I'd aim for something frame-aware that can do things like prioritize keyframe packet loss and estimate when a packet is "late" based on frame timestamps, jitter estimates, etc. Those are all things the client does with its "playout buffer" data structure; we may be able to take inspiration from that and adapt the ideas to be relevant on the server.

Status

Tested locally, and as of 0d23274 it seems effective and stable. Overnight stability tests show it is capable of reducing 1% packet loss to about 0.01%, which is exactly 1% × 1% = 0.01%, as expected (re-transmitted packets can still be lost; the 1% loss rate in this local test applies to them too, and this code does not try to re-transmit a second time). This should be the difference between bad frames every couple of seconds and bad frames more like once a minute.

Details

There are roughly two parts to the tracking and NACK process:

First, when a packet arrives, it is both fed to the sequence tracker instance for the source and relayed immediately to all client sessions. This follows the previous behavior of this code: no latency is added to packet relaying. The sequence tracker checks whether the sequence number is a re-transmit prompted by a prior NACK. If so, that NACK is marked as received and nothing else happens. If this is a new packet, it is added to an internal buffer in sequence order.

Second, when it's time to send NACKs, we iterate through the buffer of received packets and look for gaps in the sequence numbers. Any gaps are marked as "missing" packets, and NACKs are sent for them. As mentioned above, when NACK'd packets are re-transmitted and received, they are removed from the missing list. If the re-transmit never arrives, we eventually time out the NACK and stop tracking the missing packet.
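A minimal sketch of those two steps, assuming hypothetical class and member names (the real code in this PR is organized differently and has more bookkeeping):

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical sketch of the tracking/NACK flow described above.
class SequenceTracker
{
public:
    // Step 1: called for every arriving packet, after it has already
    // been relayed to viewers (tracking adds no relay latency).
    void OnPacketReceived(uint16_t seq)
    {
        int64_t extended = Extend(seq);
        if (outstandingNacks.erase(extended) > 0)
            return;                 // re-transmit answering a prior NACK
        received.insert(extended);  // new packet, kept in sequence order
    }

    // Step 2: called periodically; returns sequence numbers to NACK.
    std::vector<uint16_t> CollectNacks()
    {
        std::vector<uint16_t> toNack;
        if (received.size() < 2)
            return toNack;
        // Walk consecutive received packets looking for gaps.
        auto it = received.begin();
        int64_t prev = *it++;
        for (; it != received.end(); ++it)
        {
            for (int64_t missing = prev + 1; missing < *it; ++missing)
            {
                // Only NACK a given packet once.
                if (outstandingNacks.insert(missing).second)
                    toNack.push_back(static_cast<uint16_t>(missing));
            }
            prev = *it;
        }
        // (A real implementation would also time out stale NACKs and
        // trim `received` so the buffer stays bounded.)
        return toNack;
    }

private:
    // Unwraps 16-bit RTP sequence numbers into a monotonically
    // increasing 64-bit counter so ordering survives wrap-around.
    int64_t Extend(uint16_t seq)
    {
        if (!initialized)
        {
            initialized = true;
            highest = seq;
            return highest;
        }
        int16_t delta = static_cast<int16_t>(
            seq - static_cast<uint16_t>(highest));
        int64_t extended = highest + delta;
        if (extended > highest)
            highest = extended;
        return extended;
    }

    bool initialized = false;
    int64_t highest = 0;
    std::set<int64_t> received;
    std::set<int64_t> outstandingNacks;
};
```

Extending sequence numbers to a 64-bit counter is one way to sidestep the wrap-around bugs mentioned above, since ordering and gap arithmetic then work on plain integers.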

That's the gist of it, but there are some additional complexities.

I also snuck in a change to the pending/current keyframe packet collection logic in FtlMediaConnection. It now checks that the packets are sequential without gaps and that the final packet has a marker bit before it considers a keyframe valid. This should prevent most cases of corrupt frames being used as stream thumbnails.
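A sketch of roughly what that validity check looks like, with hypothetical names rather than the actual FtlMediaConnection members:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical packet record; the real code carries full RTP packets.
struct KeyframePacket
{
    uint16_t sequenceNumber;
    bool markerBit;
};

// A keyframe is only considered valid if its packets are sequential
// with no gaps and the final packet carries the RTP marker bit
// (which flags the end of a video frame).
static bool IsKeyframeComplete(const std::vector<KeyframePacket>& packets)
{
    if (packets.empty())
        return false;
    for (size_t i = 1; i < packets.size(); ++i)
    {
        // Wrap-safe check that each packet directly follows the last.
        uint16_t expected =
            static_cast<uint16_t>(packets[i - 1].sequenceNumber + 1);
        if (packets[i].sequenceNumber != expected)
            return false;
    }
    return packets.back().markerBit;
}
```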

Closes #129

Potential Future Improvements

clone1018 commented 2 years ago

Hey @danstiner, I deployed this to the infrequently used do-sfo3-ingest1.ksfo.live.glimesh.tv server and was able to test NACKs working successfully on a mildly unstable network. I'll keep it running on the server for a bit if you want to do any testing on it. Once you're comfortable, we can get some more servers / users testing out this new improvement!

Thanks again for your hard work on this!

danstiner commented 2 years ago

Thanks for deploying @clone1018, good to see you and excited to see this out in the wild! Looks like SmashBets is using it right now; we should ask them for feedback. I also tested a bit and can confirm it appears to be helping. It isn't as good in the real world as in my local testing, since sometimes re-transmitted packets arrive too late and video/audio skips can still happen, but I think it is an improvement over the current behavior.

I don't have any issues with letting do-sfo3-ingest1 sit for a bit until Hayden reviews this. Then after merge the change can slowly be rolled out more widely. However, I'd love to also apply the playout-delay patch at some point, maybe as a combined rollout. Something like a 400ms delay should both enhance this change by giving more time for late-arriving re-transmits and help with the issue of large keyframes causing stutters.
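For reference, "playout-delay" here refers to the WebRTC playout-delay RTP header extension (http://www.webrtc.org/experiments/rtp-hdrext/playout-delay), which carries 12-bit minimum and maximum delays in units of 10ms. A minimal sketch of encoding its 3-byte payload, assuming the patch attaches it roughly this way:

```cpp
#include <array>
#include <cstdint>

// Encodes the 3-byte payload of the playout-delay header extension:
// 12 bits of MIN delay followed by 12 bits of MAX delay, each in
// units of 10ms. A sketch, not the patch's actual code.
static std::array<uint8_t, 3> EncodePlayoutDelay(
    uint16_t minDelayMs, uint16_t maxDelayMs)
{
    uint16_t min10ms = minDelayMs / 10;  // e.g. 400ms -> 40 units
    uint16_t max10ms = maxDelayMs / 10;
    return {
        static_cast<uint8_t>(min10ms >> 4),
        static_cast<uint8_t>(((min10ms & 0x0F) << 4) | (max10ms >> 8)),
        static_cast<uint8_t>(max10ms & 0xFF),
    };
}
```

With a minimum delay of 400ms, the viewer's browser is asked to hold at least that much buffer, giving late re-transmits time to land before playout.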

Test 1

This chart from chrome://webrtc-internals/ when watching https://glimesh.tv/SmashBets shows video packets being marked as lost by Chrome and then unmarked when the re-transmit arrives. The loss is not completely mitigated but this change is helping quite a bit:


A similar thing happens for audio, but due to issues discussed below, the audio packet re-transmits arrive too late and are discarded anyway most of the time. This is unfortunate, but due to a few factors this loss is generally not very noticeable.

Test 2

For the fun of it, I also did some packet capturing. From the viewer side, the following is an example where Seq 62850-62852 for SSRC 0x5EFD5FAE did not arrive initially:


    No.           UTC         Protocol   Length                                     Info                                    
 --------- ----------------- ---------- -------- -------------------------------------------------------------------------- 
  1029968   22:23:31.353413   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62848, Time=1046226000        
  1029969   22:23:31.353438   RTP          1437   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62849, Time=1046226000, Mark  

  1029989   22:23:31.473773   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62853, Time=1046229000        
  1029990   22:23:31.473806   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62854, Time=1046229000        

Then from the streamer's side, OBS logs show we got the NACK from do-sfo3-ingest1, and it re-transmitted those three missing packets. Side note: the client and server clocks are both in UTC but not quite in sync; that's pretty normal and does not hurt anything.

22:23:31.295: [3] [12677] resent sn 62850, request delay was 196 ms, was part of iframe? 0
22:23:31.295: [3] [12677] resent sn 62851, request delay was 197 ms, was part of iframe? 0
22:23:31.295: [3] [12677] resent sn 62852, request delay was 197 ms, was part of iframe? 0

Then a bit later, on the viewer side, we see the re-transmitted packets arrive just after Seq 62929, about 190ms after they originally would have arrived, which matches the OBS logs. In this case it is likely the re-transmit arrived in time and there were no lost frames, given my jitterBufferDelay/jitterBufferEmittedCount_in_ms was >400ms. However, the jitter buffer delay can vary quite a bit depending on the viewer's network conditions, so that won't always be true. One way to mitigate the variance would be to apply my playout-delay patch with a minimum playout delay of 400ms or more. That should be enough in cases like this to ensure the re-transmitted packets arrive in time.

    No.           UTC         Protocol   Length                                     Info                                    
 --------- ----------------- ---------- -------- -------------------------------------------------------------------------- 
  1030150   22:23:31.539324   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62928, Time=1046245500        
  1030151   22:23:31.539324   RTP           539   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62929, Time=1046245500, Mark  

  1030163   22:23:31.541572   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62850, Time=1046227500        
  1030164   22:23:31.541572   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62851, Time=1046227500        
  1030165   22:23:31.541572   RTP           724   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62852, Time=1046227500, Mark  

  1030175   22:23:31.568032   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62930, Time=1046247000        
  1030176   22:23:31.568067   RTP          1462   PT=DynamicRTP-Type-96, SSRC=0x5EFD5FAE, Seq=62931, Time=1046247000        

A future improvement would be to lower the re-transmit delay. This can easily be tracked in OBS logs by looking at the request delay in "resent" log lines. In my OBS logs the delay was in the 500-750ms range, which is probably too slow to be of any use, because the video will have already advanced past that frame by the time such a late re-transmit arrives.

I suspect the long re-transmit delay is mostly a problem for low-motion P-frames (~3 packets per frame) and audio (~1 packet per interval). That is because the current code waits until a certain number of newer packets (currently sixteen) have been seen before considering a packet lost, to allow for out-of-order packet arrival. My initial thought is to do a time-based calculation of when a packet is missing/late. Ideally we should be able to send the NACK and get the re-transmit in under 100ms: that allows for a 30ms reorder window + 15ms transmit latency + one 16.6ms frame at 60fps + a 30ms round trip on the re-transmit. However, the code for that calculation is a fair bit more complicated than "wait sixteen packets", so it's something to address in a future PR.
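A rough sketch of what that time-based check could look like, with hypothetical names and the ~30ms reorder window from the budget above:

```cpp
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

// Hypothetical record of a sequence-number gap, created the moment a
// newer packet reveals that older packets haven't arrived.
struct Gap
{
    uint16_t firstSeq;            // first missing sequence number
    uint16_t lastSeq;             // last missing sequence number
    Clock::time_point noticedAt;  // when a newer packet exposed the gap
};

// Instead of waiting for sixteen newer packets, declare the gap lost
// once a fixed reorder window has elapsed; past that window, the
// packets are almost certainly lost rather than merely reordered.
static bool ShouldNack(const Gap& gap, Clock::time_point now)
{
    constexpr auto kReorderWindow = std::chrono::milliseconds(30);
    return (now - gap.noticedAt) >= kReorderWindow;
}
```

For sparse streams like audio (~1 packet per interval), this triggers a NACK in tens of milliseconds instead of waiting the better part of a second for sixteen newer packets to accumulate.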

clone1018 commented 2 years ago

Sounds like a good approach to me. It is worth mentioning that SmashBets has notoriously finicky internet when it comes to FTL, with documented issues going back to the Mixer days. It's a good test, but they will sometimes have significantly more packet loss than this change can cover.

I have no problem deploying the playout-delay as well, but a combined rollout seems easiest from a downtime perspective. Do you have any tests you want to run on production specifically for playout-delay? We could always deploy it to one server for now and do a combined rollout later.

danstiner commented 2 years ago

Yeah, I worked with SmashBets some on their connection issues, but it looks like they are more stable at a higher resolution/bitrate than they could manage before, so that's good news! Even if it isn't perfect.

Sounds good on deployment, deploying playout-delay to just one edge to start with would be good. I opened a quick draft PR to document the steps needed to deploy / have a place to discuss deployment stuff: https://github.com/Glimesh/ops/pull/41