algesten / str0m

A Sans I/O WebRTC implementation in Rust.
MIT License
335 stars 50 forks source link

Fix RTX stops working after packet loss spike #566

Closed alexlapa closed 2 months ago

alexlapa commented 2 months ago

So i've noticed remote receivers sending PLIs too often even after minor packet loss. Debugging NACKs and RTXs showed that everything actually works fine for some time and at some point receivers start flooding SFU with NACKs for that same packets again and again and no RTXs are being sent at all which results in receiver ending up with a PLI request.

So whats going on is:

  1. At some point packet loss spikes and RTX ratio exceeds 15%.
  2. This results in poll_packet_resend always returning None because of this
  3. Since no resends are made it never get to this point, so RTX ratio is not being recalculated here.
  4. At this point it is basically stuck with RTX ratio locked at > 0.15 with the only option of it being fixed is if bytes transmitted sum becoming high enough which does not normally happen.

So i believe that ratio recalculation is supposed to happen during the ValueHistory::sum() call.

And here is before and after videos of how that works:

Before [before.webm](https://github.com/user-attachments/assets/0da802e3-16cd-44ad-a86e-819c4a3aafae)
After [after.webm](https://github.com/user-attachments/assets/d755d152-e803-49bc-a4af-8794203b228d)

On the after video tx sum drop is when SVC kicks in so ignore that.

Another thing is it would be great to be able to configure that 0.15 RTX cache drop ratio, mind if i do it in another PR?Probably as an additional argument to StreamTx::set_rtx_cache, so it will be fn set_rtx_cache(&mut self, max_packets: usize, max_age: Duration, rtx_cache_drop_ratio: Option<f32>) with None completely disabling this mechanic.

algesten commented 2 months ago

@alexlapa great find! Super nice improvement!

For sanity, I think we should change the impl a bit differently. It seems intuitive that the ValueHistory would be pruned on push, which already is a &mut and that's where I'd expect it to happen. However, as you shown, that's not a good idea.

Having it prune on sum() seems a little less intuitive. You could maybe argue that you want "a sum up until this point", but to me "calculating a sum" does not sound like a mutation.

I would prefer you drop the pruning from both push and sum, and expose drain, to make it explicit at which point this happens.

I'd also rename and change the signature of drain (since that has a specific meaning in rust std api). Something like purge or purge_old. It should also not return Option<()> since that value is of no consequence to the caller, it's simply there to allow one use of ?.

alexlapa commented 2 months ago

@algesten ,

And what your opinion on this?

Another thing is it would be great to be able to configure that 0.15 RTX cache drop ratio, mind if i do it in another PR? Probably as an additional argument to StreamTx::set_rtx_cache, so it will be fn set_rtx_cache(&mut self, max_packets: usize, max_age: Duration, rtx_cache_drop_ratio: Option) with None completely disabling this mechanic.

algesten commented 2 months ago

@alexlapa i like it. There's some overlap with functions that can disable RTX depending on the SDP negotiation, but I think it's worth a PR.