When would we want this to be `None`?
You mean, why not just use a value bigger than 1.0? It just seems to be a more obvious off switch. I don't mind changing this if you want.
Or do you mean, why would anyone want to disable this functionality? Well, I'm considering disabling it, since it's up to the receiver to decide whether it needs some specific packet or not; however, I'm still testing this.
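To make that concrete, here is a minimal sketch of the two shapes being discussed (the `RtxConfig` name and field are made up for illustration, not str0m's actual API): an `Option<f32>` encodes the off switch in the type, whereas a sentinel value above 1.0 relies on convention.

```rust
// Hypothetical config sketch for illustration; not str0m's actual API.
struct RtxConfig {
    /// Maximum share of outgoing bandwidth spent on retransmissions.
    /// `None` disables the cap entirely; `Some(0.15)` matches the current default.
    rtx_ratio_cap: Option<f32>,
}

impl RtxConfig {
    /// With `Option`, "no cap" is explicit in the type rather than encoded
    /// as a sentinel value bigger than 1.0.
    fn allows_rtx(&self, current_rtx_ratio: f32) -> bool {
        match self.rtx_ratio_cap {
            Some(cap) => current_rtx_ratio <= cap,
            None => true,
        }
    }
}
```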
It is not expected that a receiver will send a NACK for every lost RTP packet; rather, it needs to consider the cost of sending NACK feedback and the importance of the lost packet to make an informed decision on whether it is worth telling the sender about a packet-loss event.
@davibe can provide more context, but I think this was put into place for really bad connections, where we risk crowding the network with resends rather than real packets. I think there's definitely a case where the client keeps nacking so much that it makes no sense to just try and fulfill that.
I think there's definitely a case where the client keeps nacking so much that it makes no sense to just try and fulfill that.
Sure, I can imagine a situation where the receiver has a good uplink, so NACKs make it to the server, but a bad downlink, so both TXs and RTXs are getting lost. But there are also cases where the receiver can stretch the jitter buffer to a few seconds in non-delay-sensitive scenarios.
And I don't really understand why you would want to forbid this option. If this PR allows configuring the cap, then why not? Do you want to assert that it's in, let's say, the [0.15..0.5] range?
Packets can be lost due to bandwidth constraints (or arrive too late, which amounts to the same thing). Retransmitting "too much" uses additional bandwidth, which further exceeds the connection's capacity and causes even more loss. This can also happen in scenarios that are not delay-sensitive, because bandwidth may remain insufficient indefinitely.
In our internal products - without this arbitrary 0.15 limit - we found that varying network conditions would get the peer stuck in a loop where str0m would continuously flood them with retransmissions and big keyframes while they kept nacking and sending PLIs as they were unable to recover.
Maybe what we need is to evolve this 0.15 limit into something more elaborate that takes more things into account (what is the bandwidth of the peer, how much are we overusing, are we about to send a keyframe, ...). In the absence of more elaborate logic, configurability may be useful to some.
I thought 15% was quite a high limit, but public wifis may surprise. If a network has more than 15% loss but can still sustain high speeds, then 15% is not the best option. This is a dynamic, per-peer consideration. This PR seems to allow changing the limit dynamically via the direct API, but it does not handle the change smoothly (the RTX cache is dropped on reconfiguration). Maybe it's enough for now.
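For context, the kind of check the 0.15 limit implements could be sketched roughly like this (names and structure are illustrative only, not the actual str0m code): compare retransmitted bytes to total sent bytes over a recent window and only serve resends while under the cap.

```rust
// Illustrative sketch of a retransmission ratio cap; not str0m's implementation.
struct SendWindowStats {
    media_bytes: u64, // regular media sent in the recent window
    rtx_bytes: u64,   // retransmissions sent in the same window
}

fn should_serve_resend(stats: &SendWindowStats, cap: f32) -> bool {
    let total = stats.media_bytes + stats.rtx_bytes;
    if total == 0 {
        return true;
    }
    let rtx_ratio = stats.rtx_bytes as f32 / total as f32;
    // With cap = 0.15, at most ~15% of recently sent bytes may be retransmissions;
    // resend requests beyond that are dropped instead of served.
    rtx_ratio < cap
}
```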
In our internal products - without this arbitrary 0.15 limit - we found that varying network conditions would get the peer stuck in a loop where str0m would continuously flood them with retransmissions and big keyframes while they kept nacking and sending PLIs as they were unable to recover.
Yeah, that looks like a bandwidth issue. Mobile networks maybe? My issue with the 0.15 cap is actually normal-bandwidth clients that still have sporadic bursts of ~80-180 packets lost for a single RTP stream. Increasing the cap makes it so there is no PLI, just a minor ~100ms freeze. And for low bandwidth there is SVC, which does its job pretty well.
I thought 15% was quite a high limit, but public wifis may surprise. If a network has more than 15% loss but can still sustain high speeds, then 15% is not the best option.
Well, it depends on the time frame. From what I see, the average packet loss might stay under 1%, but the maximum over 1 second might sometimes exceed 50%.
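A quick illustration of that difference, with made-up numbers (purely hypothetical, just to show how a short burst hides inside a long average):

```rust
// Made-up numbers, just to show how a short burst hides inside a long average.
fn loss_fraction(lost: u32, expected: u32) -> f32 {
    if expected == 0 {
        0.0
    } else {
        lost as f32 / expected as f32
    }
}

fn main() {
    // 60 seconds at 200 packets/s; a single bad second loses 110 packets.
    let mut lost_per_second = vec![0u32; 60];
    lost_per_second[30] = 110;
    let expected_per_second = 200u32;

    let worst_window = lost_per_second
        .iter()
        .map(|&l| loss_fraction(l, expected_per_second))
        .fold(0.0f32, f32::max);
    let average = loss_fraction(
        lost_per_second.iter().sum(),
        expected_per_second * lost_per_second.len() as u32,
    );

    // Prints roughly: average 0.9%, worst 1s window 55.0%
    println!(
        "average {:.1}%, worst 1s window {:.1}%",
        average * 100.0,
        worst_window * 100.0
    );
}
```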
UPD:
I guess that in the specific case you have described, it would make sense to clear not only the resends queue but also the `RtxCache`, if you really want to completely reset the receiver, leaving it with no other option but to request a keyframe.
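Roughly something like this is what I mean; the types and method names below are invented for illustration and are not str0m's actual internals:

```rust
// Hypothetical sketch only; the types and method names are invented and do not
// reflect str0m's actual internals.
use std::collections::VecDeque;

struct RtxCache {
    // stands in for the real cache of previously sent packets
    cached: VecDeque<Vec<u8>>,
}

struct StreamTx {
    resends: VecDeque<u64>, // queued sequence numbers awaiting retransmission
    rtx_cache: RtxCache,
}

impl StreamTx {
    /// On cap reconfiguration, drop both the pending resends and the cached
    /// packets, so the receiver's only remaining option is to request a keyframe.
    fn reset_rtx_state(&mut self) {
        self.resends.clear();
        self.rtx_cache.cached.clear();
    }
}
```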
@davibe, are there still any issues with this PR?
Thanks for the reminder. I'll do a final review later today.
Thanks!
https://github.com/algesten/str0m/pull/566#issuecomment-2377171154