Slow ICE connection flow compared to libwebrtc

morajabi commented 1 year ago

It's not exactly an issue, but I want to start a discussion to find ways we can speed up the connection when there are multiple ICE candidates (i.e. ice_lite disabled). Although this benefits primarily the p2p use-case, but as previously mentioned I'd like to improve the ICE agent in str0m and we can start here. Because this is critical for our app.

Context: In a p2p application we'd loop over network interfaces, add each one as host and then start adding srflx and relay candidates. st0rm connects instantly if the host candidate works. But when over a network it seems like each added candidate that doesn't work adds delay to the connection. This delay is very noticeable when ICE agent needs to go over 4-5 candidate pairs to connect.

In my unscientific tests, I manually started a call via 2 libwebrtc peers and 2 str0m peers with the same STUN and TURN servers configured. str0m took 5x the time libwebrtc took to connect.

What do you think can be the issue? Are we going over candidates sequentially?

algesten commented 1 year ago

Thanks for raising this.

I don't think there's anything deliberately slowing things down. I think all pairs are tested at the same time.

To explain what's happening. As you know, you add ice candidates, both local and remote.

Local candidates are combined with remote candidates into pairs. Pairs are considers differently good, a direct host-host connection is better than going through turn servers.

Once a pair is formed, we start making STUN requests with that pair as sender/receiver.

If a STUN requests goes through and we receive an answer, the pair is a success. We nominate the pair as the active one.

The best prio successful pair "wins".

The easiest way to understand why this takes time is to turn on TRACE or add println.

pair.rs combines a pair of candidates. That's a good starting point to println and understand why this takes time.

Link me any code that doesn't make sense and i'll explain what it does.

morajabi commented 1 year ago

@algesten excited to work on debugging this, thanks for the info!

morajabi commented 11 months ago

I'll post updates as I test small changes and measure in production app

I started by reducing the MAX_RTO from 3000 to 1500, it helped as I was seeing success in 3-4 attempts.
I noticed the process after a nomination success -> connected state (after DTLS is established and all) is sometimes >%30 of the connection time (measured since first nominated pair).

algesten commented 11 months ago

Nice finds!

I started by reducing the MAX_RTO from 3000 to 1500, it helped as I was seeing success in 3-4 attempts.

Let's double check this against libWebRTC. I don't think there's a problem lowering it, but that also means more STUN packets being sent in a short time.

I noticed the process after a nomination success -> connected state (after DTLS is established and all) is sometimes >%30 of the connection time (measured since first nominated pair).

This could potentially be the certificate generation DtlsCert::new(). This is why @xnorpx made RtcConfig::set_dtls_cert() so that new certificates can be made ahead of time, or in another thread at the same time as starting the STUN negotiation.

morajabi commented 11 months ago

Let's double check this against libWebRTC.

Yes, I can't say for sure this helps until I check every other variable. I'm going to start with generated DTLS cert beforehand. Thanks! One question: is it safe to use one certificate for multiple connections or I should make a pool?

algesten commented 11 months ago

Thanks! One question: is it safe to use one certificate for multiple connections or I should make a pool?

They are strictly use once, or you're opening up a security hole. Hm. I see it's Clone. That's no good. I'll fix that now.

algesten commented 11 months ago

@morajabi here it is https://github.com/algesten/str0m/pull/415

pthatcher commented 11 months ago

Some things that libwebrtc does to connect fast that str0m should probably do:

Send an ICE check every 5ms
Send triggered checks (something in the spec; prioritize sending a check on a pair when you receive a check)
Send a check on pairs "most likely to work first" (pairs with relayed candidates) (not sure if this is on by default...)

thomaseizinger commented 9 months ago

A few more thoughts:

Candidate pairs on the same relay should be prioritzed over pairs on different relays to reduce latency (1 less hop). If we find a candidate pair on the same relay, we may even want to stop testing the others.
I think testing host <> relay candidates is very unlikely to be useful. We may even want to drop those candidate pairs altogether? TURN servers are deployed to the public internet, attempt to reach them via our host candidate is essentially the same as trying to reach them from our server-reflexive candidate. Unless I am missing something, host candidates are mostly useful for hole-punching & direct LAN connections so we should de-prioritize or even omit pairs of host <> relay.
It would be useful if str0m would eventually stop testing certain pairs once we have connectivity. Something like: Okay, we hole-punched a connection, lets keep another pair based on relay candidates active but stop testing the others? Or in the case of a relayed connection, keep another one as a backup (different IP) in case this TURN server becomes unavailable.

algesten commented 9 months ago

Candidate pairs on the same relay should be prioritzed over pairs on different relays to reduce latency (1 less hop). If we find a candidate pair on the same relay, we may even want to stop testing the others.

This would mean both sides effectively have the same IP address? Could that be generalised to "same IP" regardless of type of candidate?

I think testing host <> relay candidates is very unlikely to be useful. We may even want to drop those candidate pairs altogether? TURN servers are deployed to the public internet, attempt to reach them via our host candidate is essentially the same as trying to reach them from our server-reflexive candidate. Unless I am missing something, host candidates are mostly useful for hole-punching & direct LAN connections so we should de-prioritize or even omit pairs of host <> relay.

I'm probably missing something, but… our standard use case for an SFU, is a server with a public IP and clients behind NAT, firewalls etc. Wouldn't host <> relay be the most likely then? It's quite different to peer-peer.

Or taking a step back, why would removing any pairs be an advantage? Less noise?

It would be useful if str0m would eventually stop testing certain pairs once we have connectivity. Something like: Okay, we hole-punched a connection, lets keep another pair based on relay candidates active but stop testing the others? Or in the case of a relayed connection, keep another one as a backup (different IP) in case this TURN server becomes unavailable.

Sure. Let's discuss possible strategies on Zulip.

thomaseizinger commented 9 months ago

I think testing host <> relay candidates is very unlikely to be useful. We may even want to drop those candidate pairs altogether? TURN servers are deployed to the public internet, attempt to reach them via our host candidate is essentially the same as trying to reach them from our server-reflexive candidate. Unless I am missing something, host candidates are mostly useful for hole-punching & direct LAN connections so we should de-prioritize or even omit pairs of host <> relay.

I'm probably missing something, but… our standard use case for an SFU, is a server with a public IP and clients behind NAT, firewalls etc. Wouldn't host <> relay be the most likely then? It's quite different to peer-peer.

Hmm, several questions:

The local interface of your service is publicly reachable? That is a very unusual setup, no? On most cloud providers, your local address isn't your public address.
If your service is publicly reachable, you don't need any relays and a direct connection to the host / server-reflexive candidate will work.

Directly talking from a host candidate to a relay implies relay and your node are in the same subnet. If the client can reach the relay, it should also be able to reach the node.

I think what typically happens is that sending from a host candidate ends up being the same as sending from the server-reflexive candidate because your routes are configured to forward to the next router, out of your current subnet.

Perhaps one rule could be: If we discover a server-reflexive candidate that has another host candidate as the base, don't bother forming pairs for the host candidates?

thomaseizinger commented 9 months ago

Candidate pairs on the same relay should be prioritzed over pairs on different relays to reduce latency (1 less hop). If we find a candidate pair on the same relay, we may even want to stop testing the others.

This would mean both sides effectively have the same IP address? Could that be generalised to "same IP" regardless of type of candidate?

Yeah I think it is safe to assume that a relay doesn't share an IP with another service so same IP should mean same relay.

I am not sure generalising makes sense. Two nodes might have the same server-reflexive IP. That means they should be reachable via their host candidates.

thomaseizinger commented 9 months ago

Or taking a step back, why would removing any pairs be an advantage? Less noise?

Less noise
Less resources being consumed (i.e. TURN channels can be expired if we stop using them)

thomaseizinger commented 9 months ago

Relevant: https://github.com/algesten/str0m/pull/476

thomaseizinger commented 9 months ago

Relevant: #476

For anybody following along, the issue turned out to be a combination of:

Not refreshing our local timer from poll_timeout after changing the agent's state (i.e. adding candidate)
str0m having a minimum advance of 50ms in handle_timeout: https://github.com/algesten/str0m/pull/477

With both of these fixed, I am getting similar results as in #476: str0m needs about 350ms from the changing the state to Checking until the first pair is nominated. This is to a server in the US from Australia so with better latency, I'd assume it is even less.

algesten / str0m

Slow ICE connection flow compared to libwebrtc #405