algesten / str0m

A Sans I/O WebRTC implementation in Rust.
MIT License

fix(ice): do not disconnect whilst we still check new candidates #489

Closed: thomaseizinger closed this 1 month ago

thomaseizinger commented 6 months ago

If the active candidate pair gets invalidated, str0m currently emits a Disconnected event immediately. This happens, for example, when a candidate pair with a peer-reflexive candidate gets nominated and later replaced. The result is a Disconnected event despite us not actually being disconnected (a peer-reflexive candidate that gets replaced is, after all, the exact same socket).

To avoid this false-positive disconnection, we transition back to Checking in case we are still evaluating other candidates.
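
Roughly, the intended decision looks like the sketch below (simplified stand-in types for illustration, not str0m's internal code):

```rust
// Illustrative sketch only: simplified stand-in types, not str0m's internals.
#[derive(Debug, PartialEq)]
enum IceConnectionState {
    Checking,
    Disconnected,
}

/// Decide the next state after the nominated candidate pair has been
/// invalidated (e.g. a peer-reflexive candidate was replaced by a regular one).
fn state_after_pair_invalidated(pairs_still_being_checked: usize) -> IceConnectionState {
    if pairs_still_being_checked > 0 {
        // Other pairs are still under evaluation: keep probing instead of
        // reporting a false-positive disconnection.
        IceConnectionState::Checking
    } else {
        // Nothing left to try: now we really are disconnected.
        IceConnectionState::Disconnected
    }
}

fn main() {
    assert_eq!(state_after_pair_invalidated(1), IceConnectionState::Checking);
    assert_eq!(state_after_pair_invalidated(0), IceConnectionState::Disconnected);
}
```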

thomaseizinger commented 1 month ago

@algesten What is your appetite for merging this? I am hitting this again now in the context of a peer-reflexive candidate getting replaced with a regular one.

If the peer-reflexive candidate was already nominated, we flip to Disconnected even though we are still checking the new candidate.

I remember that your latest stance was that we should "wait" a bit before acting on Disconnected to see if we recover from it. Whilst doable, it is a bit annoying to implement and in the case of a real disconnection, harms the user experience because we can't give feedback until the additional timer expires.

The consequence of this behaviour is a flaky integration test on our end:

algesten commented 1 month ago

> Whilst doable, it is a bit annoying to implement and in the case of a real disconnection, harms the user experience because we can't give feedback until the additional timer expires.

Does your solution deal with road warriors on mobile data? If so, you could easily be in a situation where you reach Disconnected without being disconnected (and it coming back). We see the exact same thing when implementing browser WebRTC clients. Knowing that something is definitely disconnected is not something ICE can help you with. It's better to go "Reconnecting…" after some timeout to avoid flapping.

(from #486)

> Currently, we treat Disconnected as a hard error and clean up our connection state. That is simple to understand which is why I'd like to keep it.

My worry here is that your need comes from a design decision where you want to treat Disconnected as a hard error when that's not what it is even in the ICE spec. I don't think we should land functionality to try and fix something that goes against how ICE works.

My preference is to stick close to WebRTC, but if it's super important to you, I don't think this PR would break anything, so we could potentially land it.

xnorpx commented 1 month ago

For example, in our codebase we let the connection stick around for 30 seconds when we get an ICE disconnect, in case the client shows up again.

If our clients disconnect through other means, we tear down the connections directly, but not on an ICE disconnect.
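
A minimal sketch of that grace-period policy at the application layer (all names and the timer shape here are ours, not part of str0m's API; the 30-second figure is the one mentioned above):

```rust
use std::time::{Duration, Instant};

// Application-side sketch; none of these names come from str0m.
const DISCONNECT_GRACE: Duration = Duration::from_secs(30);

#[derive(Default)]
struct Connection {
    disconnected_since: Option<Instant>,
}

impl Connection {
    /// Called when ICE reports Disconnected: start (or keep) the grace timer.
    fn on_ice_disconnected(&mut self, now: Instant) {
        self.disconnected_since.get_or_insert(now);
    }

    /// Called when ICE recovers (Checking / Connected / Completed again).
    fn on_ice_recovered(&mut self) {
        self.disconnected_since = None;
    }

    /// Polled periodically: only tear down once the grace period has elapsed.
    fn should_tear_down(&self, now: Instant) -> bool {
        self.disconnected_since
            .map(|since| now.duration_since(since) >= DISCONNECT_GRACE)
            .unwrap_or(false)
    }
}

fn main() {
    let mut conn = Connection::default();
    let t0 = Instant::now();
    conn.on_ice_disconnected(t0);
    assert!(!conn.should_tear_down(t0 + Duration::from_secs(5)));
    conn.on_ice_recovered(); // the client showed up again within the grace period
    assert!(!conn.should_tear_down(t0 + Duration::from_secs(60)));
}
```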

thomaseizinger commented 1 month ago

> Whilst doable, it is a bit annoying to implement and in the case of a real disconnection, harms the user experience because we can't give feedback until the additional timer expires.

> Does your solution deal with road warriors on mobile data? If so, you could easily be in a situation where you reach Disconnected without being disconnected (and it coming back). We see the exact same thing when implementing browser WebRTC clients. Knowing that something is definitely disconnected is not something ICE can help you with. It's better to go "Reconnecting…" after some timeout to avoid flapping.

100%. Reliable roaming is something that we really care about. Especially because for relayed connections, deploying new relays is basically like roaming so we need this to work reliably to not interrupt our users every time we deploy. The way it works is that the client applications listen for system-specific events whenever network interfaces change. When that happens, we reset all connections and establish new ones, i.e. we discard all IceAgents and rebind our sockets to new ports (and potentially new interfaces).

Because our application is an IP-tunnel, this means the layered connection is essentially "reconnecting" (i.e. TCP packets will start to get retransmitted by the kernel etc). We use those packets as "intents" that we need to set up a new ICE connection.

> Currently, we treat Disconnected as a hard error and clean up our connection state. That is simple to understand which is why I'd like to keep it.

> My worry here is that your need comes from a design decision where you want to treat Disconnected as a hard error when that's not what it is even in the ICE spec. I don't think we should land functionality to try and fix something that goes against how ICE works.

I don't think it is against the ICE spec per se. Admittedly, I've not dealt with it as much as you have but in my experience, ICE isn't very good at repairing connection state during roaming or when network conditions actually change, at least not without being told about it. An ICE restart is basically what we are doing except that we just make an entire new agent to begin with. What happens in most cases is that the network link is dead and you can only detect it because X STUN messages have timed out.

For those cases where it is really just intermittent packet loss, the bit that we actually care about is how long we tolerate it, which brings us to this:

> For example, in our codebase we let the connection stick around for 30 seconds when we get an ICE disconnect, in case the client shows up again.

The bit that I don't understand with this approach is: What is the difference between the configured STUN timeout + adding 30 seconds on top vs re-configuring the IceAgent timeouts to give you the duration you want?

str0m's defaults are 9 STUN messages with a 3s interval, and after 50% missing responses we probe more quickly (or something like that). Effectively, it results in a timeout of ~12s I think. For us, that is a bit too long, so we tweak those values in our application. If you can tolerate a longer partition, then you can tweak them upwards and allow up to 15 or 20 missed STUN requests.

Eventually though, there needs to be a timeout. It seems redundant to me to re-implement this on top of str0m when str0m has it already built in.
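
As a back-of-the-envelope illustration of that point, the effective disconnect timeout is roughly the sum of the retransmit intervals, so tolerating a longer partition is a matter of raising those values rather than layering another timer on top. The struct and the speed-up rule below are assumptions for the sake of the arithmetic, not str0m's actual configuration API:

```rust
use std::time::Duration;

// Hypothetical knobs, for illustration only; str0m's real ice timeout config
// may use different names and rules.
struct StunTimingSketch {
    max_requests: u32,          // e.g. 9 by default, per the discussion above
    normal_interval: Duration,  // e.g. 3s
    fast_interval: Duration,    // assumed faster probing once responses go missing
}

impl StunTimingSketch {
    /// Rough upper bound on how long it takes to declare the pair dead:
    /// assume the first half of the requests go out at the normal interval
    /// and the second half at the faster one.
    fn approx_disconnect_timeout(&self) -> Duration {
        let half = self.max_requests / 2;
        self.normal_interval * half + self.fast_interval * (self.max_requests - half)
    }
}

fn main() {
    let defaults = StunTimingSketch {
        max_requests: 9,
        normal_interval: Duration::from_secs(3),
        fast_interval: Duration::from_millis(500),
    };
    // With these assumed numbers this lands in the low teens of seconds,
    // roughly the ~12s ballpark mentioned above.
    println!("{:?}", defaults.approx_disconnect_timeout());

    // Tolerating a longer partition: allow more missed requests.
    let tolerant = StunTimingSketch { max_requests: 20, ..defaults };
    println!("{:?}", tolerant.approx_disconnect_timeout());
}
```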


The main thing that this PR changes is the case where the nominated candidate pair gets invalidated whilst we are still checking other candidates: then we transition back to Checking instead of Disconnected. I'd argue that is reasonable because it means str0m will operate more deterministically and not emit Disconnected as a result of slight timing differences in how STUN requests vs candidates arrive.

thomaseizinger commented 1 month ago

If this PR is not desired, would you accept a PR that exposes a more fine-grained state on the IceAgent that allows me to check whether or not we are still testing candidates? Then I could implement the check myself and wait for Completed before acting on the Disconnected event.
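
For illustration, that caller-side workaround could look roughly like this; `still_checking_candidates` stands in for the hypothetical fine-grained accessor being requested here, which str0m does not necessarily expose:

```rust
// Caller-side sketch of the proposed workaround; nothing here is str0m API.
struct DisconnectGate {
    saw_disconnected: bool,
}

impl DisconnectGate {
    /// Record a Disconnected event instead of acting on it immediately.
    fn on_disconnected(&mut self) {
        self.saw_disconnected = true;
    }

    /// Clear the pending disconnect if the agent recovers.
    fn on_connected(&mut self) {
        self.saw_disconnected = false;
    }

    /// Only treat the disconnect as fatal once no candidate checks remain.
    fn should_tear_down(&self, still_checking_candidates: bool) -> bool {
        self.saw_disconnected && !still_checking_candidates
    }
}

fn main() {
    let mut gate = DisconnectGate { saw_disconnected: false };
    gate.on_disconnected();
    assert!(!gate.should_tear_down(true)); // still checking other candidates
    gate.on_connected();                   // a new pair got nominated
    assert!(!gate.should_tear_down(false));
}
```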

algesten commented 1 month ago

Let's land this PR. I think it's ok.

thomaseizinger commented 1 month ago

> Let's land this PR. I think it's ok.

Awesome! Would you like me to add a test for it?

Also, I am switching our dependency over to the fork to test-drive this a bit more. We used to have this change in our fork and the tests were pretty stable, so I am fairly confident the flakiness came from switching to str0m mainline a couple of days ago.

Better be sure though. Will ping once ready!

xnorpx commented 1 month ago

Having an integration test that validates the disconnect would be nice, together with an example of how to configure the timeouts.

thomaseizinger commented 1 month ago

> Having an integration test that validates the disconnect would be nice, together with an example of how to configure the timeouts.

I added a failing test. The setup is a bit peculiar because we need an additional set of candidates. It should demonstrate the problem nicely though.

Regarding the documentation, I think we may need an additional PR to expose the ice timeout config added in https://github.com/algesten/str0m/pull/537 on the RtcConfig before it can be used there. Unless you are using the IceAgent directly too; then you can just use the exposed config options.

algesten commented 1 month ago

Looks good! Let's merge it!