algesten / str0m

A Sans I/O WebRTC implementation in Rust.
MIT License

str0m locally nominates host candidate, remote peer nominates srflx candidate #525

Closed by thomaseizinger 3 months ago

thomaseizinger commented 3 months ago

We've been running into a bug caused by invalidating remote candidates. I think it might be worth investigating but I also want to say that I've changed my mind on how I want roaming to work and we will adopt ICE restarts soon.

It might also not be a bug, but it is somewhat unexpected behaviour: the remote and local agents have different views of which candidates are nominated when it comes to host vs srflx candidates.

A STUN binding request sent from a host candidate arrives at the remote's srflx candidate and produces a response carrying the observed address. When processing the response, str0m uses the transaction ID to find the original candidate. Here is where the state disparity starts:

  1. The remote thinks it is talking to our srflx candidate.
  2. We think we are using our host candidate.

As long as candidates don't get invalidated, this is somewhat wrong but works fine, because a srflx candidate's base is the host candidate's address. Once I send all my "unused" candidates to the other party to invalidate them, the connection breaks, because among them is my srflx candidate, which the other party nominated.
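The disparity described above can be sketched with a toy agent (hypothetical types, not str0m's internals) that keys in-flight STUN requests by transaction ID and attributes the response to the candidate the request was sent from:

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Kind {
    Host,
    ServerReflexive,
}

// Hypothetical sketch: outstanding STUN requests are keyed by
// transaction ID, and a response is resolved back to the local
// candidate that *sent* the request.
struct Agent {
    // transaction ID -> index of the local candidate the request was sent from
    in_flight: HashMap<u64, usize>,
    local_candidates: Vec<Kind>,
}

impl Agent {
    fn on_response(&mut self, trans_id: u64) -> Option<&Kind> {
        let idx = self.in_flight.remove(&trans_id)?;
        self.local_candidates.get(idx)
    }
}

fn main() {
    let mut agent = Agent {
        in_flight: HashMap::from([(42, 0)]),
        local_candidates: vec![Kind::Host, Kind::ServerReflexive],
    };
    // The response for transaction 42 resolves to the host candidate
    // locally, even though the remote peer observed (and may nominate)
    // our srflx address.
    assert_eq!(agent.on_response(42), Some(&Kind::Host));
}
```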

thomaseizinger commented 3 months ago

I think the tricky part here is working out what we want to happen. Matching the transaction IDs for the response is obviously what we need to do. We also already record the candidate on which we received a response. It is stored in valid_idx, but this index isn't used for anything.

Should we be nominating that one instead of the candidate we originally sent from?

algesten commented 3 months ago

The way to order the candidates (calculating prio) is determined by the spec. That prio order should be enough to get connectivity. I can see how the invalidate remote candidate call can upset that in the situation you describe.
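For reference, the prio calculation from the spec (RFC 5245 §4.1.2.1) can be sketched as follows; the type-preference values are the RFC's recommendations, and this is illustrative rather than str0m's exact implementation:

```rust
// priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)
// Type preferences recommended by RFC 5245: host 126, prflx 110,
// srflx 100, relay 0.
#[derive(Clone, Copy)]
enum CandidateKind {
    Host,
    PeerReflexive,
    ServerReflexive,
    Relayed,
}

fn priority(kind: CandidateKind, local_pref: u32, component_id: u32) -> u32 {
    let type_pref: u32 = match kind {
        CandidateKind::Host => 126,
        CandidateKind::PeerReflexive => 110,
        CandidateKind::ServerReflexive => 100,
        CandidateKind::Relayed => 0,
    };
    (type_pref << 24) + (local_pref << 8) + (256 - component_id)
}

fn main() {
    // A host candidate always outranks a srflx candidate with the same
    // local preference, which is why prio order alone normally gives
    // the expected nomination.
    let host = priority(CandidateKind::Host, 65535, 1);
    let srflx = priority(CandidateKind::ServerReflexive, 65535, 1);
    assert!(host > srflx);
    println!("host = {host}, srflx = {srflx}");
}
```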

If you are not continuing the invalidate remote route, I would rather drop the API than try to fix it. Does that make sense?

thomaseizinger commented 3 months ago

If you are not continuing the invalidate remote route, I would rather drop the API than try to fix it. Does that make sense?

Yeah I can understand that. Still, I think there is a bug currently that might surface again in other situations. str0m thinks it is sending via the host candidate when in reality, it is connected via its server-reflexive candidate.

thomaseizinger commented 3 months ago

str0m thinks it is sending via the host candidate when in reality, it is connected via its server-reflexive candidate.

In particular, the PairId that gets nominated will point to the host candidate but has its valid_idx set to the server-reflexive candidate. That valid_idx is never used for anything which appears to be a bit dodgy too because the comment referenced some "valid candidate" logic that I can't find anywhere.

algesten commented 3 months ago

Googling around a bit, I think the behavior is correct. The spec says https://www.rfc-editor.org/rfc/rfc5245#section-5.7.3

Since an agent cannot send requests directly from a reflexive candidate, but only from its base, the agent next goes through the sorted list of candidate pairs. For each pair where the local candidate is server reflexive, the server reflexive candidate MUST be replaced by its base.

In some respects I don't think server reflexive candidates should be added to the local ICE agent at all. It should only know about the host candidates.
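The "replace by its base, then prune" step from the quoted section can be sketched like this (hypothetical types, not str0m's actual form_pairs):

```rust
use std::net::SocketAddr;

// Sketch of RFC 5245 §5.7.3: before connectivity checks, each local
// srflx candidate in a pair is replaced by its base (the host address
// it was discovered from), and pairs that become identical are pruned,
// keeping the highest-priority one.
#[derive(Clone, PartialEq)]
struct Pair {
    local: SocketAddr, // address checks are actually sent from
    remote: SocketAddr,
    prio: u64,
}

fn replace_and_prune(
    mut pairs: Vec<Pair>,
    srflx_base: impl Fn(SocketAddr) -> SocketAddr,
) -> Vec<Pair> {
    for p in &mut pairs {
        p.local = srflx_base(p.local); // host addresses map to themselves
    }
    // Sort by priority descending, then keep the first of each
    // (local, remote) combination.
    pairs.sort_by(|a, b| b.prio.cmp(&a.prio));
    let mut kept: Vec<Pair> = Vec::new();
    for p in pairs {
        if !kept.iter().any(|k| k.local == p.local && k.remote == p.remote) {
            kept.push(p);
        }
    }
    kept
}

fn main() {
    let host: SocketAddr = "192.168.1.2:5000".parse().unwrap();
    let srflx: SocketAddr = "1.2.3.4:5000".parse().unwrap();
    let remote: SocketAddr = "9.9.9.9:4000".parse().unwrap();
    let pairs = vec![
        Pair { local: host, remote, prio: 200 },
        Pair { local: srflx, remote, prio: 100 },
    ];
    // After replacing srflx by its base, both pairs are identical and
    // only the higher-prio one survives.
    let kept = replace_and_prune(pairs, |a| if a == srflx { host } else { a });
    assert_eq!(kept.len(), 1);
    assert_eq!(kept[0].prio, 200);
}
```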

Maybe we split the valid_idx issue off into a separate one?

thomaseizinger commented 3 months ago

Since an agent cannot send requests directly from a reflexive candidate, but only from its base, the agent next goes through the sorted list of candidate pairs. For each pair where the local candidate is server reflexive, the server reflexive candidate MUST be replaced by its base.

In some respects I don't think server reflexive candidates should be added to the local ICE agent at all. It should only know about the host candidates.

But instead only send them as a candidate to the remote peer?

algesten commented 3 months ago

But instead only send them as a candidate to the remote peer?

Yes!

thomaseizinger commented 3 months ago

But instead only send them as a candidate to the remote peer?

Yes!

I hadn't really considered this but I think this could be a solution too, let me try.

If I never add the server-reflexive candidate locally, only ever the host candidate can be nominated so it would be unambiguous.
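The approach discussed above can be sketched with hypothetical types (not str0m's actual API): only host candidates live in the local agent, while the server-reflexive address is merely advertised to the remote peer.

```rust
use std::net::SocketAddr;

// Hypothetical sketch: host candidates are both stored locally and
// advertised; srflx addresses are only advertised. Since `local` never
// contains a srflx address, only a host candidate can be the local
// side of a nominated pair.
struct LocalAgent {
    local: Vec<SocketAddr>,      // candidates the agent pairs and nominates
    advertised: Vec<SocketAddr>, // addresses signalled to the remote peer
}

impl LocalAgent {
    fn new() -> Self {
        LocalAgent { local: Vec::new(), advertised: Vec::new() }
    }

    fn add_host(&mut self, addr: SocketAddr) {
        self.local.push(addr);
        self.advertised.push(addr);
    }

    // Signalled to the remote but never added to `local`, so it cannot
    // become the local side of a nominated pair.
    fn advertise_server_reflexive(&mut self, addr: SocketAddr) {
        self.advertised.push(addr);
    }
}

fn main() {
    let mut agent = LocalAgent::new();
    agent.add_host("192.168.1.2:5000".parse().unwrap());
    agent.advertise_server_reflexive("1.2.3.4:5000".parse().unwrap());
    assert_eq!(agent.local.len(), 1);
    assert_eq!(agent.advertised.len(), 2);
}
```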

algesten commented 3 months ago

It probably works.

I think with WebRTC we are adding them locally as well. However, the treatment of them wrt the spec is at least directionally correct, though we are not doing the literal "replace by its base" – there could be some inefficiency here.

thomaseizinger commented 3 months ago

In some respects I don't think server reflexive candidates should be added to the local ICE agent at all.

Should the agent reject adding those as local candidates? Or do you think that goes too far?

algesten commented 3 months ago

I was just pondering that.

I wonder if there is a case where one would add only server-reflexive candidates and no host candidates. I think on balance we should allow it. Looking at form_pairs, it seems we at least attempt to avoid forming redundant pairs when the bases are equivalent.

thomaseizinger commented 3 months ago

I wonder if there is a case where one would add only server-reflexive candidates and no host candidates.

Did you ever consider an API where candidates are created on an agent? Like:

impl IceAgent {
    fn new_host_candidate(&mut self, addr: SocketAddr, protocol: Protocol) -> Candidate;
    fn new_relay_candidate(&mut self, addr: SocketAddr, protocol: Protocol) -> Candidate;
    fn new_server_reflexive_candidate(&mut self, addr: SocketAddr, base: SocketAddr, protocol: Protocol) -> Candidate;
}

That would give us more control over where they are stored internally, and we could, for example, emit a warning if somebody adds a server-reflexive candidate without also adding a host candidate for the same base.

algesten commented 3 months ago

I guess I never thought much about the ICE agent in isolation from the RTC stuff. Rtc::add_local_candidate() vs Rtc::add_remote_candidate().

algesten commented 3 months ago

Maybe rather than invalidate_candidate you want invalidate_base?

thomaseizinger commented 3 months ago

Maybe rather than invalidate_candidate you want invalidate_base?

I'll try and implement ICE restarts first and see where that leads us!

thomaseizinger commented 3 months ago

Closing this as a non-issue. Should we add a warning to add_local_candidate if we encounter a server-reflexive candidate?

algesten commented 3 months ago

Should we add a warning to add_local_candidate if we encounter a server-reflexive candidate?

I think the current behavior is correct. For the Rtc case, it would be quite common to throw in any local candidate. This code is where we ensure that we don't form redundant pairs. If you add both the host and the server-reflexive candidate, you'll end up with only the higher-prio one as a pair.

If you add one, start communicating, and later add the other, the nominated pair swaps out and it still works, because the abstraction isn't telling you which exact candidate the agent is using but rather gives instructions on how to communicate ("send from this address to that address").
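That "instructions, not candidates" abstraction can be sketched as follows (hypothetical types, loosely inspired by str0m's NominatedSend but not its actual API):

```rust
use std::net::SocketAddr;

// Hypothetical sketch: the agent hands the application a "send from
// this address to that address" instruction instead of exposing which
// candidate is nominated, so the nominated pair can swap underneath
// without the application changing behaviour.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SendInstruction {
    from: SocketAddr, // local base address to send from
    to: SocketAddr,   // remote address to send to
}

fn transmit(instr: SendInstruction, payload: &[u8]) -> (SocketAddr, SocketAddr, usize) {
    // A real application would bind a UdpSocket to `instr.from` and
    // call send_to(payload, instr.to); here we just report the routing
    // decision and payload size.
    (instr.from, instr.to, payload.len())
}

fn main() {
    let base: SocketAddr = "192.168.1.2:5000".parse().unwrap();
    let remote: SocketAddr = "9.9.9.9:4000".parse().unwrap();
    // Whether the nominated pair uses the host or the srflx candidate,
    // the instruction sends from the same base address.
    let instr = SendInstruction { from: base, to: remote };
    assert_eq!(transmit(instr, b"hello"), (base, remote, 5));
}
```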

I'd like to document this better. Given the journey you guys have been on, there is clearly something missing in communicating how the agent works. Maybe it's just better docs on NominatedSend, but it could also be on the whole module, explaining its operation.

algesten commented 3 months ago

Also. There has been a lot of rediscovering my own thoughts here. I simply don't recall how I arrived where it is today. Getting old I suppose :(

thomaseizinger commented 3 months ago

I'd like to document this better. Given the journey you guys have been on, there is clearly something missing in communicating how the agent works. Maybe it's just better docs on NominatedSend, but it could also be on the whole module, explaining its operation.

Some of this is definitely my fault in just trying to use IceAgent without learning about ICE properly first :) For example, the whole idea of invalidating individual candidates was a pretty wild-goose chase :D I thought I could be clever and avoid ICE restarts, but in reality there are so many combinations of network changes that they are way too hard to cover if you try to selectively "patch" your local state. Ultimately, you need to start gathering new candidates and do the selection process all over again to get a working connection.

Some peer-to-peer ICE docs or an example probably wouldn't hurt though.