Better handling of discarded candidates

thomaseizinger commented 5 months ago

We running into some state mismatch bugs with discarded candidates. Original issue is: https://github.com/firezone/firezone/issues/4736.

What seems to happen is that, after a candidate is discarded, it cannot be added again because it is considered redundant although the old candidate is marked as discarded and thus excluded from certain "searches" for candidates.

For example, in our code, we try to find the nominated candidate after it has been nominated which currently results in a panic because there we filter for discarded candidates.

I think the root cause for this (as touched on in https://github.com/algesten/str0m/pull/497) is that Candidate is stateful. Couple of questions:

I was under the impression that a discarded candidate would never be nominated? That seems to be happening at the moment.
I think re-adding a candidate after it has been discarded should "activate" it again.
We use indices to candidates in a lot of places at the moment, hence we cannot remove discarded candidates from the list. Would you be okay with adding a dependency like slab to have stable indices and we able to actually remove candidates instead of making them stateful? I have a feeling this state management is a lost battle and will trigger more bugs.

algesten commented 5 months ago

I was under the impression that a discarded candidate would never be nominated? That seems to be happening at the moment.

Yes, that seems to be what's happening.

I think re-adding a candidate after it has been discarded should "activate" it again.

Agree. It's a scenario I haven't anticipated, but I don't think there should be any problem removing the discarded state and re-run the following code to make it appear in candidate pairs again.

We use indices to candidates in a lot of places at the moment, hence we cannot remove discarded candidates from the list. Would you be okay with adding a dependency like slab to have stable indices and we able to actually remove candidates instead of making them stateful? I have a feeling this state management is a lost battle and will trigger more bugs.

I don't see how Slab will help us here. For better or worse, IceAgent is built around indexes and whether that index points to an empty slab slot or a Candidate with a discarded state flag, I think will end up the same roughly the same.

I think an alternative impl would use LinkedList<Candidate> and assign an ID or even use Arc<Candidate> so that a CandidatePair can keep a direct reference. However I chose this path due to what I hoped would be performant (doing direct index lookups).

I don't think this bug will be that hard to fix, and it affects the way we use str0m too. I'll have a look now.

thomaseizinger commented 5 months ago

I don't see how Slab will help us here. For better or worse, IceAgent is built around indexes and whether that index points to an empty slab slot or a Candidate with a discarded state flag, I think will end up the same roughly the same.

I thought the issue with removing candidates from the list is that the remaining indices would be invalidated because the others are moved forward in memory.

With a slab (afaik), we can add and remove items without invalidating other indices. Invalidating other things like candidate pairs when a candidate is discarded is orthogonal to that and has to happen already today.

The difference with a slab would be that we can actually remove this candidates and don't have to filter them.

algesten commented 5 months ago

The difference with a slab would be that we can actually remove this candidates and don't have to filter them.

Sure, but that seems to be more of a convenience issue, i.e. list.iter().filter(|c| !c.discarded) vs slab.iter(), which is already solved in the current impl.

thomaseizinger commented 5 months ago

The difference with a slab would be that we can actually remove this candidates and don't have to filter them.

Sure, but that seems to be more of a convenience issue, i.e. list.iter().filter(|c| !c.discarded) vs slab.iter(), which is already solved in the current impl.

It is easy to forget to do this filtering though as we can see from this bug so I am not sure labelling it a convenience issue is accurate.

Isn't it also technically a memory-leak to never free up candidates internally dispite the user calling invalidate_candidate?

algesten commented 5 months ago

Here's another exciting aspect of manually discarding candidates.

https://github.com/algesten/str0m/blob/a3e0f13744ef8b0681bd34b97298dbe72085878a/src/ice/agent.rs#L472-L531

Add candidate A
Add candidate B. This auto-discards A in the above logic
Manually discard candidate B
What do we do with A?

As it stands, we do nothing with A, and I wonder if that is good or not?

invalidate_candidate is our own API, and the spec compliant way is ice_restart, so I don't want too much code gymnastics to solve these issues.

algesten commented 5 months ago

It is easy to forget to do this filtering though as we can see from this bug so I am not sure labelling it a convenience issue is accurate.

Well. It's an internal concern.

Isn't it also technically a memory-leak to never free up candidates internally dispite the user calling invalidate_candidate?

Slab is fixed space, so we wouldn't free any memory by removing a slab entry. ice_restart is the official API that would normally clean out the entire state. In some respects these are all compound problems on invalidate_candidate() – not saying it isn't fixable, but there's only so much code I want to add to solve a non-standard API.

thomaseizinger commented 5 months ago

Isn't it also technically a memory-leak to never free up candidates internally dispite the user calling invalidate_candidate?

Slab is fixed space, so we wouldn't free any memory by removing a slab entry. ice_restart is the official API that would normally clean out the entire state. In some respects these are all compound problems on invalidate_candidate() – not saying it isn't fixable, but there's only so much code I want to add to solve a non-standard API.

Okay, fair point :)

algesten commented 5 months ago

Hey @thomaseizinger you were invalidating remotes right?

I just realised that invalidating a remote doesn't necessarily shut down the traffic from that IP:

2024-04-28T08:04:11.993143Z TRACE str0m::ice_::agent: Discard pairs for remote candidate index: 0
2024-04-28T08:04:11.993174Z TRACE L: str0m::ice_::agent: Handle receive: StunMessage { method: Binding, class: Request, attrs: Attributes { username: "1nYb:SRio", message_integrity: [21, 57, 70, 19, 126, 38, 111, 148, 69, 253, 122, 191, 214, 27, 90, 18, 203, 96, 124, 186], fingerprint: 842275725, priority: 1862270719, ice_controlled: 15672487014400784862 }, integrity_len: 56 }
2024-04-28T08:04:11.993270Z TRACE L: str0m::ice_::agent: Message accepted
2024-04-28T08:04:11.993291Z  INFO L: str0m::ice_::agent: Created peer reflexive remote candidate from STUN request: Candidate(prflx=2.2.2.2:1000/udp prio=1862270719)
2024-04-28T08:04:11.993313Z DEBUG L: str0m::ice_::agent: Created new pair for STUN request: CandidatePair(0-1 prio=7998391838664818175 state=Waiting attempts=0 unanswered=0 remote=0 last=None nom=None)

This would happen if there's a "straggler" UDP packet coming from a remote that we invalidated. Since the message is accepted (the STUN credentials are correct), and there isn't a current remote candidate for it, we auto-create a peer reflexive.

Does that have any impact on your app logic?

thomaseizinger commented 5 months ago

I don't think so. Do we test peer-reflexive candidates? I've seen it in the logs when roaming and I thought it waits until an actual candidate gets added.

Perhaps a peer-reflexive candidate should not be created if there is a discarded candidate?

algesten commented 5 months ago

@thomaseizinger yeah. That's what I implemented in #508 – if there is a matching discarded, we don't add a peer reflexive

thomaseizinger commented 5 months ago

@thomaseizinger yeah. That's what I implemented in #508 – if there is a matching discarded, we don't add a peer reflexive

Just saw that :)

thomaseizinger commented 5 months ago

So does #508 also address not nominating discarded candidates? Or do we need another filter somewhere?

thomaseizinger commented 5 months ago

So does #508 also address not nominating discarded candidates? Or do we need another filter somewhere?

I think it might address it for the referenced case where the prev discarded candidate gets added again. But for defense-in-depth reasons, we might want to add another check somewhere that any candidate we nominate is definitely not discarded.

algesten commented 5 months ago

I think it might address it for the referenced case where the prev discarded candidate gets added again. But for defense-in-depth reasons, we might want to add another check somewhere that any candidate we nominate is definitely not discarded.

We can add a check, but I think it's rather redundant. It's a pair we nominate, not a candidate (although looking at the code, I think that detail is a bit confused, at least in the logging). And there are three places where pairs can form (add_remote_candidate and add_local_candidate and when making a dynamic peer-reflexive on a STUN request). All those three seem to be guarded against using discarded.

But I don't mind a double check.

thomaseizinger commented 5 months ago

We can add a check, but I think it's rather redundant.

Yes, it would be more of a defense-in-depth thing in case we ever add another place where pairs are formed.

algesten commented 5 months ago

PR welcome!

algesten / str0m

Better handling of discarded candidates #507