algesten / str0m

A synchronous sans I/O WebRTC implementation in Rust.
MIT License
280 stars 45 forks source link

Re-add invalidated remote/local candidates #508

Closed algesten closed 2 months ago

algesten commented 2 months ago

This fixes re-adding of previously invalidated ice candidates.

Close #507

algesten commented 2 months ago

@thomaseizinger I added failing tests. Now I need to fix them.

jamilbk commented 2 months ago

We're seeing issues with this PR over in firezone/firezone on all Clients/Gateways using this PR (the crash is fixed though, thanks for the fix there!). We're not sure what the specific issue is, but it seems like successfully nominated candidates end up being discarded. The behavior we're seeing is that the connection initially succeeds ("Checking -> Connected") but then immediately disconnected ("Checking -> Disconnected") with "none nominated".

Here's a snippet of the issue:

connlib 13:14:14.032834-0700    Nominated pair: CandidatePair(0-4 prio=216171682568601087 state=Succeeded attempts=1 unanswered=0 remote=1 last=Some(Instant { tv_sec: 317274, tv_nsec: 755523958 }) nom=Nominated)
connlib 13:14:14.033086-0700    State change (got nomination, still trying others): Checking -> Connected
connlib 13:14:14.037117-0700    Nominated attempt STUN binding: CandidatePair(0-4 prio=216171682568601087 state=Succeeded attempts=1 unanswered=0 remote=1 last=Some(Instant { tv_sec: 317274, tv_nsec: 755523958 }) nom=Nominated)
connlib 13:14:14.037373-0700    Send STUN request: 192.168.1.225:49762 -> 34.138.249.166:51208 StunMessage { method: Binding, class: Request, attrs: Attributes { username: "E2uq:glp2", priority: 1862270719, use_candidate: true, ice_controlling: 13946265949209218412 }, integrity_len: 0 }
connlib 13:14:14.037621-0700    Updating remote socket  duration_since_intent=651.565209ms new=Direct { source: 192.168.1.225:49762, dest: 34.138.249.166:51208 } old=None
...
connlib 13:13:12.924057-0700    Remove failed pair: CandidatePair(0-4 prio=216171682568601087 state=Succeeded attempts=8 unanswered=8 remote=0 last=None nom=Success)

From the above it looks like CandidatePair 0-4 succeeded then immediately was discarded. cc @thomaseizinger (currently on PTO).

Here is a full log dump if it's helpful: log.txt.zip

algesten commented 2 months ago

@jamilbk It looks a bit like we're not getting more traffic from the other side.

With some sleuthing I think we can say the candidate pair 0-4 corresponds to:

connlib 13:14:13.585958-0700    Add local candidate: Candidate(host=192.168.1.225:49762/udp prio=2130706175)
connlib 13:14:13.638333-0700    Add remote candidate: Candidate(relay=34.138.249.166:51208/udp prio=50331391)

So we're looking for traffic going 192.168.1.225:49762 <-> 34.138.249.166:51208

grep 34.138.249.166:51208 log.txt
...
connlib 13:14:13.938348-0700    Send STUN request: 192.168.1.225:49762 -> 34.138.249.166:51208 StunMessage { method: Binding, class: Request, attrs: Attributes { username: "E2uq:glp2", priority: 1862270719, ice_controlling: 13946265949209218412 }, integrity_len: 0 }
connlib 13:14:13.983793-0700    Send STUN reply: 192.168.1.225:49762 -> 34.138.249.166:51208 StunMessage { method: Binding, class: Success, attrs: Attributes { xor_mapped_address: 34.138.249.166:51208 }, integrity_len: 0 }
connlib 13:14:14.037373-0700    Send STUN request: 192.168.1.225:49762 -> 34.138.249.166:51208 StunMessage { method: Binding, class: Request, attrs: Attributes { username: "E2uq:glp2", priority: 1862270719, use_candidate: true, ice_controlling: 13946265949209218412 }, integrity_len: 0 }
connlib 13:14:14.037621-0700    Updating remote socket  duration_since_intent=651.565209ms new=Direct { source: 192.168.1.225:49762, dest: 34.138.249.166:51208 } old=None
…

We see a whole bunch of STUN requests that never get an answer, and that we send a single STUN reply to an incoming request from the remote peer.

This is confirmed by this line

connlib 13:14:24.738164-0700    Remove failed pair: CandidatePair(0-4 prio=216171682568601087 state=Succeeded attempts=8 unanswered=8 remote=1 last=Some(Instant { tv_sec: 317274, tv_nsec: 755523958 }) nom=Success)

remote=1 means we seen one message from the remote (request or reply), in this case that's that single request which we replied to initially.

It doesn't look like connlib invalidates any in-use candidate, remote or local. The next thing to check is whether packets are indeed arriving from remote, but str0m rejects for some reason.

This is the point where incoming packets are treated either as incoming requests, or replies to requests we sent. In both these branches there are further points where the packet might be discarded more silently. https://github.com/algesten/str0m/blob/main/src/ice/agent.rs#L911-L915

jamilbk commented 2 months ago

Doing more testing:

It does seem like we receive the Handle received message and establish a connection. It seems intermittent, sometimes the connection lasts, sometimes it doesn't.

connlib 14:25:23.433845-0700    Check if accepts message: StunMessage { method: Binding, class: Success, attrs: Attributes { message_integrity: [167, 42, 174, 120, 126, 138, 85, 244, 22, 144, 22, 130, 124, 25, 251, 50, 133, 119, 226, 237], xor_mapped_address: 107.197.104.68:59092, fingerprint: 1550458437 }, integrity_len: 32 }
connlib 14:25:23.433911-0700    Message accepted
connlib 14:25:23.433952-0700    Handle receive: StunMessage { method: Binding, class: Success, attrs: Attributes { message_integrity: [167, 42, 174, 120, 126, 138, 85, 244, 22, 144, 22, 130, 124, 25, 251, 50, 133, 119, 226, 237], xor_mapped_address: 107.197.104.68:59092, fingerprint: 1550458437 }, integrity_len: 32 }
connlib 14:25:23.433994-0700    Check if accepts message: StunMessage { method: Binding, class: Success, attrs: Attributes { message_integrity: [167, 42, 174, 120, 126, 138, 85, 244, 22, 144, 22, 130, 124, 25, 251, 50, 133, 119, 226, 237], xor_mapped_address: 107.197.104.68:59092, fingerprint: 1550458437 }, integrity_len: 32 }
connlib 14:25:23.434031-0700    Message accepted
connlib 14:25:23.434063-0700    Found local candidate for mapped address: 107.197.104.68:59092
connlib 14:25:23.434097-0700    Check state: InProgress -> Succeeded
connlib 14:25:23.434134-0700    Recorded binding response: CandidatePair(0-2 prio=216171682568601087 state=Succeeded attempts=1 unanswered=0 remote=0 last=None nom=None)
connlib 14:25:23.434166-0700    Nominating best candidate
connlib 14:25:23.434201-0700    Nominated pair: CandidatePair(0-2 prio=216171682568601087 state=Succeeded attempts=1 unanswered=0 remote=0 last=None nom=Nominated)
connlib 14:25:23.434236-0700    Enqueueing event: NominatedSend { proto: Udp, source: 192.168.1.225:59092, destination: 35.235.125.17:60740 }
connlib 14:25:23.434269-0700    State change (got nomination, still trying others): Checking -> Connected
connlib 14:25:23.434302-0700    Enqueueing event: IceConnectionStateChange(Connected)
connlib 14:25:23.434334-0700    Enqueueing event: DiscoveredRecv { proto: Udp, source: 35.235.125.17:60740 }
connlib 14:25:23.458650-0700    Handle next triggered pair: CandidatePair(0-2 prio=216171682568601087 state=Succeeded attempts=1 unanswered=0 remote=0 last=None nom=Nominated)
connlib 14:25:23.458718-0700    Nominated attempt STUN binding: CandidatePair(0-2 prio=216171682568601087 state=Succeeded attempts=1 unanswered=0 remote=0 last=None nom=Nominated)
connlib 14:25:23.458796-0700    Send STUN request: 192.168.1.225:59092 -> 35.235.125.17:60740 StunMessage { method: Binding, class: Request, attrs: Attributes { username: "QNb9:QpXl", priority: 1862270719, use_candidate: true, ice_controlling: 9160184061805568986 }, integrity_len: 0 }
connlib 14:25:23.458859-0700    Poll event: Some(NominatedSend { proto: Udp, source: 192.168.1.225:59092, destination: 35.235.125.17:60740 })
connlib 14:25:23.458918-0700    Updating remote socket  duration_since_intent=766.81325ms new=Direct { source: 192.168.1.225:59092, dest: 35.235.125.17:60740 } old=None

This is with str0m=trace. I'll continue to test and post updates here. The weird thing is that it's intermittent.

algesten commented 2 months ago

Right.

https://github.com/algesten/str0m/blob/main/src/io/stun.rs#L11-L33

If there is connectivity, I think there needs to be 18 seconds of missed packets before it disconnects. The logic is above.

jamilbk commented 2 months ago

I think we finally managed to capture a decent single log snapshot of the issue. This seems to be the issue?

2024-04-30T21:59:26.432203Z TRACE connection{id=c2806b19-678f-4e63-9a36-4d6d041290d5}: str0m::ice_::agent: STUN request ignored because remote candidate is discarded

Here is a full snapshot of the client and gateway log of the issue in isolation:

Archive.zip

algesten commented 2 months ago

This looks promising. We need to figure out why it is discarded.

I'm in bed now, but i can assist in the morning.