Closed algesten closed 2 months ago
@thomaseizinger I added failing tests. Now I need to fix them.
We're seeing issues with this PR over in firezone/firezone on all Clients/Gateways using this PR (the crash is fixed though, thanks for the fix there!). We're not sure what the specific issue is, but it seems like successfully nominated candidates end up being discarded. The behavior we're seeing is that the connection initially succeeds ("Checking -> Connected") but then immediately disconnected ("Checking -> Disconnected") with "none nominated".
Here's a snippet of the issue:
connlib 13:14:14.032834-0700 Nominated pair: CandidatePair(0-4 prio=216171682568601087 state=Succeeded attempts=1 unanswered=0 remote=1 last=Some(Instant { tv_sec: 317274, tv_nsec: 755523958 }) nom=Nominated)
connlib 13:14:14.033086-0700 State change (got nomination, still trying others): Checking -> Connected
connlib 13:14:14.037117-0700 Nominated attempt STUN binding: CandidatePair(0-4 prio=216171682568601087 state=Succeeded attempts=1 unanswered=0 remote=1 last=Some(Instant { tv_sec: 317274, tv_nsec: 755523958 }) nom=Nominated)
connlib 13:14:14.037373-0700 Send STUN request: 192.168.1.225:49762 -> 34.138.249.166:51208 StunMessage { method: Binding, class: Request, attrs: Attributes { username: "E2uq:glp2", priority: 1862270719, use_candidate: true, ice_controlling: 13946265949209218412 }, integrity_len: 0 }
connlib 13:14:14.037621-0700 Updating remote socket duration_since_intent=651.565209ms new=Direct { source: 192.168.1.225:49762, dest: 34.138.249.166:51208 } old=None
...
connlib 13:13:12.924057-0700 Remove failed pair: CandidatePair(0-4 prio=216171682568601087 state=Succeeded attempts=8 unanswered=8 remote=0 last=None nom=Success)
From the above it looks like CandidatePair 0-4 succeeded then immediately was discarded. cc @thomaseizinger (currently on PTO).
Here is a full log dump if it's helpful: log.txt.zip
@jamilbk It looks a bit like we're not getting more traffic from the other side.
With some sleuthing I think we can say the candidate pair 0-4 corresponds to:
connlib 13:14:13.585958-0700 Add local candidate: Candidate(host=192.168.1.225:49762/udp prio=2130706175)
connlib 13:14:13.638333-0700 Add remote candidate: Candidate(relay=34.138.249.166:51208/udp prio=50331391)
So we're looking for traffic going 192.168.1.225:49762 <-> 34.138.249.166:51208
grep 34.138.249.166:51208 log.txt
...
connlib 13:14:13.938348-0700 Send STUN request: 192.168.1.225:49762 -> 34.138.249.166:51208 StunMessage { method: Binding, class: Request, attrs: Attributes { username: "E2uq:glp2", priority: 1862270719, ice_controlling: 13946265949209218412 }, integrity_len: 0 }
connlib 13:14:13.983793-0700 Send STUN reply: 192.168.1.225:49762 -> 34.138.249.166:51208 StunMessage { method: Binding, class: Success, attrs: Attributes { xor_mapped_address: 34.138.249.166:51208 }, integrity_len: 0 }
connlib 13:14:14.037373-0700 Send STUN request: 192.168.1.225:49762 -> 34.138.249.166:51208 StunMessage { method: Binding, class: Request, attrs: Attributes { username: "E2uq:glp2", priority: 1862270719, use_candidate: true, ice_controlling: 13946265949209218412 }, integrity_len: 0 }
connlib 13:14:14.037621-0700 Updating remote socket duration_since_intent=651.565209ms new=Direct { source: 192.168.1.225:49762, dest: 34.138.249.166:51208 } old=None
…
We see a whole bunch of STUN requests that never get an answer, and that we send a single STUN reply to an incoming request from the remote peer.
This is confirmed by this line
connlib 13:14:24.738164-0700 Remove failed pair: CandidatePair(0-4 prio=216171682568601087 state=Succeeded attempts=8 unanswered=8 remote=1 last=Some(Instant { tv_sec: 317274, tv_nsec: 755523958 }) nom=Success)
remote=1
means we seen one message from the remote (request or reply), in this case that's that single request which we replied to initially.
It doesn't look like connlib
invalidates any in-use candidate, remote or local. The next thing to check is whether packets are indeed arriving from remote, but str0m rejects for some reason.
This is the point where incoming packets are treated either as incoming requests, or replies to requests we sent. In both these branches there are further points where the packet might be discarded more silently. https://github.com/algesten/str0m/blob/main/src/ice/agent.rs#L911-L915
Doing more testing:
It does seem like we receive the Handle received message
and establish a connection. It seems intermittent, sometimes the connection lasts, sometimes it doesn't.
connlib 14:25:23.433845-0700 Check if accepts message: StunMessage { method: Binding, class: Success, attrs: Attributes { message_integrity: [167, 42, 174, 120, 126, 138, 85, 244, 22, 144, 22, 130, 124, 25, 251, 50, 133, 119, 226, 237], xor_mapped_address: 107.197.104.68:59092, fingerprint: 1550458437 }, integrity_len: 32 }
connlib 14:25:23.433911-0700 Message accepted
connlib 14:25:23.433952-0700 Handle receive: StunMessage { method: Binding, class: Success, attrs: Attributes { message_integrity: [167, 42, 174, 120, 126, 138, 85, 244, 22, 144, 22, 130, 124, 25, 251, 50, 133, 119, 226, 237], xor_mapped_address: 107.197.104.68:59092, fingerprint: 1550458437 }, integrity_len: 32 }
connlib 14:25:23.433994-0700 Check if accepts message: StunMessage { method: Binding, class: Success, attrs: Attributes { message_integrity: [167, 42, 174, 120, 126, 138, 85, 244, 22, 144, 22, 130, 124, 25, 251, 50, 133, 119, 226, 237], xor_mapped_address: 107.197.104.68:59092, fingerprint: 1550458437 }, integrity_len: 32 }
connlib 14:25:23.434031-0700 Message accepted
connlib 14:25:23.434063-0700 Found local candidate for mapped address: 107.197.104.68:59092
connlib 14:25:23.434097-0700 Check state: InProgress -> Succeeded
connlib 14:25:23.434134-0700 Recorded binding response: CandidatePair(0-2 prio=216171682568601087 state=Succeeded attempts=1 unanswered=0 remote=0 last=None nom=None)
connlib 14:25:23.434166-0700 Nominating best candidate
connlib 14:25:23.434201-0700 Nominated pair: CandidatePair(0-2 prio=216171682568601087 state=Succeeded attempts=1 unanswered=0 remote=0 last=None nom=Nominated)
connlib 14:25:23.434236-0700 Enqueueing event: NominatedSend { proto: Udp, source: 192.168.1.225:59092, destination: 35.235.125.17:60740 }
connlib 14:25:23.434269-0700 State change (got nomination, still trying others): Checking -> Connected
connlib 14:25:23.434302-0700 Enqueueing event: IceConnectionStateChange(Connected)
connlib 14:25:23.434334-0700 Enqueueing event: DiscoveredRecv { proto: Udp, source: 35.235.125.17:60740 }
connlib 14:25:23.458650-0700 Handle next triggered pair: CandidatePair(0-2 prio=216171682568601087 state=Succeeded attempts=1 unanswered=0 remote=0 last=None nom=Nominated)
connlib 14:25:23.458718-0700 Nominated attempt STUN binding: CandidatePair(0-2 prio=216171682568601087 state=Succeeded attempts=1 unanswered=0 remote=0 last=None nom=Nominated)
connlib 14:25:23.458796-0700 Send STUN request: 192.168.1.225:59092 -> 35.235.125.17:60740 StunMessage { method: Binding, class: Request, attrs: Attributes { username: "QNb9:QpXl", priority: 1862270719, use_candidate: true, ice_controlling: 9160184061805568986 }, integrity_len: 0 }
connlib 14:25:23.458859-0700 Poll event: Some(NominatedSend { proto: Udp, source: 192.168.1.225:59092, destination: 35.235.125.17:60740 })
connlib 14:25:23.458918-0700 Updating remote socket duration_since_intent=766.81325ms new=Direct { source: 192.168.1.225:59092, dest: 35.235.125.17:60740 } old=None
This is with str0m=trace
. I'll continue to test and post updates here. The weird thing is that it's intermittent.
Right.
https://github.com/algesten/str0m/blob/main/src/io/stun.rs#L11-L33
If there is connectivity, I think there needs to be 18 seconds of missed packets before it disconnects. The logic is above.
I think we finally managed to capture a decent single log snapshot of the issue. This seems to be the issue?
2024-04-30T21:59:26.432203Z TRACE connection{id=c2806b19-678f-4e63-9a36-4d6d041290d5}: str0m::ice_::agent: STUN request ignored because remote candidate is discarded
Here is a full snapshot of the client and gateway log of the issue in isolation:
This looks promising. We need to figure out why it is discarded.
I'm in bed now, but i can assist in the morning.
This fixes re-adding of previously invalidated ice candidates.
Close #507