dmotz / trystero

✨🤝✨ Build instant multiplayer webapps, no server required — Magic WebRTC matchmaking over BitTorrent, Nostr, MQTT, IPFS, Supabase, and Firebase
https://oxism.com/trystero
MIT License
1.27k stars 90 forks source link

Re-connect #29

Open tomberek opened 1 year ago

tomberek commented 1 year ago

I'm getting some re-connection difficulties. 1) Two clients connect (one is a phone). 2) Phone display is turned off. 3) 2 minutes later the connection is broken and the other browser shows this error in console.

crypto-254fb80e.js:232 Error: Connection failed.
    at Peer._onConnectionStateChange (simple-peer-light.js:550:28)
    at Peer._pc.onconnectionstatechange (simple-peer-light.js:105:12)

I attempted various combinations of .leave() and .joinRoom without luck.

jeremyckahn commented 1 year ago

This seems like it may be a simple-peer-light issue. I've noticed that it doesn't handle reconnections after system suspension (sleep) well. Specifically, when there is an ICE connection failure, it doesn't attempt to reconnect: https://github.com/mitschabaude/simple-peer-light/blob/503028ee121687d87bb7042b24f6934a0b06ca14/index.js#L745-L752

I wonder if the connection failure can be detected by Trystero and so that it can attempt to reconnect? If not, simple-peer-light may need to be patched directly.

jeremyckahn commented 1 year ago

Considering this:

I attempted various combinations of .leave() and .joinRoom without luck.

I don't think this is a failure mode that Trystero or application code can handle. It seems that Trystero is appropriately cleaning up Peer instances:

https://github.com/dmotz/trystero/blob/78a65e77866edfadb54d880fb12466090524789c/src/room.js#L324-L330 https://github.com/dmotz/trystero/blob/78a65e77866edfadb54d880fb12466090524789c/src/room.js#L24 https://github.com/dmotz/trystero/blob/78a65e77866edfadb54d880fb12466090524789c/src/torrent.js#L253-L262 https://github.com/dmotz/trystero/blob/78a65e77866edfadb54d880fb12466090524789c/src/torrent.js#L223-L231

So, it doesn't look like there's much that Trystero can do if leaving and rejoining the room doesn't fix the connection. According to the WebRTC spec, a "failed" state is terminal:

The RTCIceTransport has finished gathering, received an indication that there are no more remote candidates, finished checking all candidate pairs, and all pairs have either failed connectivity checks or lost consent, and either zero local candidates were gathered or the PAC timer has expired [RFC8863]. This is a terminal state until ICE is restarted.

Luckily, there appears to be a RTCPeerConnection.restartIce() method. It does not appear that simple-peer-light ever calls it: https://github.com/mitschabaude/simple-peer-light/search?q=restartIce&type=code

So, it seems that simple-peer-light will need to be patched to enable reconnections due to ICE failures. After that, it might be worth adding some connection retry logic to Trystero.

EDIT: There appears to be prior art that we may be able to use for implementing ICE restarts here: https://github.com/feross/simple-peer/pull/771/files

tomberek commented 1 year ago

The entire process cannot be restarted upon identification of the failure? I mean from the side of the suspended client.

jeremyckahn commented 1 year ago

The entire process cannot be restarted upon identification of the failure? I mean from the side of the suspended client.

I'm not sure. It would probably take some experimentation. What I'm struggling with is how to trigger an ICE candidate failure so that error handling could be tested and debugged. Do you have a reliable way of doing that?

jeremyckahn commented 1 year ago

I experienced this issue again today. I noticed in Chrome's chrome://webrtc-internals/ page that there were four connections that I couldn't get rid of via leave()ing and re-joinRoom()ing again. I could create ten new ones and cleanly remove them by leaving and rejoining, but the four "zombie" connections persisted until I refreshed the page. I suspect that "zombie" connections may be preventing reconnection (though I cannot prove that).

I need to spend more time to better understand Trystero's connection lifecycle, but I noticed this onDisconnect function: https://github.com/dmotz/trystero/blob/78a65e77866edfadb54d880fb12466090524789c/src/torrent.js#L242

It is called upon peer close events: https://github.com/dmotz/trystero/blob/78a65e77866edfadb54d880fb12466090524789c/src/torrent.js#L126 https://github.com/dmotz/trystero/blob/78a65e77866edfadb54d880fb12466090524789c/src/torrent.js#L152

I wonder if this could be bound to the peer's events.error event as well? I'm thinking that connection error events aren't fully handled by Trystero, so they don't get properly cleaned up and therefore prevent reconnection.

@dmotz do you thoughts about handling error events by calling onDisconnect when there is an error?

jeremyckahn commented 1 year ago

I'm able to reliably reproduce the error state with two computers. On one computer, I just disable WiFi, wait 30-60 seconds, and turn it back on. The two peers can no longer connect. Leaving and rejoining the room on either computer doesn't reestablish connection.

However, I noticed that if both computers leave and rejoin the room, connection can successfully be reestablished. Sometimes it takes more than one try to leave and rejoin the room.

This suggests that the re-connection issue can be addressed within Trystero. It's worth exploring what the leave action is doing to clean up connections and see if it can be reused during a live connection as peers cycle in and out of the room.

jeremyckahn commented 1 year ago

I've got a PR to fix this at #30. @tomberek please test it out and see if it fixes the issue for you.

tomberek commented 1 year ago

Sadly, no. The same sequence of events still did not lead to a re-connection. I suspect one of the peers is trying, but no the other.

jeremyckahn commented 1 year ago

This is consistent with my experience on mobile, even with this fix. There appears to be some number of iOS-specific issues left to deal with, so we will have to iterate on a full solution for this issue.

Does #30 at least improve reconnection success for you on desktop Chrome and Firefox?

tomberek commented 1 year ago

@jeremyckahn Yes, #30 improves leave/rejoin behavior between desktop browsers. Thanks!

jeremyckahn commented 1 year ago

That's great to hear! Thanks for testing @tomberek.

If #30 seems good to @dmotz, I recommend merging and releasing the change when it's convenient but leaving this issue open. I predict that this problem will take time and iteration to truly solve, but it can be gradually improved over time with incremental fixes.

dmotz commented 1 year ago

I released the related PR as 0.11.8. I haven't had a chance to look into this much yet, but will try to explore soon. It may be worth doing some tests with the Firebase strategy to determine whether the problem lies on the matchmaking side or the WebRTC library side. If it's the latter I'm open to finding a new RTC connection management lib or building one in.

jeremyckahn commented 1 year ago

Thank you for the 0.11.8 release, @dmotz!

It may be worth doing some tests with the Firebase strategy to determine whether the problem lies on the matchmaking side or the WebRTC library side.

That's a great idea. I've only used the WebTorrent strategy so far so I don't know how its reliability might compare to Firebase in practice. I've noticed that clients are generally able to connect to WebTorrent servers and exchange SDPs. I think the remaining issues (at least on desktop browsers) are specific longer-running sessions, which is going to be challenging to reproduce and debug.

I think the iOS issues are a result of limitations that Apple has built in paired with platform-specific bugs. For example:

There may be only so much we can do for iOS until Apple fixes their platform-level issues. 😕


In case it's helpful, https://www.farmhand.life/#online/global (source) is running my Trystero branch (which, at the time of this comment, is in sync with the 0.11.8 release). I leave sessions running for days at a time, which is how I can observe the issues related to longer-lived sessions. It may be a helpful way to manifest issues for testing and debugging purposes.

aehlke commented 1 year ago

@jeremyckahn any updates on your experience?

jeremyckahn commented 1 year ago

@jeremyckahn any updates on your experience?

No, it's about the same since my last comment. I haven't been exploring connection improvements since then.