Open tomberek opened 1 year ago
This seems like it may be a simple-peer-light
issue. I've noticed that it doesn't handle reconnections after system suspension (sleep) well. Specifically, when there is an ICE connection failure, it doesn't attempt to reconnect: https://github.com/mitschabaude/simple-peer-light/blob/503028ee121687d87bb7042b24f6934a0b06ca14/index.js#L745-L752
I wonder if the connection failure can be detected by Trystero and so that it can attempt to reconnect? If not, simple-peer-light
may need to be patched directly.
Considering this:
I attempted various combinations of
.leave()
and.joinRoom
without luck.
I don't think this is a failure mode that Trystero or application code can handle. It seems that Trystero is appropriately cleaning up Peer
instances:
https://github.com/dmotz/trystero/blob/78a65e77866edfadb54d880fb12466090524789c/src/room.js#L324-L330 https://github.com/dmotz/trystero/blob/78a65e77866edfadb54d880fb12466090524789c/src/room.js#L24 https://github.com/dmotz/trystero/blob/78a65e77866edfadb54d880fb12466090524789c/src/torrent.js#L253-L262 https://github.com/dmotz/trystero/blob/78a65e77866edfadb54d880fb12466090524789c/src/torrent.js#L223-L231
So, it doesn't look like there's much that Trystero can do if leaving and rejoining the room doesn't fix the connection. According to the WebRTC spec, a "failed" state is terminal:
The RTCIceTransport has finished gathering, received an indication that there are no more remote candidates, finished checking all candidate pairs, and all pairs have either failed connectivity checks or lost consent, and either zero local candidates were gathered or the PAC timer has expired [RFC8863]. This is a terminal state until ICE is restarted.
Luckily, there appears to be a RTCPeerConnection.restartIce()
method. It does not appear that simple-peer-light
ever calls it: https://github.com/mitschabaude/simple-peer-light/search?q=restartIce&type=code
So, it seems that simple-peer-light
will need to be patched to enable reconnections due to ICE failures. After that, it might be worth adding some connection retry logic to Trystero.
EDIT: There appears to be prior art that we may be able to use for implementing ICE restarts here: https://github.com/feross/simple-peer/pull/771/files
The entire process cannot be restarted upon identification of the failure? I mean from the side of the suspended client.
The entire process cannot be restarted upon identification of the failure? I mean from the side of the suspended client.
I'm not sure. It would probably take some experimentation. What I'm struggling with is how to trigger an ICE candidate failure so that error handling could be tested and debugged. Do you have a reliable way of doing that?
I experienced this issue again today. I noticed in Chrome's chrome://webrtc-internals/ page that there were four connections that I couldn't get rid of via leave()
ing and re-joinRoom()
ing again. I could create ten new ones and cleanly remove them by leaving and rejoining, but the four "zombie" connections persisted until I refreshed the page. I suspect that "zombie" connections may be preventing reconnection (though I cannot prove that).
I need to spend more time to better understand Trystero's connection lifecycle, but I noticed this onDisconnect
function: https://github.com/dmotz/trystero/blob/78a65e77866edfadb54d880fb12466090524789c/src/torrent.js#L242
It is called upon peer close
events:
https://github.com/dmotz/trystero/blob/78a65e77866edfadb54d880fb12466090524789c/src/torrent.js#L126
https://github.com/dmotz/trystero/blob/78a65e77866edfadb54d880fb12466090524789c/src/torrent.js#L152
I wonder if this could be bound to the peer
's events.error
event as well? I'm thinking that connection error events aren't fully handled by Trystero, so they don't get properly cleaned up and therefore prevent reconnection.
@dmotz do you thoughts about handling error events by calling onDisconnect
when there is an error?
I'm able to reliably reproduce the error state with two computers. On one computer, I just disable WiFi, wait 30-60 seconds, and turn it back on. The two peers can no longer connect. Leaving and rejoining the room on either computer doesn't reestablish connection.
However, I noticed that if both computers leave and rejoin the room, connection can successfully be reestablished. Sometimes it takes more than one try to leave and rejoin the room.
This suggests that the re-connection issue can be addressed within Trystero. It's worth exploring what the leave action is doing to clean up connections and see if it can be reused during a live connection as peers cycle in and out of the room.
I've got a PR to fix this at #30. @tomberek please test it out and see if it fixes the issue for you.
Sadly, no. The same sequence of events still did not lead to a re-connection. I suspect one of the peers is trying, but no the other.
This is consistent with my experience on mobile, even with this fix. There appears to be some number of iOS-specific issues left to deal with, so we will have to iterate on a full solution for this issue.
Does #30 at least improve reconnection success for you on desktop Chrome and Firefox?
@jeremyckahn Yes, #30 improves leave/rejoin behavior between desktop browsers. Thanks!
That's great to hear! Thanks for testing @tomberek.
If #30 seems good to @dmotz, I recommend merging and releasing the change when it's convenient but leaving this issue open. I predict that this problem will take time and iteration to truly solve, but it can be gradually improved over time with incremental fixes.
I released the related PR as 0.11.8. I haven't had a chance to look into this much yet, but will try to explore soon. It may be worth doing some tests with the Firebase strategy to determine whether the problem lies on the matchmaking side or the WebRTC library side. If it's the latter I'm open to finding a new RTC connection management lib or building one in.
Thank you for the 0.11.8 release, @dmotz!
It may be worth doing some tests with the Firebase strategy to determine whether the problem lies on the matchmaking side or the WebRTC library side.
That's a great idea. I've only used the WebTorrent strategy so far so I don't know how its reliability might compare to Firebase in practice. I've noticed that clients are generally able to connect to WebTorrent servers and exchange SDPs. I think the remaining issues (at least on desktop browsers) are specific longer-running sessions, which is going to be challenging to reproduce and debug.
I think the iOS issues are a result of limitations that Apple has built in paired with platform-specific bugs. For example:
There may be only so much we can do for iOS until Apple fixes their platform-level issues. 😕
In case it's helpful, https://www.farmhand.life/#online/global (source) is running my Trystero branch (which, at the time of this comment, is in sync with the 0.11.8 release). I leave sessions running for days at a time, which is how I can observe the issues related to longer-lived sessions. It may be a helpful way to manifest issues for testing and debugging purposes.
@jeremyckahn any updates on your experience?
@jeremyckahn any updates on your experience?
No, it's about the same since my last comment. I haven't been exploring connection improvements since then.
I'm getting some re-connection difficulties. 1) Two clients connect (one is a phone). 2) Phone display is turned off. 3) 2 minutes later the connection is broken and the other browser shows this error in console.
I attempted various combinations of
.leave()
and.joinRoom
without luck.