gfodor / p2pcf

Low cost, low effort P2P WebRTC serverless signalling using Cloudflare Workers
MIT License
964 stars 53 forks source link

bug: cf worker shouldn't disconnect peers #15

Open ThaUnknown opened 2 years ago

ThaUnknown commented 2 years ago

the CF worker really shouldn't disconnect peers, this should happen between the peers themselves

the peer list often takes time to propagate tru cloudflare, or sometimes doesn't at all, leading to people randomly disconnecting with eachother

gfodor commented 2 years ago

hmm - the peer list is stored in R2 which has global strong consistency (read after write) except if there is a network partition. at least, that's what they say. so there shouldn't be any latency. If a peer that was once in the list is no longer there, then they timed out or gracefully disconnected. if the timeout is too short, you can adjust it in the constructor.

ThaUnknown commented 2 years ago

i can assure you it's not consistent, anyways this should probably work like a BitTorrent tracker where it announces peers and doesn't dictate them

gfodor commented 2 years ago

can you describe exactly what is happening? because the removal case happens when a peer is in the list, and then is not in the list. it's not going to do that if there was latency in the peer arriving into the list.

gfodor commented 2 years ago

Are you seeing behavior inconsistent with what Cloudflare writes here?

image

I'm not opposed to changing this, but I don't want to change it for the wrong reason.

gfodor commented 2 years ago

I think a better change is to keep the peer if it is connected. The scenario this is trying to cover is where you've joined a room and there are latent peers leftover from previous sessions that never disconnected gracefully (so aren't actually going to work, but whose RTCPeerConnections are created anyway), and you want to remove the unneeded RTCPeerConnections after they are timed out by the signalling server.

ThaUnknown commented 2 years ago

can you describe exactly what is happening? because the removal case happens when a peer is in the list, and then is not in the list. it's not going to do that if there was latency in the peer arriving into the list.

the worker returns inconsistent peers from those who are expected to be in the lobby, this leads to people connecting to eachother then disconnecting the very next refresh, or even before they can finish connecting, the worker really shouldn't dictate which peers you should disconnect from

gfodor commented 2 years ago

yep - if you update this PR to just remove peers who are not currently connected I'll merge it. I want to leave this case in for the other one I mentioned: you join a room where there are still peers registered who didn't gracefully disconnect their sessions and you need to clear up the peer connections you created for them.

gfodor commented 2 years ago

that said, if you are seeing peers disappear from the worker who were in there, this sounds like a much deeper problem that needs to be understood on its own. either R2 is not living up to its guarantees (unlikely) or there are race conditions I need to deal with (more likely.)

gfodor commented 2 years ago

oh now that you mention it I'm re-reading the code, and actually if there are a bunch of rapidly joining peers there might be some shuffling (despite R2's guarantees) since they do compete for slots in the list. the idea was that the state would eventually converge on subsequent polls. i never actually saw this happen in testing but i think what you're saying sounds possible with the current implementation, so i think I can cross off the concern that I don't understand what's going on.

in any case, update your PR and it will fix this problem.