getlantern / browsersunbounded

Interoperable browser-based P2P proxies for censorship circumvention
GNU General Public License v3.0

Make censored peers prioritize working STUN servers #186

Closed: noahlevenson closed this 9 months ago

noahlevenson commented 9 months ago

Via @oxtoacart on Slack:

Can we make the censored peers smart enough to prioritize available STUN servers based on which ones have worked recently in order to maximize the chance of them starting with a working one?

Writing a STUNBatch function which prioritizes recently-working servers over randomly chosen ones is probably a 30-minute refactor.

Persisting it to disk to maintain state between restarts is a bigger project.

We should keep an eye on both of these optimizations, though.

noahlevenson commented 9 months ago

I dug into this today and discovered it's a bit more complex than I originally thought. I also had to refresh my memory on the motivation for the STUNBatch function. I've arrived at an idea for how to best improve things here, which I will now attempt to explain, mostly for my own comprehension:

The total set of known STUN servers is always changing. And we collect STUN servers from all kinds of different places. Right now, we pass STUN servers to censored clients in the configuration they receive from config-server or lantern-cloud. Uncensored clients actually fetch a public list of STUN servers which is continuously refreshed and republished. Soon we may have peers act as STUN servers, and they churn constantly. We also want to crawl the IPv4 space in-country and find unblocked STUN servers that we can surreptitiously serve to our clients.

Having said this, it would be quite dumb if the only STUN servers your client ever knew about were the STUN servers we happened to know about at the time your client booted. Your client shouldn't get dumber and more stale the longer it's been online. Instead, clients should have some way to fetch the most up-to-date information about known STUN servers.

So that's basically why we invented the STUNBatch concept. We created a simple abstraction outside of the protocol logic that's responsible for figuring out how to fetch the most updated set of STUN servers. The STUNBatch function is a single interface that hides some very different implementations depending on peer role, since acquiring information is much easier in the uncensored world than in the censored world. But when the protocol logic needs some STUN servers, it doesn't need to understand much about the problem -- it simply calls out to the STUNBatch function, which knows how to return N STUN servers, and that's that.
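To make the shape of that abstraction concrete, here's a minimal sketch in Go. The names (`STUNBatch`, `configBatch`) and the example hostnames are invented for illustration and are not broflake's actual API; the point is just that the protocol logic sees a single function type, while different peer roles can plug in very different implementations behind it.

```go
package main

import "fmt"

// STUNBatch is a hypothetical signature for the batch function described
// above: given a desired batch size n, it returns up to n STUN server URIs.
// The protocol logic calls it without knowing where the servers come from.
type STUNBatch func(n int) []string

// configBatch sketches a censored-peer implementation that serves hosts
// from a config snapshot. The list here is made up for illustration; a real
// implementation would read whatever config-server or lantern-cloud sent.
func configBatch(n int) []string {
	known := []string{"stun:stun.example.net:3478", "stun:stun2.example.net:3478"}
	if n > len(known) {
		n = len(known)
	}
	return known[:n]
}

func main() {
	// The state machines only ever see the STUNBatch type, so an
	// uncensored-peer implementation (e.g. one that fetches a public
	// list) could be swapped in without touching protocol logic.
	var batch STUNBatch = configBatch
	fmt.Println(batch(1))
}
```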

But there's a problem: the STUNBatch system is intentionally stateless. There's no I/O between the state machines and the STUNBatch function, so the state machines have no way of telling the STUNBatch system that the STUN servers it delivered are actually no good and not working.

This leads to some seriously gnarly inefficiencies. Currently, censored clients ask for a batch of 1 STUN server on each signaling attempt. What if 30% of the total set of STUN servers are blocked? Since the STUNBatch function is "dumb," signaling will fail 30% of the time.

The separation of concerns suggests that the STUNBatch system should remain stateless, since determining the success or failure of STUN requests is quite a different job from that of assembling and serving a set of hostnames.

Thus, I think the following is the correct approach:

We'll introduce the concept of a STUN cache. The STUN cache is just a slice of STUN servers. When a client boots, it will ask the STUNBatch function to return every single STUN server it currently knows about (by making a request of size positive infinity). The client will populate its STUN cache with the result, and it will shuffle the list.

When a client needs a STUN server for a signaling attempt, it performs the following steps:

  1. Check the length of the STUN cache. If zero, run the boot sequence to populate the STUN cache.
  2. Select the 0th STUN server from the STUN cache.
  3. If that STUN server returns a binding response, do nothing. If that STUN server fails to return a binding response, remove it from the STUN cache.
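The boot sequence and the three per-attempt steps above might look roughly like this in Go. This is a sketch under the assumptions in this thread: `StunCache` and its method names are invented here, the actual binding-request logic is left to the caller, and "request of size positive infinity" is approximated with the max int value.

```go
package main

import (
	"fmt"
	"math/rand"
)

// StunCache is a hypothetical in-memory cache of STUN server URIs,
// per the design above. Names are illustrative, not broflake's code.
type StunCache struct {
	servers []string
	batch   func(n int) []string // the STUNBatch function
}

// boot asks the STUNBatch function for every server it knows about
// (the "request of size positive infinity") and shuffles the result.
func (c *StunCache) boot() {
	c.servers = c.batch(int(^uint(0) >> 1)) // max int stands in for +inf
	rand.Shuffle(len(c.servers), func(i, j int) {
		c.servers[i], c.servers[j] = c.servers[j], c.servers[i]
	})
}

// next implements steps 1 and 2: if the cache is empty, re-run the boot
// sequence, then hand back the 0th server.
func (c *StunCache) next() string {
	if len(c.servers) == 0 {
		c.boot()
	}
	return c.servers[0]
}

// evict implements the failure half of step 3: after a failed binding
// request, remove the 0th server. On success the caller does nothing,
// so the working server stays at the front and keeps getting reused.
func (c *StunCache) evict() {
	if len(c.servers) > 0 {
		c.servers = c.servers[1:]
	}
}

func main() {
	cache := &StunCache{batch: func(n int) []string {
		return []string{"stun:a.example:3478", "stun:b.example:3478"}
	}}
	s := cache.next()
	fmt.Println("trying", s)
	cache.evict() // pretend the binding request timed out
	fmt.Println("trying", cache.next())
}
```

Note how the "keep using what works" behavior falls out for free: success mutates nothing, so the head of the list is sticky until it fails.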

This produces the following desirable behaviors:

Clients, upon finding a STUN server that works, will keep using that STUN server. When and if that STUN server fails, clients will try a new random STUN server until they find another one that works. When clients exhaust the entire list of STUN servers, they re-fetch the list of known STUN servers, under the assumption that it's quite possible that the list has changed in the interim.

noahlevenson commented 9 months ago

One asterisk I forgot to mention:

This is very neat and tidy when the number of ICE agents = 1, but it gets a lil hairy when you try to generalize to N. That's because at the WebRTC layer, I don't think it's possible to interrogate which of your ICE agents produced successful responses.

Put another way:

If you have 1 ICE agent, it's easy to know who to blame for success or failure.

If you have N ICE agents, failure means that everyone failed. But success only means that one of them worked.

So if your client wants to use N ICE agents, it would select a "cohort" consisting of the first N hosts in the STUN cache. On failure, everyone in the cohort gets deleted, because we know they all failed. But on success, everyone stays, despite the fact that some of the cohort may have failed.
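Sketched as a small extension of the cache idea (again with invented names): on total failure the whole cohort is evicted, on success nobody is, accepting the imprecision described above.

```go
package main

import "fmt"

// cohort returns the first n servers from the cache slice -- one per ICE
// agent -- since at the WebRTC layer we can't attribute success or
// failure to an individual agent.
func cohort(cache []string, n int) []string {
	if n > len(cache) {
		n = len(cache)
	}
	return cache[:n]
}

// afterAttempt applies the blame rule from the thread: if all N agents
// failed, every cohort member is evicted; if any agent succeeded, all
// members stay, even though some of them may in fact be dead.
func afterAttempt(cache []string, n int, success bool) []string {
	if success {
		return cache
	}
	if n > len(cache) {
		n = len(cache)
	}
	return cache[n:]
}

func main() {
	cache := []string{"stun:a", "stun:b", "stun:c"}
	fmt.Println(cohort(cache, 2))              // try agents against a and b
	fmt.Println(afterAttempt(cache, 2, false)) // total failure: both evicted
	fmt.Println(afterAttempt(cache, 2, true))  // any success: everyone stays
}
```

With n = 1 this degenerates to exactly the single-agent behavior described earlier, which is a nice sanity check on the generalization.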

myleshorton commented 9 months ago

> Having said this, it would be quite dumb if the only STUN servers your client ever knew about were the STUN servers we happened to know about at the time your client booted. Your client shouldn't get dumber and more stale the longer it's been online. Instead, clients should have some way to fetch the most up-to-date information about known STUN servers.

Just one quick note here is that clients fetch new configs quite frequently -- like every minute by default. Otherwise this all generally sounds good.

noahlevenson commented 9 months ago

https://github.com/getlantern/broflake/commit/cb21b80ac8584da76ec9903266e0840edf36b0c8 implements the STUN cache concept. Gonna close this out...