firezone / firezone

Enterprise-ready zero-trust access platform built on WireGuard®.
https://www.firezone.dev
Apache License 2.0
6.83k stars 284 forks source link

Increase `MAX_PARTITION_TIME` substantially to prevent signing out clients without internet connectivity #3334

Closed jamilbk closed 9 months ago

jamilbk commented 9 months ago

For explanation see this comment. Otherwise a client that loses internet connectivity for a few minutes will be signed out. This will cause problems with clients operating headlessly.

Instead, we should only call on_disconnect(errorString) if there's been a fatal error that absolutely requires signing out and clearing the token.

jamilbk commented 9 months ago

I think if we just keep retrying connection to portal every 5 seconds, we don't need to detect internet connectivity changes in the clients (except the Apple client, which requires this because of its DNS limitations which means getSystemDefaultResolvers cannot be used while the connlib session is active).

The end-user experience here would then be:

  1. The Firezone client is always "Signed in". Only on_disconnect(errorString) is called for 4xx errors and non-recoverable errors, like if the tunnel interface can't be created. For those cases, we sign the user out.
  2. If internet connectivity for a client drops temporarily (switching networks), Firezone will remain "Signed in". When connlib reconnects to the portal successfully, it calls getSystemDefaultResolvers again and re-initializes the interface.
  3. If internet connectivity drops completely for a long time (use-case: clients with spotty cellular connections), Firezone is still signed in with connlib retrying in the background (every 5 seconds). This is the part that's broken today

The edge case here is if the DNS servers change for an interface without any other network connectivity changes. If connlib doesn't detect portal connectivity loss, it won't retry getting the DNS servers. For this edge case, we can handle it one of two ways:

  1. Either the OS detects these "network changes" and disconnects and reconnects connlib, keeping the token. This is how the Apple client functions. This requires an "in-between" state of "signed in but not connected". OR
  2. Connlib occasionally calls getSystemDefaultResolvers in the background on, say a 5-second timer. If they're different, it re-initializes the resolver to use these new ones.

Option (2) I think this eliminates the need (for quite some time) for clients to rely on OS-level APIs to "detect network connectivity" and respond, because that requires complex state tracking like the one we have in the Apple client. It means the client will need to handle a "Signed in but not connected" state which we probably need to reflect in the UI as well (sad face).

Based on the above, I think it makes sense to forgo the client-side connectivity tracking in lieu of the approach I outlined above in Windows and Android clients.

Apple client can remain unchanged.

cc @conectado @ReactorScram @jasonboukheir

ReactorScram commented 9 months ago

https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

Edit: The backoff crate already defaults to using some jitter, perfect https://docs.rs/backoff/0.4.0/src/backoff/default.rs.html#7

Just wanted to say if the portal goes down or for some reason a customer loses Internet and like 500 clients try to reconnect all at once, some jitter might help reduce the thundering herd effect on the portal server

Had it happen at a previous job where a customer would lose power and 100-ish IoT devices would all boot up exactly X seconds later and then all hammer the server at the same time