Closed jamilbk closed 9 months ago
I think if we just keep retrying connection to portal every 5 seconds, we don't need to detect internet connectivity changes in the clients (except the Apple client, which requires this because of its DNS limitations which means getSystemDefaultResolvers
cannot be used while the connlib session is active).
The end-user experience here would then be:
on_disconnect(errorString)
is called for 4xx errors and non-recoverable errors, like if the tunnel interface can't be created. For those cases, we sign the user out.getSystemDefaultResolvers
again and re-initializes the interface.The edge case here is if the DNS servers change for an interface without any other network connectivity changes. If connlib doesn't detect portal connectivity loss, it won't retry getting the DNS servers. For this edge case, we can handle it one of two ways:
getSystemDefaultResolvers
in the background on, say a 5-second timer. If they're different, it re-initializes the resolver to use these new ones.Option (2) I think this eliminates the need (for quite some time) for clients to rely on OS-level APIs to "detect network connectivity" and respond, because that requires complex state tracking like the one we have in the Apple client. It means the client will need to handle a "Signed in but not connected" state which we probably need to reflect in the UI as well (sad face).
Based on the above, I think it makes sense to forgo the client-side connectivity tracking in lieu of the approach I outlined above in Windows and Android clients.
Apple client can remain unchanged.
cc @conectado @ReactorScram @jasonboukheir
https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
Edit: The backoff
crate already defaults to using some jitter, perfect https://docs.rs/backoff/0.4.0/src/backoff/default.rs.html#7
Just wanted to say if the portal goes down or for some reason a customer loses Internet and like 500 clients try to reconnect all at once, some jitter might help reduce the thundering herd effect on the portal server
Had it happen at a previous job where a customer would lose power and 100-ish IoT devices would all boot up exactly X seconds later and then all hammer the server at the same time
For explanation see this comment. Otherwise a client that loses internet connectivity for a few minutes will be signed out. This will cause problems with clients operating headlessly.
Instead, we should only call
on_disconnect(errorString)
if there's been a fatal error that absolutely requires signing out and clearing the token.