Problem
(Continuing a discussion that started in Slack here.)
We received reports of the Lyft alpha app stalling while using Envoy Mobile in the following scenario:
1. User opens the app on WiFi and makes some requests
2. User switches to cellular (e.g., by getting into an elevator)
3. Requests start failing: onError is called with upstream request timeout errors
4. Exactly 2 minutes after requests start failing, requests start succeeding again
During this time, the client is making requests every few seconds, and each one fails after timing out (15s later, since the timeout specified is 15000ms).
The very last request that fails before requests start succeeding again fails with the error "upstream connect error or disconnect/reset before headers. reset reason: connection termination". After this, requests start working again.
Reproducing
I was able to reproduce the above issue in a similar fashion by doing the following:
1. Start Envoy Mobile on a physical device connected to LTE
2. Disable LTE and allow the phone to switch to 4G
3. Requests start stalling / timing out
4. After some time, requests will start working again (presumably after the upstream connection is shut down)
Hypothesis
Since we observe reachability and switch clusters based on preferred networks within Envoy Mobile, we previously assumed that we'd be using the correct WWAN/WLAN cluster for requests.
However, when we switch preferred networks, the cluster change is done lazily - i.e., we don't switch existing requests over to the new cluster (known issue in https://github.com/lyft/envoy-mobile/issues/541), and only route new requests over the preferred cluster. More importantly, we don't shut down clusters or restart them when the preferred network changes.
I believe this leads us to send requests over a dead connection when the device switches between WiFi access points or between cellular technologies (3G/4G/LTE), because the preferred cluster stays the same and its existing connections are never restarted.
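To make the hypothesis concrete, here is a minimal sketch of the suspected behavior. It is a toy model, not Envoy Mobile code: the Cluster class, its connection_path field, and the simulate function are hypothetical names, and a real connection pool is far more involved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    """Hypothetical stand-in for the WWAN or WLAN cluster, holding one pooled connection."""
    name: str
    connection_path: Optional[str] = None  # network path the pooled connection was opened on

    def send(self, current_path: str) -> str:
        if self.connection_path is None:
            # No pooled connection: open one on whatever path is active now.
            self.connection_path = current_path
        if self.connection_path != current_path:
            # The pooled connection is reused even though its path is gone,
            # so the request waits until the request timeout fires.
            return "upstream request timeout"
        return "ok"


def simulate() -> List[str]:
    wwan = Cluster("wwan")
    results = [wwan.send("lte")]  # connection opened while on LTE
    # The device drops to 4G. The preferred cluster is still WWAN, so nothing
    # shuts down or restarts the connection that was opened on LTE.
    results.append(wwan.send("4g"))
    results.append(wwan.send("4g"))
    # Eventually the dead connection is torn down ("connection termination").
    wwan.connection_path = None
    results.append(wwan.send("4g"))  # fresh connection on the current path
    return results
```

In this model, requests only recover once the stale connection is dropped, which matches the observed pattern of timeouts followed by a connection termination error and then success.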
Proposed solution
Totally open to suggestions, but the first potential solution that comes to mind is to be more aggressive about terminating the current preferred network's connection when we detect a change via reachability observation (rather than simply sending new requests over the preferred cluster and assuming its connection is good). This could also solve https://github.com/lyft/envoy-mobile/issues/541 to some extent.
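A rough sketch of what this could look like, again as a toy model rather than the actual implementation. ReachabilityObserver, on_network_changed, reset_connections, and the reset_on_change flag are all hypothetical names, and the sketch assumes that a reachability callback fires on the LTE to 4G transition, which would need to be verified on real devices.

```python
from typing import Dict, List, Optional


class Cluster:
    """Hypothetical stand-in for a per-network cluster with one pooled connection."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.connection_path: Optional[str] = None  # path the connection was opened on

    def send(self, current_path: str) -> str:
        if self.connection_path is None:
            self.connection_path = current_path  # open a fresh connection
        # A connection opened on a path that no longer exists never gets a response.
        return "ok" if self.connection_path == current_path else "upstream request timeout"

    def reset_connections(self) -> None:
        # Proposed behavior: drop pooled connections so the next request reconnects.
        self.connection_path = None


class ReachabilityObserver:
    """Hypothetical observer that picks the preferred cluster on network changes."""

    def __init__(self, clusters: Dict[str, Cluster], reset_on_change: bool) -> None:
        self.clusters = clusters
        self.reset_on_change = reset_on_change
        self.preferred = "wlan"

    def on_network_changed(self, preferred: str) -> None:
        self.preferred = preferred
        if self.reset_on_change:
            # Terminate the preferred network's connections instead of assuming they are good.
            self.clusters[preferred].reset_connections()

    def send(self, current_path: str) -> str:
        return self.clusters[self.preferred].send(current_path)


def run(reset_on_change: bool) -> List[str]:
    clusters = {"wlan": Cluster("wlan"), "wwan": Cluster("wwan")}
    observer = ReachabilityObserver(clusters, reset_on_change)
    observer.on_network_changed("wwan")
    results = [observer.send("lte")]
    observer.on_network_changed("wwan")  # LTE -> 4G: same preferred cluster, new path
    results.append(observer.send("4g"))
    return results
```

With reset_on_change disabled the second request times out, as in the reports; with it enabled the second request reconnects on the new path. The tradeoff is that any in-flight requests on the terminated connection would fail, so the real change would need to decide how to handle those.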