ios: connections hang when switching between networks

rebello95 commented 5 years ago

Problem

(Continuing a discussion that started in Slack here.)

We received reports of the Lyft alpha app stalling while using Envoy Mobile in the following scenario:

User opens the app on wifi and makes some requests
User switches to cellular (e.g., by getting into an elevator)
Requests start failing, with onError called with upstream request timeout errors
Exactly 2 minutes after requests start failing, requests start succeeding again

During this time, the client is making requests every few seconds, and each one fails after timing out (15s later, since the timeout specified is 15000ms).

The very last request that fails before requests start succeeding again fails with a different error: "upstream connect error or disconnect/reset before headers. reset reason: connection termination". After this, requests start working again.

Reproducing

I was able to reproduce the above issue in a similar fashion by doing the following:

Start Envoy Mobile on a physical device connected to LTE
Disable LTE and allow the phone to switch to 4G
Requests start stalling / timing out
After some time, requests will start working again (presumably after the upstream connection is shut down)

Hypothesis

Since we observe reachability and switch clusters based on preferred networks within Envoy Mobile, we previously assumed that we'd be using the correct WWAN/WLAN cluster for requests.

However, when we switch preferred networks, the cluster change is done lazily - i.e., we don't switch existing requests over to the new cluster (known issue in https://github.com/lyft/envoy-mobile/issues/541), and only route new requests over the preferred cluster. More importantly, we don't shut down clusters or restart them when the preferred network changes.

I believe this leads us to send requests over a dead connection when the device switches between WiFi access points or different cellular technologies (3G/4G/LTE), because we don't restart connections when these switches occur.

Proposed solution

Totally open to suggestions, but the first thing that comes to mind is to be more aggressive about terminating the current preferred network's connection when we detect a change via reachability observation, rather than simply sending new requests over the preferred cluster and assuming its connection is good. This would potentially also solve https://github.com/lyft/envoy-mobile/issues/541 to some extent.
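
To make the difference concrete, here is a minimal toy model in Python. It is not Envoy Mobile code: `ToyClient`, `Connection`, `on_reachability_change`, and the `"wwan"`/`"LTE"`/`"4G"` labels are all hypothetical names invented for illustration. It assumes one pooled connection per cluster, and that a connection opened on one link (say LTE) is dead once the device moves to another link within the same cluster.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Connection:
    """Hypothetical pooled upstream connection, tagged with the link it was opened on."""
    link: str  # e.g. "LTE" or "4G" (illustrative labels only)


@dataclass
class ToyClient:
    """Toy stand-in for a client that keeps one pooled connection per cluster."""
    drain_on_change: bool          # True models the proposed "aggressive" behavior
    cluster: str = "wwan"          # currently preferred cluster
    link: str = "LTE"              # link the device is currently on
    pools: Dict[str, Connection] = field(default_factory=dict)

    def on_reachability_change(self, cluster: str, link: str) -> None:
        # Both policies update the preferred cluster and current link.
        self.cluster, self.link = cluster, link
        # Only the aggressive policy drops the (possibly dead) pooled connection.
        if self.drain_on_change:
            self.pools.pop(cluster, None)

    def send(self) -> str:
        # Reuse the pooled connection for the preferred cluster, or open a new one.
        conn = self.pools.get(self.cluster)
        if conn is None:
            conn = Connection(link=self.link)
            self.pools[self.cluster] = conn
        # A connection opened on a previous link is treated as dead in this model.
        return "ok" if conn.link == self.link else "timeout"


def run(drain_on_change: bool) -> List[str]:
    """Send one request on LTE, switch to 4G, then send two more."""
    client = ToyClient(drain_on_change=drain_on_change)
    results = [client.send()]
    client.on_reachability_change("wwan", "4G")
    results.append(client.send())
    results.append(client.send())
    return results


if __name__ == "__main__":
    print("lazy:      ", run(drain_on_change=False))
    print("aggressive:", run(drain_on_change=True))
```

In this model the lazy policy keeps timing out after the switch because the stale connection stays in the pool, while draining on the reachability event lets the next request open a fresh connection. The real system would also have to decide what to do with in-flight requests, which this sketch ignores.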

rebello95 commented 4 years ago

@goaway started a follow up discussion here, which should address this problem: https://github.com/envoyproxy/envoy/issues/9231

goaway commented 4 years ago

May be mitigated by https://github.com/lyft/envoy-mobile/pull/614
