envoyproxy / envoy-mobile

Client HTTP and networking library based on the Envoy project for iOS, Android, and more.
https://envoymobile.io
Apache License 2.0

ios: connections hang when switching between networks #601

Open rebello95 opened 5 years ago

rebello95 commented 5 years ago

Problem

(Continuing a discussion that started in Slack here.)

We received some reports of the Lyft alpha app stalling with Envoy Mobile enabled when the following happens:

During this time, the client is making requests every few seconds, and each one fails after timing out (15s later, since the specified timeout is 15000ms).

The very last request that fails before requests start succeeding again fails with the error "upstream connect error or disconnect/reset before headers. reset reason: connection termination". After this, requests start working again.

Reproducing

I was able to reproduce the above issue in a similar fashion by doing the following:

Hypothesis

Since we observe reachability and switch clusters based on preferred networks within Envoy Mobile, we previously assumed that we'd be using the correct WWAN/WLAN cluster for requests.

However, when we switch preferred networks, the cluster change is done lazily: we don't switch existing requests over to the new cluster (a known issue, see https://github.com/lyft/envoy-mobile/issues/541), and only route new requests over the preferred cluster. More importantly, we don't shut down or restart clusters when the preferred network changes.

I believe that this leads us to send requests over a dead connection when the device switches between WiFi access points or different cellular technologies (3G/4G/LTE), because we don't restart connections when these switches occur.

Proposed solution

Totally open to suggestions, but the first thing that comes to mind as a potential solution is to be more aggressive about terminating the current preferred network's connection when we detect a change via reachability observation (rather than simply sending new requests over the preferred cluster and assuming its connection is good). This could also solve https://github.com/lyft/envoy-mobile/issues/541 to some extent.
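
To make the difference between the current lazy behavior and the proposal concrete, here is a rough toy model in Python. It is only a sketch, not Envoy Mobile code: `Router`, `Cluster`, `Connection`, `path_id`, and `reset_on_change` are hypothetical names invented for illustration, and the model assumes that every reachability change invalidates connections opened on the previous network path. It also does not model the eventual "connection termination" that lets requests recover on their own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Connection:
    # Hypothetical identifier for the network path (access point / radio link)
    # that was active when this connection was opened.
    path_id: int


@dataclass
class Cluster:
    """Toy stand-in for a per-network (WLAN or WWAN) cluster with one pooled connection."""
    name: str
    connection: Optional[Connection] = None

    def send(self, current_path_id: int) -> str:
        # Reuse the pooled connection if there is one; otherwise open a new one.
        if self.connection is None:
            self.connection = Connection(path_id=current_path_id)
        # A connection opened on an older path is assumed dead, so the request
        # hangs until the client timeout fires.
        if self.connection.path_id != current_path_id:
            return "timeout"
        return "ok"

    def drain(self) -> None:
        # Drop the pooled connection so the next request opens a fresh one.
        self.connection = None


class Router:
    def __init__(self, reset_on_change: bool) -> None:
        self.reset_on_change = reset_on_change
        self.clusters = {"wlan": Cluster("wlan"), "wwan": Cluster("wwan")}
        self.preferred = "wlan"
        self.path_id = 0

    def on_reachability_change(self, preferred: str) -> None:
        # Assumption: each reachability event means a new underlying path.
        self.preferred = preferred
        self.path_id += 1
        if self.reset_on_change:
            # Proposed behavior: terminate the preferred cluster's connection.
            self.clusters[preferred].drain()

    def request(self) -> str:
        # New requests are always routed over the currently preferred cluster.
        return self.clusters[self.preferred].send(self.path_id)


def run_scenario(reset_on_change: bool) -> list:
    router = Router(reset_on_change)
    results = [router.request()]           # on WiFi
    router.on_reachability_change("wwan")  # WiFi -> cellular
    results.append(router.request())
    router.on_reachability_change("wlan")  # cellular -> WiFi
    results.append(router.request())
    return results


print("lazy:      ", run_scenario(reset_on_change=False))
print("aggressive:", run_scenario(reset_on_change=True))
```

In this model the lazy router times out on the last request because the WLAN cluster still holds the connection it opened before the first switch, while the aggressive router opens a new connection and succeeds. Whether draining on every reachability event is too expensive in practice is not something this sketch addresses.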

rebello95 commented 4 years ago

@goaway started a follow-up discussion here, which should address this problem: https://github.com/envoyproxy/envoy/issues/9231

goaway commented 4 years ago

May be mitigated by https://github.com/lyft/envoy-mobile/pull/614