envoyproxy / envoy-mobile

Client HTTP and networking library based on the Envoy project for iOS, Android, and more.
https://envoymobile.io
Apache License 2.0

Request Filters for Request's Retries #782

Open Augustyniak opened 4 years ago

Augustyniak commented 4 years ago

Proposal

Allow EnvoyMobile request filters to be run for each retry request it performs.

Introduction

Currently, EnvoyMobile's request filter chain is run only once for any network request it performs. This is true even for requests with multiple retries.

Let's say that we have a retry policy that allows up to 3 retries of a request, and that we number the attempts 0 (the original request), 1, 2 and 3. Before a request is performed, EnvoyMobile lets us modify it using registered filters. Today we can modify the request only once, before attempt 0 is made. EnvoyMobile then performs attempts 1, 2 and 3 as needed, and we have no way to modify the requests that are made as part of these retries.
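The behavior described above can be modeled in a few lines of Python. This is only an illustrative sketch, not EnvoyMobile code: `send_with_retries`, `counting_filter` and the `x-filter-run` header are hypothetical names, and the model assumes the worst case in which every retry is actually performed.

```python
from typing import Callable, Dict, List

Headers = Dict[str, str]
RequestFilter = Callable[[Headers], Headers]


def send_with_retries(
    headers: Headers, filters: List[RequestFilter], max_retries: int
) -> List[Headers]:
    """Hypothetical model of the current behavior: the filter chain runs
    once, and every attempt reuses the headers it produced."""
    for request_filter in filters:
        headers = request_filter(headers)
    # Attempt 0 plus `max_retries` retries all carry identical headers.
    return [dict(headers) for _ in range(max_retries + 1)]


filter_runs = {"count": 0}


def counting_filter(headers: Headers) -> Headers:
    """Hypothetical filter that records how many times it has been invoked."""
    filter_runs["count"] += 1
    return {**headers, "x-filter-run": str(filter_runs["count"])}


attempts = send_with_retries({":path": "/v1/foo"}, [counting_filter], max_retries=3)
print(len(attempts))         # 4 attempts: 0, 1, 2, 3
print(filter_runs["count"])  # 1: the filter ran only before attempt 0
```

Under the proposal, the filter would instead be invoked once per attempt, so each of the four attempts could carry different headers.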

Issue

At Lyft, we are working on extending our mobile fault injection capabilities. To that end, we are actively working on injecting 'fault injection HTTP headers' into random network requests that our mobile applications perform, in order to understand how our apps behave under degraded server and/or connectivity conditions. Fault injection HTTP headers are special HTTP headers supported by Envoy, which is used by Lyft's server infrastructure. They are documented in Envoy's HTTP fault injection filter documentation and they include the following headers: x-envoy-fault-abort-request, x-envoy-fault-delay-request and x-envoy-fault-throughput-response.

What is explained below is true for all of these headers, but let's look at the x-envoy-fault-abort-request header specifically because it illustrates the issue best. Let's say that we have a request v1/foo and we want to check how our application behaves when 50% of requests of this type fail with a 400 HTTP status code.

We can use the EnvoyMobile filter chain to add the x-envoy-fault-abort-request HTTP header and set its value to 400. The problem is that we cannot specify that this header should be added to only 50% of outgoing requests, retries included: we can either not add it to a request at all, or add it and accept that it is going to be added to the original request and all of its retries.

Going back to our example, we want to simulate a 50% failure rate with a 400 status code for v1/foo network requests, and our default retry policy allows up to 3 retries of any request. With the current capabilities of EnvoyMobile, we can add the x-envoy-fault-abort-request: 400 HTTP header to an outgoing network request (with a 50% chance of it being added), but whenever it is added we end up with all 4 attempts of this request failing with a 400 status code, since each retry of the request also contains the x-envoy-fault-abort-request: 400 HTTP header.

This makes it impossible for us to test scenarios in which only a portion of the attempts to perform a given request fail with a given status code.
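To make the difference concrete, here is a rough simulation of the two injection strategies. It is a sketch under simplifying assumptions, not a model of Envoy itself: an attempt carrying x-envoy-fault-abort-request fails and is retried, an attempt without it succeeds, and the function names are hypothetical.

```python
import random

ATTEMPTS = 4  # attempt 0 plus up to 3 retries
INJECT_PROBABILITY = 0.5


def exhausts_all_attempts(rng: random.Random, per_attempt: bool) -> bool:
    """Return True if every attempt of one request fails.

    Assumption: an attempt with the abort header fails with 400 and is
    retried; an attempt without the header succeeds immediately.
    """
    inject = rng.random() < INJECT_PROBABILITY  # decision made before attempt 0
    for attempt in range(ATTEMPTS):
        if per_attempt and attempt > 0:
            # Proposed behavior: the filter runs again for each retry.
            inject = rng.random() < INJECT_PROBABILITY
        if not inject:
            return False  # this attempt went through without the fault
    return True


def total_failure_rate(per_attempt: bool, trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of simulated requests for which all attempts failed."""
    rng = random.Random(seed)
    failures = sum(exhausts_all_attempts(rng, per_attempt) for _ in range(trials))
    return failures / trials


current = total_failure_rate(per_attempt=False)
proposed = total_failure_rate(per_attempt=True)
print(f"filters run once per request: {current:.3f}")
print(f"filters run once per attempt: {proposed:.3f}")
```

With filters run once per request, roughly half of all requests fail on every attempt. With filters run per attempt, a request exhausts all four attempts only when four independent coin flips all inject the header, which happens with probability 0.5^4 = 6.25%.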

Augustyniak commented 4 years ago

Here is another example where being able to run filters for every individual upstream request that Envoy makes would be helpful.

At Lyft, we send an x-timestamp-ms HTTP header representing the current client NTP timestamp as part of every upstream request our application makes. With our legacy HTTP stack, we are able to update the value of this header for every upstream request we make. With the EnvoyMobile stack, we can only set it once for a given downstream request, and EnvoyMobile reuses this value for every upstream request associated with that downstream request.

With our default configuration, we wait up to 15 seconds for every upstream request to finish before it times out and we attempt to perform it again. For requests that support multiple retries, we need to keep updating the value of x-timestamp-ms for every upstream request or it becomes stale.

With some requests, it is entirely possible for the number of upstream requests we perform for a given downstream request to reach numbers as high as 20 when a user has a weak internet connection or there is an outage of one of our services. With these hypothetical 20 upstream requests, the value of the x-timestamp-ms HTTP header could end up being off by 20 * 15 seconds (upstream request timeout) = 300 seconds, because EnvoyMobile doesn't allow us to update the value of x-timestamp-ms on the upstream requests it performs.
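The staleness arithmetic can be written out as a small helper. This is a back-of-the-envelope sketch with hypothetical function names; it assumes every upstream request runs for the full 15-second timeout and that retries start immediately, with no backoff.

```python
UPSTREAM_TIMEOUT_S = 15  # per-upstream-request timeout from the example above


def header_age_at_attempt_start(attempt_index: int, timeout_s: int = UPSTREAM_TIMEOUT_S) -> int:
    """Age of x-timestamp-ms, in seconds, when attempt `attempt_index`
    (0-based) is sent, given that the header was set once before attempt 0."""
    return attempt_index * timeout_s


def header_age_after_last_timeout(upstream_requests: int, timeout_s: int = UPSTREAM_TIMEOUT_S) -> int:
    """Age of x-timestamp-ms, in seconds, once the last upstream request
    has timed out."""
    return upstream_requests * timeout_s


print(header_age_at_attempt_start(19))    # 285: age when the 20th request is sent
print(header_age_after_last_timeout(20))  # 300: age when the 20th request times out
```

If the filter chain ran for each upstream request, the header could be refreshed before every attempt and its age at send time would stay close to zero.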