We came across a similar problem at Zomato. The issue is that the retry budget is based on concurrency rather than requests. Although concurrency is an attractive attribute to use, since it adapts as latency changes in real time, it also makes the retry budget pointless in the common scenario where per-container upstream concurrency is so low that even a 20% budget rounds down to 0 retries. As a temporary solution we implemented this at the application level, but solving it at the Envoy level would still be great. At the application level, we observe throughput over a decent-sized window, e.g. 10 seconds, and from that we derive the retry quota for the next window based on the budget. For example, if the retry budget is 20% and the container initiated 100 requests (10 req/sec) in the first 10-second window, then for the next 10-second window the interceptor will allow at most 20 retries. Because the implementation uses previously observed throughput, even if latency changes and concurrency shifts a bit, the number of retries stays conservative.
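A minimal sketch of that tumbling-window scheme, in Go; the type and method names here are illustrative assumptions, not Zomato's actual interceptor:

```go
package retrybudget

import (
	"sync"
	"time"
)

// WindowedRetryBudget is a sketch of the tumbling-window scheme described
// above: the request count observed in the previous window sets the retry
// quota for the current window.
type WindowedRetryBudget struct {
	mu          sync.Mutex
	budget      float64       // e.g. 0.20 for a 20% retry budget
	window      time.Duration // e.g. 10 * time.Second
	windowStart time.Time
	requests    int // first attempts seen in the current window
	prevReqs    int // first attempts seen in the previous window
	retries     int // retries already allowed in the current window
}

func New(budget float64, window time.Duration) *WindowedRetryBudget {
	return &WindowedRetryBudget{budget: budget, window: window, windowStart: time.Now()}
}

// roll starts a new window once the current one has expired.
// (Simplification: after several idle windows, prevReqs still carries
// the last active window's count.)
func (b *WindowedRetryBudget) roll(now time.Time) {
	if now.Sub(b.windowStart) >= b.window {
		b.prevReqs = b.requests
		b.requests = 0
		b.retries = 0
		b.windowStart = now
	}
}

// OnRequest records an initial (non-retry) request.
func (b *WindowedRetryBudget) OnRequest() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.roll(time.Now())
	b.requests++
}

// AllowRetry reports whether a retry fits within the quota derived from the
// previous window's throughput (e.g. 100 requests * 20% = 20 retries).
func (b *WindowedRetryBudget) AllowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.roll(time.Now())
	quota := int(float64(b.prevReqs) * b.budget)
	if b.retries >= quota {
		return false
	}
	b.retries++
	return true
}
```

An interceptor would call OnRequest on every first attempt and gate each retry on AllowRetry; the sliding-window and jitter refinements mentioned next would replace roll.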
This implementation could be improved, though: instead of tumbling windows, use a sliding-window algorithm, and apply jitter to how the window start time is picked, so that a sudden degradation won't concentrate a burst of retries in one small window.
Something like this could be implemented in Envoy as a separate filter.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Context
There are currently three main controls around retries:
- the per-route maximum number of retries (num_retries, also settable via x-envoy-max-retries);
- the max_retries retry circuit breaker, which caps concurrently active retries per cluster;
- retry budgets (the retry_budget circuit breaker), which limit active retries to a percentage of active requests.
The challenge with the three mechanisms above is that they don't protect against retry storms when the number of clients is large. Consider a service called by 10,000 servers, each sending a limited amount of traffic, say 10 req/sec on average. Finally, the configuration allows 2 retries, with a 40ms per-try timeout and a 100ms total timeout. If the upstream service suddenly becomes unavailable, a retry storm starts immediately:
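With these numbers, each failing request issues both retries within its 100ms total timeout (the original attempt times out at 40ms, the first retry at 80ms, and the second retry starts at 80ms), so the offered load triples:

```
baseline load:   10,000 servers x 10 req/s        = 100,000 req/s
with 2 retries:  100,000 req/s x (1 + 2 retries)  = up to 300,000 req/s
```

None of the per-client controls trips, because each individual client is only retrying a handful of its own requests.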
x-booking-attempt-count is added by Envoy, and we cannot rate limit based on it.

Needs
We would need to add something to RateLimit.Action to rate limit retries only.
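As a purely hypothetical sketch of what that could look like (the retry_attempt action below does not exist in Envoy today; it is the kind of addition this issue is asking for), a route's rate limit configuration might read:

```yaml
rate_limits:
- actions:
  # Hypothetical new action: emit a descriptor entry only when the
  # request is a retry, so the rate limit service can throttle
  # retries separately from first attempts.
  - retry_attempt: {}
  # Existing action, shown for context: keyed by the target cluster.
  - destination_cluster: {}
```

Since an action that appends no entry causes the whole descriptor to be skipped, such an action would make this rate limit apply to retries only, leaving first attempts untouched.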