envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Rate limit on retry (x-envoy-attempt-count or anything else) #24937

Closed yfouquet closed 1 year ago

yfouquet commented 1 year ago

Context

There are currently three main controls around retries:

The challenge with the three mechanisms above is that they don't protect against a retry storm when the number of clients is large. Consider a service that is called by 10,000 servers. Each of these servers sends a limited amount of traffic, say 10 req/sec on average. The configuration allows for 2 retries, with a timeout of 40ms per try and 100ms in total. If the upstream service suddenly becomes unavailable, a retry storm will suddenly start:
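A back-of-the-envelope sketch of the load implied by these figures, assuming every request fails and every client uses all of its allowed retries (the function name `worst_case_load` is purely illustrative):

```python
def worst_case_load(clients: int, rps_per_client: int, max_retries: int) -> dict:
    """Aggregate request rate hitting the upstream when every attempt fails."""
    base_rps = clients * rps_per_client   # first attempts only
    retry_rps = base_rps * max_retries    # each failed request is retried max_retries times
    return {
        "base_rps": base_rps,
        "retry_rps": retry_rps,
        "total_rps": base_rps + retry_rps,
    }


# Figures from the scenario above: 10,000 clients, 10 req/sec each, 2 retries.
load = worst_case_load(clients=10_000, rps_per_client=10, max_retries=2)
print(load)  # {'base_rps': 100000, 'retry_rps': 200000, 'total_rps': 300000}
```

Under these assumptions the upstream sees up to three times its normal traffic at the moment it is least able to handle it.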

Needs

We would need to add something in RateLimit.Action to rate limit retries only.

eightnoteight commented 1 year ago

We came across a similar problem at Zomato. The issue is that the retry budget is based on concurrency rather than on requests. Concurrency is a good signal because it tracks latency changes in real time, but it also makes the retry budget pointless in most scenarios where per-container upstream concurrency is so low that even a 20% budget rounds down to 0 retries. As a temporary solution we implemented this at the application level, but solving it at the Envoy level would still be great.

At the application level, we observe throughput over a decent-sized window, e.g. 10 seconds, and use it to set the retry quota for the next window according to the budget. For example, if the retry budget is 20% and the container initiated 100 requests (10 req/sec) in the first 10-second window, then the interceptor will allow at most 20 retries in the next 10-second window. Because the implementation uses previously observed throughput, the number of retries stays conservative even if latency changes and concurrency shifts a bit.
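The windowed budget described above could look roughly like the following. This is a minimal sketch, not the actual Zomato interceptor: the class name `TumblingRetryBudget`, its methods, and the choice to count only first attempts as requests are all assumptions made for illustration.

```python
import time


class TumblingRetryBudget:
    """Retry quota for the current window = budget_ratio * requests seen in the previous window."""

    def __init__(self, budget_ratio=0.2, window_seconds=10.0, clock=time.monotonic):
        self.budget_ratio = budget_ratio
        self.window_seconds = window_seconds
        self._clock = clock
        self._window_start = clock()
        self._requests_in_window = 0
        self._retries_in_window = 0
        self._retry_quota = 0  # no history yet, so the first window allows no retries

    def _roll(self):
        """Advance to the current window if the previous one has ended."""
        elapsed = self._clock() - self._window_start
        if elapsed < self.window_seconds:
            return
        windows_passed = int(elapsed // self.window_seconds)
        # If more than one window elapsed, the immediately preceding window was idle.
        previous = self._requests_in_window if windows_passed == 1 else 0
        self._retry_quota = int(previous * self.budget_ratio)
        self._window_start += windows_passed * self.window_seconds
        self._requests_in_window = 0
        self._retries_in_window = 0

    def record_request(self):
        """Call once per original (non-retry) request."""
        self._roll()
        self._requests_in_window += 1

    def try_acquire_retry(self) -> bool:
        """Return True if a retry may be sent in the current window."""
        self._roll()
        if self._retries_in_window >= self._retry_quota:
            return False
        self._retries_in_window += 1
        return True


# Usage with a fake clock: 100 requests in the first window, 20% budget.
now = [0.0]
budget = TumblingRetryBudget(budget_ratio=0.2, window_seconds=10.0, clock=lambda: now[0])
for _ in range(100):
    budget.record_request()
now[0] = 10.0  # move into the next window
granted = sum(budget.try_acquire_retry() for _ in range(50))
print(granted)  # 20
```

The injectable `clock` exists only to make the behaviour easy to check; in a real interceptor the default monotonic clock would be used, and the counters would need a lock if shared across threads.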

This implementation could be improved, though: instead of tumbling windows, it could use a sliding-window algorithm and apply jitter to how the window start time is picked, so that a sudden degradation won’t trigger a sudden retry storm within a small window.
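A sliding-window variant might be sketched as follows. The names (`SlidingRetryBudget`, `jitter_fraction`) are hypothetical, and the jitter is applied here to each instance's window length, which is only one possible reading of the suggestion. Note that this version budgets retries against requests in the trailing window rather than in the previous fixed window.

```python
import random
import time
from collections import deque


class SlidingRetryBudget:
    """Allow retries up to budget_ratio * requests observed in the trailing window."""

    def __init__(self, budget_ratio=0.2, window_seconds=10.0, jitter_fraction=0.1,
                 clock=time.monotonic, rng=random.random):
        self.budget_ratio = budget_ratio
        # Jitter the window length per instance so that many clients do not
        # expire their history at the same moment: window * (1 +/- jitter_fraction).
        offset = (2.0 * rng() - 1.0) * jitter_fraction
        self.window_seconds = window_seconds * (1.0 + offset)
        self._clock = clock
        self._requests = deque()  # timestamps of original requests
        self._retries = deque()   # timestamps of granted retries

    def _evict(self, now):
        """Drop timestamps that have fallen out of the trailing window."""
        cutoff = now - self.window_seconds
        for log in (self._requests, self._retries):
            while log and log[0] <= cutoff:
                log.popleft()

    def record_request(self):
        now = self._clock()
        self._evict(now)
        self._requests.append(now)

    def try_acquire_retry(self) -> bool:
        now = self._clock()
        self._evict(now)
        allowed = int(len(self._requests) * self.budget_ratio)
        if len(self._retries) >= allowed:
            return False
        self._retries.append(now)
        return True


# Usage with a fake clock and jitter disabled (rng returning 0.5 gives offset 0).
now = [0.0]
budget = SlidingRetryBudget(clock=lambda: now[0], rng=lambda: 0.5)
for _ in range(50):
    budget.record_request()
granted = sum(budget.try_acquire_retry() for _ in range(20))
print(granted)  # 10
```

Keeping one timestamp per request costs memory proportional to throughput; at higher rates, bucketed counters would be a cheaper approximation of the same idea.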

Something like this could be implemented in Envoy as a separate filter.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 year ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.