We came across a similar problem at Zomato. The issue is that the retry budget is based on concurrency rather than requests. Although concurrency is an attractive attribute to use, since it adapts as latency changes in real time, it also makes the retry budget pointless in the common scenario where per-container upstream concurrency is so low that even a 20% budget rounds down to 0 retries. As a temporary solution we implemented this at the application level, but solving it at the Envoy level would still be great. At the application level, we observe throughput over a decent-sized window, e.g. 10 seconds, and from that we derive the retry quota for the next window based on the budget. For example, if the retry budget is 20% and the container initiated 100 requests (10 req/sec) in the first 10-second window, then for the next 10-second window the interceptor will allow at most 20 retries. Because the implementation uses previously observed throughput, even if latency changes and concurrency shifts a bit, the number of retries stays conservative.
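A minimal sketch of that tumbling-window scheme, in Go; the type and method names here are illustrative assumptions, not Zomato's actual interceptor:

```go
package retrybudget

import (
	"sync"
	"time"
)

// WindowedRetryBudget is a sketch of the tumbling-window scheme described
// above: the request count observed in the previous window sets the retry
// quota for the current window.
type WindowedRetryBudget struct {
	mu          sync.Mutex
	budget      float64       // e.g. 0.20 for a 20% retry budget
	window      time.Duration // e.g. 10 * time.Second
	windowStart time.Time
	requests    int // first attempts seen in the current window
	prevReqs    int // first attempts seen in the previous window
	retries     int // retries already allowed in the current window
}

func New(budget float64, window time.Duration) *WindowedRetryBudget {
	return &WindowedRetryBudget{budget: budget, window: window, windowStart: time.Now()}
}

// roll starts a new window once the current one has expired.
// (Simplification: after several idle windows, prevReqs still carries
// the last active window's count.)
func (b *WindowedRetryBudget) roll(now time.Time) {
	if now.Sub(b.windowStart) >= b.window {
		b.prevReqs = b.requests
		b.requests = 0
		b.retries = 0
		b.windowStart = now
	}
}

// OnRequest records an initial (non-retry) request.
func (b *WindowedRetryBudget) OnRequest() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.roll(time.Now())
	b.requests++
}

// AllowRetry reports whether a retry fits within the quota derived from the
// previous window's throughput (e.g. 100 requests * 20% = 20 retries).
func (b *WindowedRetryBudget) AllowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.roll(time.Now())
	quota := int(float64(b.prevReqs) * b.budget)
	if b.retries >= quota {
		return false
	}
	b.retries++
	return true
}
```

An interceptor would call OnRequest on every first attempt and gate each retry on AllowRetry; the sliding-window and jitter refinements mentioned next would replace roll.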
This implementation could be improved, though: instead of tumbling windows, use a sliding-window algorithm, and apply jitter to how the window start time is picked, so that a sudden degradation won't concentrate a burst of retries in one small window.
Something like this could be implemented in Envoy as a separate filter.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Context
There are currently three main controls around retries:
- the per-route maximum number of retries (num_retries, also settable via x-envoy-max-retries);
- the max_retries retry circuit breaker, which caps concurrently active retries per cluster;
- retry budgets (the retry_budget circuit breaker), which limit active retries to a percentage of active requests.
The challenge with the three mechanisms above is that they don't protect against retry storms when the number of clients is large. Consider a service called by 10,000 servers, each sending a limited amount of traffic, say 10 req/sec on average. Finally, the configuration allows 2 retries, with a 40ms per-try timeout and a 100ms total timeout. If the upstream service suddenly becomes unavailable, a retry storm starts immediately:
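With these numbers, each failing request issues both retries within its 100ms total timeout (the original attempt times out at 40ms, the first retry at 80ms, and the second retry starts at 80ms), so the offered load triples:

```
baseline load:   10,000 servers x 10 req/s        = 100,000 req/s
with 2 retries:  100,000 req/s x (1 + 2 retries)  = up to 300,000 req/s
```

None of the per-client controls trips, because each individual client is only retrying a handful of its own requests.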
x-booking-attempt-count is added by Envoy, and we cannot rate limit based on it.

Needs
We would need to add something to RateLimit.Action to rate limit retries only.
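As a purely hypothetical sketch of what that could look like (the retry_attempt action below does not exist in Envoy today; it is the kind of addition this issue is asking for), a route's rate limit configuration might read:

```yaml
rate_limits:
- actions:
  # Hypothetical new action: emit a descriptor entry only when the
  # request is a retry, so the rate limit service can throttle
  # retries separately from first attempts.
  - retry_attempt: {}
  # Existing action, shown for context: keyed by the target cluster.
  - destination_cluster: {}
```

Since an action that appends no entry causes the whole descriptor to be skipped, such an action would make this rate limit apply to retries only, leaving first attempts untouched.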