yan-hic closed this issue 3 years ago.
Thanks for taking the time to open an issue.
The retry window starts at the time when the first call to the decorated function is initiated. When a call to the decorated function ends because the function throws, `@retry` re-throws if the exception is not one it retries on. Otherwise, `@retry` picks a delay to wait before making the next call. If that next call would start after the retry window has ended, `@retry` logs the line you observed and re-throws immediately.
You wrote that it "logs the attempts for the individual calls but gives up after total runtime + random delay > 240s".
Yes, the retry window is the window of time (wall-clock time) that starts when the first call is initiated. As the README puts it, "`retry_window_after_first_call_in_seconds` is the maximum number of seconds after the first call was initiated, where we would still do a new attempt."
It does not refer to the time spent waiting.
Regarding "as the deadline should apply to the retries for a given call, not to the runtime":
Time spent in the decorated function also counts. This is by design. If the decorated function supports a timeout (for example, because it makes an HTTP call), then the retry window should be larger than the timeout; otherwise you can end up in the situation where the retry window is already over after the first call.
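To make the timing concrete, here is a minimal sketch of a synchronous retry loop with these semantics. It is not the library's implementation: `call_with_retry_window`, its parameters, `FakeClock` and `run` are hypothetical names for illustration, and the delay is passed in as a plain function. The window is measured from when the first call starts, so the time a slow call spends running counts against it.

```python
import logging
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")
log = logging.getLogger(__name__)


def call_with_retry_window(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_calls: int,
    window_s: float,
    next_delay: Callable[[int], float],
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Hypothetical sketch: retry fn within a wall-clock window that starts
    when the first call is initiated (time spent inside fn counts too)."""
    deadline = clock() + window_s
    for attempt in range(max_calls):
        try:
            return fn()
        except retry_on:
            # Exceptions not listed in retry_on propagate without a retry.
            if attempt == max_calls - 1:
                raise  # no calls left
            delay = next_delay(attempt)
            # Compare the *start* of the next call with the deadline.
            if clock() + delay > deadline:
                log.warning("Next attempt would be after retry deadline. No point retrying.")
                raise
            sleep(delay)
    raise ValueError("max_calls must be at least 1")


class FakeClock:
    """Simulated time, so the example runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def run(call_duration_s: float, window_s: float) -> int:
    """Return how many calls are made when every call times out."""
    clock = FakeClock()
    calls = 0

    def slow_call() -> None:
        nonlocal calls
        calls += 1
        clock.now += call_duration_s  # the call runs until its timeout...
        raise TimeoutError("request timed out")  # ...and then fails

    try:
        call_with_retry_window(slow_call, (TimeoutError,), max_calls=4,
                               window_s=window_s, next_delay=lambda _: 15.0,
                               clock=clock, sleep=clock.sleep)
    except TimeoutError:
        pass
    return calls


print(run(call_duration_s=50, window_s=60))   # 1: next start (65s) is past the window
print(run(call_duration_s=50, window_s=240))  # 4: starts at 0, 65, 130, 195s
```

With a 50 s call and a 60 s window, the first failure already leaves no room for a second attempt, which is the situation described above.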
A good retry window to timeout ratio depends a bit on your situation, but in our codebase we usually start with a retry window 3× or 4× the timeout, to ensure that there is room for 2 or 3 retries after the initial call. We also prefer an aggressive timeout with more attempts, over longer timeouts with fewer attempts, because request durations tend to be fast at the 50th percentile, but they can be slow in the 95th percentile. Rather than waiting even longer in an already unlucky case, and wasting the retry window, we prefer to retry early and hope to be less unlucky in that attempt.
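As a rough check of that rule of thumb, the sketch below counts how many retries can still start inside the window when every call runs until its timeout. It assumes a fixed wait between calls (real delays are randomized), and `retries_that_fit` is a hypothetical helper, not part of the library.

```python
def retries_that_fit(timeout_s: float, window_s: float, delay_s: float) -> int:
    """Count retries after the first call that would still start within the
    window, assuming every call runs until its timeout and every wait is delay_s."""
    if timeout_s + delay_s <= 0:
        raise ValueError("timeout_s + delay_s must be positive")
    retries = 0
    next_start = timeout_s + delay_s  # start time of the first retry
    while next_start <= window_s:
        retries += 1
        next_start += timeout_s + delay_s
    return retries


for ratio in (1, 2, 3, 4):
    print(f"window = {ratio}x timeout -> {retries_that_fit(10, 10 * ratio, 1)} retries")
```

With a 10 s timeout and 1 s waits, this prints 0, 1, 2 and 3 retries for ratios 1× to 4×, in line with the 2 or 3 retries mentioned for a 3× or 4× window.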
Regarding the workaround to "set the value to a very large duration so retries does not time out e.g. 86400 (1 day)":
Note that the retry window also affects the delay between retries. The delay is computed such that if every call failed instantly, the time spent waiting in between the `max_calls` calls is at most equal to the retry window. Due to jitter, the expected time spent waiting is half of the retry window. So if you set the retry window to a very high value, the wait between attempts will also become longer.
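One way to get a schedule with those two properties is exponential backoff with "full jitter": cap the waits so that they double each time and sum to exactly the retry window, then draw each actual wait uniformly between 0 and its cap, which makes the expected total half the window. This is a sketch under that assumption; the library's exact formula may differ, and `max_delays` and `jittered_delays` are hypothetical names.

```python
import random


def max_delays(max_calls: int, window_s: float) -> list[float]:
    """Caps for the max_calls - 1 waits; they double each time and sum to window_s."""
    gaps = max_calls - 1
    if gaps <= 0:
        return []  # a single call never waits
    base = window_s / (2 ** gaps - 1)  # 1 + 2 + ... + 2**(gaps-1) == 2**gaps - 1
    return [base * 2 ** k for k in range(gaps)]


def jittered_delays(max_calls: int, window_s: float, rng: random.Random) -> list[float]:
    """Actual waits: each drawn uniformly from [0, cap] ("full jitter")."""
    return [rng.uniform(0.0, cap) for cap in max_delays(max_calls, window_s)]


caps = max_delays(max_calls=4, window_s=70)
print(caps)  # [10.0, 20.0, 40.0]: at most 70s of waiting in total

rng = random.Random(0)
totals = [sum(jittered_delays(4, 70, rng)) for _ in range(10_000)]
print(sum(totals) / len(totals))  # close to 35, i.e. half the window
```

Under this sketch, raising the window to 86400 with `max_calls=4` makes the last cap 86400 × 4/7 ≈ 49,371 s, about 13.7 hours, which is why a very large window also means very long waits.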
Does this make sense?
Closing since `retry_window_after_first_call_in_seconds` is working as intended.
The intended behavior, I think, is to time out the attempts for a given API call, e.g. try 4 times but raise an error if the last attempt is beyond 60s.
However, this is not how it's working currently, at least with `retry_async`. The value of `retry_window_after_first_call_in_seconds` is currently the total runtime. To illustrate, with `retry_window_after_first_call_in_seconds = 240`, retries logs the attempts for the individual calls but gives up after total runtime + random delay > 240s. The error `Next attempt would be after retry deadline. No point retrying.` is misleading/incorrect, as the deadline should apply to the retries for a given call, not to the runtime. A debug output shows:
I'm not sure what the random delay was here, but with the runtime (duration) at 200s, if the delay was more than 40s, retries would raise the underlying error.
The current (bad) workaround is to set the value to a very large duration so retries does not time out, e.g. 86400 (1 day).