**elidhu** opened 3 weeks ago
We do have a lot of records in a single zone, from what I can gather here.
We tried to implement a long backoff in a `RetryPolicy`, but this has no effect; I believe the code path circumvents it.
Thanks for the feedback, we’ll investigate asap.
The Python SDK already has a default retry policy. For further investigation, could you follow https://github.com/Azure/azure-sdk-for-python/wiki/Common-issues-about-Python-SDK#debug-guide to share code and detailed logs?
Hi @msyyc - Thank you for taking a look.
~~I am already using the RetryPolicy, and I don't believe it is taking effect. I suspect the code in the SDK is actually circumventing the RetryPolicy because it is accessing `self._client._pipeline`~~

**Update:** I have taken a deeper look, and although the `RetryPolicy` doesn't seem to be taking effect, the code path doesn't circumvent it.
For reference I am using the following RetryPolicy configuration:
```yaml
client_retry_policy:
  total_retries: 6
  # backoff_value = 2**(retry_number - 1) * backoff_factor => 20, 40, 80, 160, 320, 640
  backoff_factor: 20
  # min(backoff_max, backoff_value) => e.g. min(640, 320) = 320 when retry_number = 5
  backoff_max: 640
```
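For clarity, this is the schedule I expect those settings to produce (my own sketch of the documented exponential-backoff formula, not SDK code):

```python
def backoff_schedule(total_retries, backoff_factor, backoff_max):
    """Sleep (seconds) before each retry: min(backoff_max, backoff_factor * 2**(n-1))."""
    return [
        min(backoff_max, backoff_factor * 2 ** (n - 1))
        for n in range(1, total_retries + 1)
    ]

print(backoff_schedule(6, 20, 640))  # [20, 40, 80, 160, 320, 640]
```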
Just another thought: if the `RetryPolicy` were actually working, wouldn't I get an error about something other than throttling?
Also, yes I will share logs as soon as I can!
@xiangyan99 @msyyc
So I did some further debugging. Finally, I subclassed the `RetryPolicy` so I could check that it was working. Not all of these overrides are necessary, but I will just paste the exact code I used:
```python
import logging

from azure.core.pipeline.policies import RetryPolicy


class MyRetry(RetryPolicy):
    def is_retry(self, settings, response):
        """Checks if method/status code is retryable.

        Based on whitelists and control variables such as the number of
        total retries to allow, whether to respect the Retry-After header,
        whether this header is present, and whether the returned status
        code is on the list of status codes to be retried upon on the
        presence of the aforementioned header.

        The behavior is:
        - If status_code < 400: don't retry
        - Else if Retry-After present: retry
        - Else: retry based on the safe status code list ([408, 429, 500, 502, 503, 504])

        :param dict settings: The retry settings.
        :param response: The PipelineResponse object
        :type response: ~azure.core.pipeline.PipelineResponse
        :return: True if method/status code is retryable. False if not retryable.
        :rtype: bool
        """
        if response.http_response.status_code < 400:
            return False
        has_retry_after = bool(response.http_response.headers.get("Retry-After"))
        logging.info("has_retry_after: %s", has_retry_after)
        if has_retry_after and self._respect_retry_after_header:
            logging.info("Respecting Retry-After header")
            return True
        if not self._is_method_retryable(
            settings, response.http_request, response=response.http_response
        ):
            logging.info("Method not retryable")
            return False
        is_retryable = (
            settings["total"]
            and response.http_response.status_code in self._retry_on_status_codes
        )
        logging.info("is_retryable: %s", is_retryable)
        return is_retryable

    def sleep(self, settings, transport, response=None):
        """Sleep between retry attempts.

        This method will respect a server's ``Retry-After`` response header
        and sleep the duration of the time requested. If that is not present, it
        will use an exponential backoff. By default, the backoff factor is 0 and
        this method will return immediately.

        :param dict settings: The retry settings.
        :param transport: The HTTP transport type.
        :param response: The PipelineResponse object.
        :type response: ~azure.core.pipeline.PipelineResponse
        """
        if response:
            slept = self._sleep_for_retry(response, transport)
            if slept:
                logging.error(
                    "Used Retry-After header when _respect_retry_after_header is: %s",
                    self._respect_retry_after_header,
                )
                return
        self._sleep_backoff(settings, transport)

    def _sleep_backoff(self, settings, transport):
        """Sleep using exponential backoff. Immediately returns if backoff is 0.

        :param dict settings: The retry settings.
        :param transport: The HTTP transport type.
        """
        backoff = self.get_backoff_time(settings)
        logging.info("backoff: %s", backoff)
        if backoff <= 0:
            return
        transport.sleep(backoff)
```
The important part of the above is that the `sleep` method on the policy ALWAYS uses the `Retry-After` header if it is present. Meaning that regardless of anything else set on the policy, if the API returns a `Retry-After` header, nothing else you do matters. I even set the `_respect_retry_after_header` attribute to `False`; however, it is also not consulted in the `sleep` code path. And as far as I can tell, this API always returns `Retry-After: 10`.
So, in my opinion at least, the bug is here:

```python
if response:
    slept = self._sleep_for_retry(response, transport)
    if slept:
        logging.error(
            "Used Retry-After header when _respect_retry_after_header is: %s",
            self._respect_retry_after_header,
        )
        return
self._sleep_backoff(settings, transport)
```
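To illustrate the claim (a plain-Python stand-in I wrote for this comment, not azure-core code): because `_sleep_for_retry` reports a truthy "slept" whenever the header is present, the backoff branch is never reached.

```python
# Hypothetical stand-in for the quoted sleep() control flow.
# choose_sleep() is my own name; azure-core has no such function.

def choose_sleep(retry_after, backoff):
    """Return the delay the quoted sleep() path would actually use."""
    if retry_after is not None:  # mirrors: slept = self._sleep_for_retry(...)
        return retry_after       # header wins unconditionally
    return backoff               # exponential backoff only without the header

print(choose_sleep(10, 320))    # 10 -> the configured 320 s backoff is ignored
print(choose_sleep(None, 320))  # 320
```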
Maybe the solution is to sleep for `max(retry_after, backoff)`, or something along those lines.
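Something like this (a hypothetical sketch; `fixed_sleep_seconds` is my own name, not an azure-core API):

```python
def fixed_sleep_seconds(retry_after, backoff):
    """Proposed behavior: honor the server's hint, but never sleep
    less than the locally configured exponential backoff."""
    if retry_after is None:
        return backoff
    return max(retry_after, backoff)

print(fixed_sleep_seconds(10, 320))  # 320 -> the local backoff now matters
print(fixed_sleep_seconds(None, 80)) # 80
```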
Just some more thoughts:
I guess you could argue that always respecting the `Retry-After` header is the sensible thing to do, but in this case the API endpoint seems to set it to an arbitrary value, which, after a few retries, causes the throttling exception to bubble up.
If the API set either a higher value or the real time to wait, that would be fine.
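To put rough numbers on it (using my configuration above and the observed `Retry-After: 10`; the totals are my own arithmetic):

```python
# With the header always winning, six retries wait only 6 * 10 = 60 s in total
# before the throttling error bubbles up. The configured exponential schedule
# (backoff_factor=20, backoff_max=640) would have waited far longer.
header_wait = 10 * 6
configured_wait = sum(min(640, 20 * 2 ** (n - 1)) for n in range(1, 7))
print(header_wait, configured_wait)  # 60 1260
```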
Could you share the log of the interaction between client and service by following https://github.com/Azure/azure-sdk-for-python/issues/37002#issuecomment-2309162801? Maybe the service doesn't follow the retry policy guideline.
Hi @elidhu. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.
https://github.com/Azure/azure-sdk-for-python/blob/e4e8351eb9523bff0e6c4597478d04b9ceb9c14c/sdk/network/azure-mgmt-dns/azure/mgmt/dns/v2018_05_01/operations/_record_sets_operations.py#L1095C1-L1110C49
I believe the client should handle throttling by respecting the `Retry-After` header, but it seems that this section of the code does not. There is no way to prevent it from raising an exception when being throttled; even the `RetryPolicy` settings don't seem to be respected.
Is there something we can do here?