Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.

Azure Mgmt DNS does not handle throttling gracefully #37002

Open elidhu opened 3 weeks ago

elidhu commented 3 weeks ago

https://github.com/Azure/azure-sdk-for-python/blob/e4e8351eb9523bff0e6c4597478d04b9ceb9c14c/sdk/network/azure-mgmt-dns/azure/mgmt/dns/v2018_05_01/operations/_record_sets_operations.py#L1095C1-L1110C49

I believe the client should handle throttling by respecting the Retry-After header, but this section of the code does not appear to do so. There is no way to prevent it from raising an exception when being throttled; even the RetryPolicy settings don't seem to be respected.
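
For context, here is a minimal sketch of the kind of setup involved (not the exact production code; the subscription, resource group, and zone names are placeholders, and the retry_* keyword arguments are the standard azure-core ones):

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.dns import DnsManagementClient

    client = DnsManagementClient(
        credential=DefaultAzureCredential(),
        subscription_id="<subscription-id>",  # placeholder
        retry_total=6,            # total retry attempts
        retry_backoff_factor=20,  # exponential backoff factor (seconds)
        retry_backoff_max=640,    # cap on any single backoff (seconds)
    )

    # Listing record sets in a large zone is where the throttling shows up.
    for record_set in client.record_sets.list_by_dns_zone(
        resource_group_name="<resource-group>",  # placeholder
        zone_name="<zone-name>",                 # placeholder
    ):
        print(record_set.name)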

Is there something we can do here?

elidhu commented 3 weeks ago

We do have a lot of records in a single zone. From what I can gather here

We tried to implement a long backoff in a RetryPolicy, but this has no effect; I believe the code path circumvents it.

xiangyan99 commented 2 weeks ago

Thanks for the feedback, we’ll investigate asap.

msyyc commented 2 weeks ago

The Python SDK already has a default retry policy. For further investigation, could you follow https://github.com/Azure/azure-sdk-for-python/wiki/Common-issues-about-Python-SDK#debug-guide and share your code and detailed logs?
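
For example, something like the following usually captures the client/service exchange we need (a sketch only; logging_enable=True turns on azure-core's HTTP logging, and the client construction here is just a placeholder for however you build yours):

    import logging
    import sys

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.dns import DnsManagementClient

    # Route the "azure" SDK loggers to stdout at DEBUG level.
    logger = logging.getLogger("azure")
    logger.setLevel(logging.DEBUG)
    logger.addHandler(logging.StreamHandler(stream=sys.stdout))

    client = DnsManagementClient(
        credential=DefaultAzureCredential(),
        subscription_id="<subscription-id>",  # placeholder
        logging_enable=True,  # log HTTP requests/responses (sensitive headers are redacted)
    )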

elidhu commented 2 weeks ago

Hi @msyyc - Thank you for taking a look.

~~I am already using the RetryPolicy, and I don't believe it is taking effect. I suspect the code in the SDK is actually circumventing the RetryPolicy because it is accessing self._client._pipeline.~~ **I have taken a deeper look, and although the RetryPolicy doesn't seem to be taking effect, the code path doesn't circumvent it.**

For reference, I am using the following RetryPolicy configuration (a quick check of the schedule it implies follows the block):

    client_retry_policy:
      total_retries: 6
      # backoff_value = 2**(retry_number - 1) * backoff_factor => 20, 40, 80, 160, 320, 640
      backoff_factor: 20
      # min(backoff_max, backoff_value) => E.g min(640, 320) = 320 when retry_number = 5
      backoff_max: 640
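
As a sanity check of the comments above, this is the schedule those numbers should produce (my own arithmetic, assuming the usual exponential backoff formula; not SDK code):

    backoff_factor, backoff_max, total_retries = 20, 640, 6
    schedule = [
        min(backoff_max, backoff_factor * 2 ** (retry - 1))
        for retry in range(1, total_retries + 1)
    ]
    print(schedule)  # [20, 40, 80, 160, 320, 640]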

Just another thought: if the RetryPolicy were actually working, wouldn't I get an error about something other than throttling?

Also, yes I will share logs as soon as I can!

elidhu commented 2 weeks ago

@xiangyan99 @msyyc

So I did some further debugging. Finally, I subclassed the RetryPolicy so I could check that it was working. Not all of these overrides are necessary, but I will just paste the exact code I used.

import logging

from azure.core.pipeline.policies import RetryPolicy


class MyRetry(RetryPolicy):
    def is_retry(self, settings, response):
        """Checks if method/status code is retryable.

        Based on whitelists and control variables such as the number of
        total retries to allow, whether to respect the Retry-After header,
        whether this header is present, and whether the returned status
        code is on the list of status codes to be retried upon on the
        presence of the aforementioned header.

        The behavior is:
        -   If status_code < 400: don't retry
        -   Else if Retry-After present: retry
        -   Else: retry based on the safe status code list ([408, 429, 500, 502, 503, 504])

        :param dict settings: The retry settings.
        :param response: The PipelineResponse object
        :type response: ~azure.core.pipeline.PipelineResponse
        :return: True if method/status code is retryable. False if not retryable.
        :rtype: bool
        """
        if response.http_response.status_code < 400:
            return False
        has_retry_after = bool(response.http_response.headers.get("Retry-After"))
        logging.info("has_retry_after: %s", has_retry_after)
        if has_retry_after and self._respect_retry_after_header:
            logging.info("Respecting Retry-After header")
            return True
        if not self._is_method_retryable(
            settings, response.http_request, response=response.http_response
        ):
            logging.info("Method not retryable")
            return False

        is_retryable = (
            settings["total"]
            and response.http_response.status_code in self._retry_on_status_codes
        )

        logging.info("is_retryable: %s", is_retryable)

        return is_retryable

    def sleep(self, settings, transport, response=None):
        """Sleep between retry attempts.

        This method will respect a server's ``Retry-After`` response header
        and sleep the duration of the time requested. If that is not present, it
        will use an exponential backoff. By default, the backoff factor is 0 and
        this method will return immediately.

        :param dict settings: The retry settings.
        :param transport: The HTTP transport type.
        :param response: The PipelineResponse object.
        :type response: ~azure.core.pipeline.PipelineResponse
        """
        if response:
            slept = self._sleep_for_retry(response, transport)
            if slept:
                logging.error(
                    "Used Retry-After header when _respect_retry_after_header is: %s",
                    self._respect_retry_after_header,
                )
                return
        self._sleep_backoff(settings, transport)

    def _sleep_backoff(self, settings, transport):
        """Sleep using exponential backoff. Immediately returns if backoff is 0.

        :param dict settings: The retry settings.
        :param transport: The HTTP transport type.
        """
        backoff = self.get_backoff_time(settings)
        logging.info("backoff: %s", backoff)
        if backoff <= 0:
            return
        transport.sleep(backoff)
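
For completeness, this is roughly how a policy like that gets plugged into the client (a sketch; it assumes the generated client configuration picks up the retry_policy keyword argument, and the identifiers are placeholders):

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.dns import DnsManagementClient

    client = DnsManagementClient(
        credential=DefaultAzureCredential(),
        subscription_id="<subscription-id>",  # placeholder
        # The base RetryPolicy constructor reads the standard retry_* settings.
        retry_policy=MyRetry(
            retry_total=6,
            retry_backoff_factor=20,
            retry_backoff_max=640,
        ),
    )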

The important part of the above is that the sleep method on the policy ALWAYS uses the Retry-After header if it is present. This means that regardless of anything else set on the policy, if the API returns a Retry-After header, nothing you configure changes the sleep duration.

I even set the _respect_retry_after_header attribute to False; however, it is not consulted in the sleep code path either.

As far as I can tell, this API also always returns Retry-After: 10.

So in my opinion at least, the bug is here:

        if response:
            slept = self._sleep_for_retry(response, transport)
            if slept:
                logging.error(
                    "Used Retry-After header when _respect_retry_after_header is: %s",
                    self._respect_retry_after_header,
                )
                return
        self._sleep_backoff(settings, transport)

And maybe the solution is to sleep for max(retry_after, backoff), or something along those lines.
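
Something along these lines is what I have in mind (a rough sketch only; I believe get_retry_after and get_backoff_time are available on the base policy, and the class name is just illustrative):

    from azure.core.pipeline.policies import RetryPolicy


    class MaxOfRetryAfterAndBackoff(RetryPolicy):
        def sleep(self, settings, transport, response=None):
            # Honor the server's Retry-After hint, but never sleep for less
            # than the client's own exponential backoff for this attempt.
            backoff = self.get_backoff_time(settings)
            retry_after = self.get_retry_after(response) if response else None
            delay = max(backoff, retry_after or 0)
            if delay > 0:
                transport.sleep(delay)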

elidhu commented 2 weeks ago

Just some more thoughts:

I guess you could argue that always respecting the Retry-After header is the sensible thing to do, but in this case the API endpoint seems to just set it to an arbitrary value, which, after a few retries, causes the throttling exception to bubble up.

If the API set either a higher value or the real time to wait, then that would be fine.

msyyc commented 2 weeks ago

I hope you can share the logs of the interaction between the client and the service by following https://github.com/Azure/azure-sdk-for-python/issues/37002#issuecomment-2309162801. Maybe the service doesn't follow the retry policy guideline.

github-actions[bot] commented 6 days ago

Hi @elidhu. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.