Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.
MIT License
4.61k stars 2.82k forks source link

Retries doesn't help against timeouts on downloads #34919

Open mattiasb opened 7 months ago

mattiasb commented 7 months ago

Describe the bug

When experiencing timeouts while downloading a blob with Azure Python SDK the whole download will fail when the network is brought back to normal even if there are retries left for the retry_policy.

I have not been able to reproduce this with uploads.

To Reproduce

See https://github.com/mattiasb/blob-issue for more details on how to reproduce (including an example script).

Steps to reproduce the behavior:

  1. Start downloading a sufficiently large blob using this script.
  2. After a little while induce a really large artificial network latency to your network device. For example with this script.
  3. Wait for two rounds of retries to trigger
  4. Reset the network to normal
  5. Watch the download fail even though the retry policy has retries left and the network is back to normal.

Expected behavior

I would expect the retry policy to trigger also on timeouts, such that when I returned the network to normal a later set of retries would let the script finish downloading the file.

Additional context

Again. The best is to see what I do at https://github.com/mattiasb/blob-issue. But I recorded some logs as well which might be worth looking at.

github-actions[bot] commented 7 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.

kashifkhan commented 7 months ago

Thanks for the feedback @mattiasb . We will investigate and get back to you asap.

jalauzon-msft commented 6 months ago

Hi @mattiasb, sorry for the delay. I believe what you are encountering is that there is actually a separate (naive and undocumented) retry policy for larger downloads that use response streaming. In the internal download code there is a retry loop when trying to stream the response body of each chunk. This retry policy has its own retry count and does not interact with the global retry policy. From the stacktrace in your logs, you can see the error is coming from here.

This retry policy is a side-effect of the way response streaming in handled by azure-core and the way the SDK is setup to download the blob in chunks. Streaming the body from the server comes after the initial HTTP response and therefore is not run through the normal transport pipeline. Without this, any error while streaming would cause the whole download to fail.

So, what is likely happening is you are exhausting this retry policy and the failure is bubbling up but is not an error that is handled by the global retry policy and therefore it is possible your request can fail without all the retry attempts being exhausted.

We realize this is not the most ideal implementation, especially since the internal retry count cannot be adjusted and we would like to improve this in the future, but for now this is the likely cause for what you are seeing. In real-world applications, we have found this policy to be effective in mitigating random network interruptions so you may not run into any issues in a production environment. Or for now, you can also consider implementing your own retry over the download call.

mattiasb commented 6 months ago

Or for now, you can also consider implementing your own retry over the download call.

As a data point, this is what we have done as we do experience issues in production code.