Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.
MIT License

Frequent ConnectionResetErrors using start_copy_from_url() #21870

Closed blawrence-datadx closed 9 months ago

blawrence-datadx commented 2 years ago

**Describe the bug**

I have a Python Azure Function that imports `BlobServiceClient` from `azure.storage.blob` and uses `start_copy_from_url()` to move blob files from a Standard/Hot StorageV2 container to a Standard/Cool StorageV2 container multiple times a day. Most of the time this works fine, but we have been getting more and more `ConnectionResetError`s every week. This issue used to happen once a month, but now happens 3-5 times a week.

We created a support ticket with Azure support and have made no progress for over two months. We have tried making sure the TLS settings match between our storage account and Azure Function and updating our extension bundles, but nothing has changed the behavior. We're not sure how it's possible to get a `ConnectionResetError` when the function app is hosted on Azure servers.

**To Reproduce**

Steps to reproduce the behavior:

  1. The process is intermittent. We are unable to reproduce the error locally.

**Expected behavior**

Files are archived from one storage account to the other with no errors.

**Additional context**

Example stack trace of the exception:

exception: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Traceback (most recent call last):
  File "/home/site/wwwroot/Codebases/QBO/etl_qbo_azure_functions.py", line 66, in main
    archived_files = archive.archive_blobs()
  File "/home/site/wwwroot/Codebases/shared_code/helpers.py", line 548, in archive_blobs
    copied_blob.start_copy_from_url(blob_to_copy_url)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/tracing/decorator.py", line 83, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/blob/_blob_client.py", line 2079, in start_copy_from_url
    return self._client.blob.start_copy_from_url(**options)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/blob/_generated/operations/_blob_operations.py", line 2552, in start_copy_from_url
    pipeline_response = self._client._pipeline.run(request, stream=False, **kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py", line 211, in run
    return first_node.send(pipeline_request)  # type: ignore
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  [Previous line repeated 2 more times]
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/policies/_redirect.py", line 158, in send
    response = self.next.send(request)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/blob/_shared/policies.py", line 515, in send
    raise err
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/blob/_shared/policies.py", line 489, in send
    response = self.next.send(request)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  [Previous line repeated 1 more time]
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/blob/_shared/policies.py", line 290, in send
    response = self.next.send(request)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py", line 103, in send
    self._sender.send(request.http_request, **request.context.options),
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/blob/_shared/base_client.py", line 333, in send
    return self._transport.send(request, **kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/transport/_requests_basic.py", line 330, in send
    raise error
azure.core.exceptions.ServiceResponseError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

xiangyan99 commented 2 years ago

Thanks for the feedback, we’ll investigate asap.

ghost commented 2 years ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.

Issue Details
- **Package Name**: azure-storage-blob
- **Package Version**: 12.9.0
- **Operating System**: Linux
- **Python Version**: 3.8.12
Author: blawrence-datadx
Assignees: tasherif-msft
Labels: `bug`, `Storage`, `Service Attention`, `Client`, `customer-reported`, `needs-team-attention`
Milestone: -
blawrence-datadx commented 2 years ago

Hi @tasherif-msft, have you been able to look into this issue? It has started happening more and more frequently. I can give you example function invocation IDs if that would be helpful.

tasherif-msft commented 2 years ago

Hi @blawrence-datadx, thanks for opening the issue. This is a known issue that we are investigating, and we are having discussions across languages about what the best solution is. I will update you once I have more information!

blawrence-datadx commented 2 years ago

Hi @tasherif-msft, this issue now happens more and more frequently, on a daily basis, and not just from `start_copy_from_url()`. This is making the use of Azure Functions unsustainable for us, and we're having to rely on retries, which is not a clean solution. Please let us know if any progress has been made.

jalauzon-msft commented 2 years ago

Hi @blawrence-datadx, Connection resets come from the Storage service. The service team recommends retries as the solution for these types of intermittent failures so that is the best that can be done from the client side. We are aware that some of these errors are not automatically retried from the Storage SDK itself, and we are working to address that.

If you already have retry logic in place and are still experiencing resets or are concerned about the number of resets, I recommend pushing on the Support ticket you have opened or opening another ticket. The service team is really the one that should be looking into the high number of resets.

blawrence-datadx commented 2 years ago

Thank you @jalauzon-msft. I'm afraid we cancelled our support service because after 3 months the service team had made no progress. Previously the connection reset errors were only happening on `start_copy_from_url()` calls, but now they are happening at random places throughout the code. So the best solution is really to wrap every single blob storage call with retry logic?
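For readers wondering what "wrapping a call with retry logic" could look like in practice, here is a minimal sketch using only the standard library. It is not the SDK's built-in retry policy and not the reporter's code: the decorator name `retry_on_connection_error`, its parameters, and the `flaky_copy` stand-in are all hypothetical, chosen for illustration.

```python
import functools
import random
import time


def retry_on_connection_error(max_attempts=4, base_delay=1.0,
                              retry_on=(ConnectionError,), sleep=time.sleep):
    """Retry the wrapped callable with exponential backoff and a little jitter.

    All names and defaults here are illustrative assumptions.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # Wait 1x, 2x, 4x, ... the base delay, plus up to 10% jitter.
                    delay = base_delay * 2 ** (attempt - 1)
                    sleep(delay + random.uniform(0, delay * 0.1))
        return wrapper
    return decorator


# Hypothetical stand-in for a storage call: fails twice, then succeeds.
attempts = []


@retry_on_connection_error(max_attempts=4, base_delay=0.01)
def flaky_copy():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionResetError(104, "Connection reset by peer")
    return "copy started"


result = flaky_copy()
```

Note that in the stack trace above the error surfaces as `azure.core.exceptions.ServiceResponseError` rather than a bare `ConnectionResetError`, so with the real SDK the `retry_on` tuple would presumably need to include that exception type.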

amishra-dev commented 2 years ago

@blawrence-datadx can you share your storage account name and the last time period when this happened?

blawrence-datadx commented 2 years ago

Hi @amishra-dev, the storage account name is datadxdatalake, and the last time it happened was Monday 2/14 at 8:41pm PST.

jalauzon-msft commented 2 years ago

Hi @blawrence-datadx, the SDK does have built-in retry logic that should be automatically retrying connection reset errors (with the exception of a potential known issue with download_blob()). We apply the following policy to all clients by default: https://github.com/Azure/azure-sdk-for-python/blob/835adb397badbf4ac08c3bb8fffcb24abb075ded/sdk/storage/azure-storage-blob/azure/storage/blob/_shared/policies.py#L536

Ideally these retries should be enough and you should not have to add your own retry logic for every operation. That being said, we have received a number of reports of Connection Reset errors despite this retry logic. We are currently investigating why these retries do not seem to be sufficient and are considering changing the default values.
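Until any defaults change, the retry counts can be adjusted when the client is created. The sketch below assumes the `retry_total`, `retry_connect`, `retry_read`, and `retry_status` keyword arguments that the storage clients pass through to their retry policy; the specific values and the helper name `build_blob_service_client` are illustrative assumptions, not recommendations from the SDK team.

```python
# Keyword arguments forwarded to the storage client's retry policy.
# The values are illustrative assumptions, not recommended settings.
RETRY_SETTINGS = {
    "retry_total": 6,    # overall cap on retries per request
    "retry_connect": 4,  # retries for connection-related errors
    "retry_read": 4,     # retries for errors while reading the response
    "retry_status": 4,   # retries for retryable HTTP status codes
}


def build_blob_service_client(connection_string, retry_settings=None):
    """Create a BlobServiceClient with custom retry settings (hypothetical helper)."""
    # Imported lazily so this sketch loads even where the SDK is not installed.
    from azure.storage.blob import BlobServiceClient

    settings = dict(RETRY_SETTINGS if retry_settings is None else retry_settings)
    return BlobServiceClient.from_connection_string(connection_string, **settings)
```

A fully custom policy can also be supplied through the `retry_policy` keyword, for example an `ExponentialRetry` instance from `azure.storage.blob`. Check the documentation for your installed package version before relying on either option.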

kiran-jayaram commented 1 year ago

Hello, I have a similar issue: it doesn't show any errors, but the copy doesn't occur. Can you confirm if this issue is resolved?
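One thing worth checking in a case like this: `start_copy_from_url()` only starts the copy, and the service may finish it asynchronously, so a call that returns without error does not by itself mean the data has arrived. A minimal sketch of polling for the final state follows. `wait_for_copy` and its parameters are hypothetical names, and the fake status sequence stands in for the real service.

```python
import time


def wait_for_copy(get_status, timeout=300.0, interval=5.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until the copy leaves the 'pending' state.

    Returns the final status; raises TimeoutError if it is still pending
    once the timeout has elapsed. Hypothetical helper for illustration.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status != "pending":
            return status  # e.g. "success", "failed", or "aborted"
        if clock() >= deadline:
            raise TimeoutError("copy still pending after %.0f seconds" % timeout)
        sleep(interval)


# Stand-in for the service: reports pending twice, then success.
fake_statuses = iter(["pending", "pending", "success"])
final_status = wait_for_copy(lambda: next(fake_statuses), interval=0.01)
```

With a real destination `BlobClient` (called `dest_blob` here as an assumption), the status callable could be `lambda: dest_blob.get_blob_properties().copy.status`. A status of `None` would suggest the blob was never the target of a copy operation.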

jalauzon-msft commented 9 months ago

Closing out this old issue.

Over the past couple of releases we have made a couple of improvements in this area to ensure connection reset errors are retried automatically, and the service team has made some improvements on their end as well. Ultimately, connection reset errors are going to occur from time to time when working with a service as large as Azure Storage. The automatic retries built into the SDK should mitigate most of the issues these errors could cause, but if you see consistent connection reset errors on Storage calls, you may need to look for other causes, such as exhausted client networking resources or high client load. If everything still looks good on the client side, the best thing to do is to open a Support ticket so the service team can take a look.