Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.
MIT License

BlobStorageClient: Failed to upload blob exception Connection timeout to host https://<storage>.blob.core.windows.net/<container>/<blob_name>/<file> #36638

Closed mani3887 closed 1 month ago

mani3887 commented 2 months ago

Describe the bug
We have deployed our application in containers/pods in our self-managed K8s clusters running on Ubuntu VMSS. We have enabled managed identity for the storage account resources on these VMSS and use the command below to retrieve the credential:

```python
cred = ManagedIdentityCredential(client_id=os.getenv('managed_identity_client_id'), logging_enable=True)
```

We have granted this managed identity permissions on the storage account. Our application is deployed across 32 pods, uploading 40K+ files to the storage account (blob storage) in 1-2 hours and inserting into table storage using these credentials. For the first hour of this flow, the uploads work as intended.

We see 2 exceptions after that:

  1. BlobStorageClient: Failed to upload blob exception Connection timeout to host https://<storage>.blob.core.windows.net/<container>/<blob_name>/<file>

In our code, we create the service client in the blob storage class:

```python
self.blob_service_client = BlobServiceClient(self.__blob_storage_account_url, credential=cred)
```

And we have the below code to upload the blobs:

```python
async def upload_blob(self, container_name, blob_name, config, metadict):
    blob_content_details = None
    try:
        async with self.blob_service_client.get_blob_client(container=container_name, blob=blob_name) as blob_client:
            if metadict is not None:
                metadict = {str(k): (str(v) if v is not None else '') for k, v in metadict.items()}
            else:
                return None
            blob_content_details = await blob_client.upload_blob(data=config, overwrite=True, metadata=metadict)
            self.logger.debug("BlobStorageClient: Successfully uploaded blob for {} {}".format(container_name, blob_name))
            return blob_content_details['version_id']
    except Exception as e:
        self.logger.error("BlobStorageClient: Failed to upload blob {} with blob details {} with exception {}".format(blob_name, blob_content_details, e))
        return None
```
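As an aside, the metadata-normalization step matters because blob metadata keys and values must be strings. It can be pulled out into a small standalone helper (hypothetical name `normalize_metadata`, pure Python, no SDK dependency):

```python
def normalize_metadata(metadict):
    """Coerce metadata keys to str and values to str, mapping None
    values to empty strings, as in the upload code above."""
    if metadict is None:
        return None
    return {str(k): (str(v) if v is not None else '') for k, v in metadict.items()}
```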

  2. BlobStorageClient: Failed to upload blob with exception Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. RequestId:5c5f3f49-301e-00a0-196e-def57c000000 Time:2024-07-25T08:38:55.7432808Z ErrorCode:AuthenticationFailed authenticationerrordetail:Request date header too old: 'Thu, 25 Jul 2024 08:07:10 GMT'

We have another issue where we are unable to retrieve the managed identity credential, and we suspect this particular exception occurs as a result of that. Is our understanding correct for this case? We have created a separate bug (https://github.com/Azure/azure-sdk-for-python/issues/36637) for that issue.

To Reproduce Steps to reproduce the behavior:

  1. Deploy 32 pods of this application. Continuously upload 40K+ files over 1-2 hours to Azure blob storage. The first 40 mins work fine, but after that it throws this error: BlobStorageClient: Failed to upload blob exception Connection timeout to host https://<storage>.blob.core.windows.net/<container>/<blob_name>/<file>

Expected behavior Blob uploads should work as expected.

github-actions[bot] commented 2 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jalauzon-msft @vincenttran-msft.

kashifkhan commented 2 months ago

Thank you for the feedback @mani3887. We will investigate and get back to you asap.

mani3887 commented 2 months ago

I fixed the managed identity credential issue but I still receive this error:

BlobStorageClient: Failed to upload blob with blob details None with exception Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. RequestId:453b5ab0-301e-00a0-366e-dff57c000000 Time:2024-07-26T15:15:25.5550530Z ErrorCode:AuthenticationFailed authenticationerrordetail:Request date header too old: 'Fri, 26 Jul 2024 14:59:37 GMT'

Is there a way to fix this issue?

mani3887 commented 1 month ago

Hi, is there a fix for this issue? I am seeing a lot of connection timeout errors for both blob and table storage in the same storage account. I have also attached the Azure logs. I am using a singleton object for both the table storage and blob storage clients. Any pointers on why I could be seeing this issue would really help!

jalauzon-msft commented 1 month ago

Hi @mani3887, these types of errors, both the network ones and the Request date header too old, are usually caused by an overloaded client machine, which leads to network timeouts and slowness. I would look into the health of your client (network usage, CPU, memory) under peak load to see what may be causing the issue.

The Request date header too old errors are caused by the service receiving a request that is more than 15 minutes old, which it rejects. The SDK adds the Date header to the request at the time it is created, so network timeouts/slowness and retries can cause a request to become stale. It seems like you are using the correct pattern for the SDK implementation (a singleton client, etc.), so it is likely an issue with your client.
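The 15-minute staleness window described above can be illustrated with plain datetime arithmetic. The helper below (`is_stale` is a hypothetical name, mirroring the service-side check, not SDK code) reproduces the rejection seen in this thread:

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# The service rejects requests whose Date header is older than 15 minutes.
MAX_AGE = timedelta(minutes=15)

def is_stale(date_header: str, now: datetime) -> bool:
    """Return True if an RFC 1123 Date header is older than MAX_AGE
    relative to `now` (the time the service received the request)."""
    return now - parsedate_to_datetime(date_header) > MAX_AGE

# The first error above: request dated 08:07:10, received at 08:38:55,
# i.e. ~31 minutes in flight (queuing plus retries), so it is rejected.
sent = 'Thu, 25 Jul 2024 08:07:10 GMT'
received = datetime(2024, 7, 25, 8, 38, 55, tzinfo=timezone.utc)
```

Here `is_stale(sent, received)` is `True`, which is exactly why a request delayed by timeouts and retries fails with AuthenticationFailed even though the credential itself is valid.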

mani3887 commented 1 month ago

Thanks for replying in this thread. I see a lot of "Connection timeout to host" errors. I suspect this issue is happening because we get the blob client first and then call the upload operation. We basically only want to upload the stream of data to the blob. We have 20K blobs.

Currently this is what we have in our code (excerpt):

```python
try:
    async with self.blob_service_client.get_blob_client(container=container_name, blob=blob_name) as blob_client:
        blob_content_details = await blob_client.upload_blob(data=config, overwrite=True, metadata=metadict)
```

I see a lot of calls are going to get_blob_client on the Azure storage blob metric. My suspicion is that the connection is timing out because this happens regularly.

In a different sub process, we are also intensively using azure table storage.

We are running 20 pods in a K8s cluster running FastAPI; each pod receives around 10 requests at most at a given time, and for each of these requests we make 104 calls to table storage and 104 to blob storage. I see "connection timeout on Peer" errors, so I was thinking some throttling was happening. I thought of uploading the content directly to the blob instead of getting the blob client and then uploading, so that we can reduce the calls to get_blob_client for each request. Your thoughts?
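The "upload directly" idea can be sketched with `ContainerClient.upload_blob`, which accepts the blob name directly, combined with a per-container client cache so repeated uploads reuse one client object. This is a sketch, not the thread's confirmed fix; `container_clients` and `upload_stream` are hypothetical names, and note that `get_blob_client` itself only constructs a local object rather than issuing a service call:

```python
# Hypothetical cache: one ContainerClient per container name, so every
# upload to the same container shares a single client (and its
# underlying connection pool) instead of building a per-blob client.
container_clients = {}

def get_container_client(service_client, container_name):
    # service_client is an aio BlobServiceClient; get_container_client
    # is a local constructor, but caching still avoids object churn.
    if container_name not in container_clients:
        container_clients[container_name] = service_client.get_container_client(container_name)
    return container_clients[container_name]

async def upload_stream(service_client, container_name, blob_name, data, metadata=None):
    # ContainerClient.upload_blob takes the blob name directly, so no
    # intermediate BlobClient is needed for a plain upload.
    container_client = get_container_client(service_client, container_name)
    return await container_client.upload_blob(
        name=blob_name, data=data, overwrite=True, metadata=metadata
    )
```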

@jalauzon-msft

mani3887 commented 1 month ago

Thanks @jalauzon-msft for our offline conversation. I finally got it working by doing the below things:

  1. Increasing the connection pool to 50 for both table and blob storage:

```python
from azure.core.pipeline.transport import AioHttpTransport

transport = AioHttpTransport(connection_limit=50)
self.blob_service_client = BlobServiceClient(self.__blob_storage_account_url, credential=cred, connection_timeout=30, transport=transport)
```

  2. Retrying 3 times for both table and blob storage.

These 2 steps reduced the connection issues significantly.

But I still saw a few connection timeout issues, so I increased the memory and CPU limits for my container.
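Putting the two steps together, a minimal sketch of the resulting client construction (assuming `account_url` and `cred` as in the earlier snippets; `retry_total` is the storage client's keyword argument for capping retries, and is an assumption about how step 2 was wired in):

```python
# Sketch only: combines the thread's working configuration, with a
# larger aiohttp connection pool plus bounded timeouts and retries.
from azure.core.pipeline.transport import AioHttpTransport
from azure.storage.blob.aio import BlobServiceClient

transport = AioHttpTransport(connection_limit=50)  # step 1: pool of 50

blob_service_client = BlobServiceClient(
    account_url,
    credential=cred,
    connection_timeout=30,  # fail connection attempts fast
    retry_total=3,          # step 2: cap retries at 3
    transport=transport,
)
```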