Azure / azure-storage-cpp

Microsoft Azure Storage Client Library for C++
http://azure.github.io/azure-storage-cpp
Apache License 2.0

Azure Blob storage throttling upload blobs #394

Open ajs97 opened 3 years ago

ajs97 commented 3 years ago

Hi, I am facing performance issues while uploading blobs to Azure Blob Storage using this SDK. I am uploading 64 MB blobs and experimenting with various values of parallelism_factor (4/8/16). When I upload around 1 GB of data with parallelism 8 or 16, I get around 110 MB/s, but when I increase the total to about 5 GB, the overall throughput drops to around 50 MB/s. Checking the intermediate throughput, I see that the initial few blobs reach 80-90 MB/s, but subsequent blobs drop to 40-50 MB/s, and sometimes as low as 20 MB/s.

Note that I am uploading these blobs sequentially.

Do you know what could cause throughput to vary with the total amount uploaded, and whether there is some configuration that would give better throughput for large uploads?

Note that for my use case it is important to upload data in 64 MB blobs, and the total amount of data uploaded will be in the tens of GBs, so I would like to optimize for this case. Thanks.
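For reference, the upload parallelism described above is controlled in this library through blob_request_options. Below is a minimal, untested sketch of how parallelism_factor and the block size can be set for a single 64 MB blob upload; the connection string, container name, and file path are placeholders, not values from this thread:

```cpp
// Sketch only: configure upload parallelism in azure-storage-cpp.
// <connection-string>, "mycontainer", "blob_0", and "data.bin" are placeholders.
#include <was/storage_account.h>
#include <was/blob.h>

int main()
{
    auto account = azure::storage::cloud_storage_account::parse(
        U("<connection-string>"));
    auto client = account.create_cloud_blob_client();
    auto container = client.get_container_reference(U("mycontainer"));

    azure::storage::blob_request_options options;
    // Number of blocks uploaded in parallel for a single blob.
    options.set_parallelism_factor(8);
    // Block size: 4 MB blocks -> a 64 MB blob is split into 16 blocks.
    options.set_stream_write_size_in_bytes(4 * 1024 * 1024);

    auto blob = container.get_block_blob_reference(U("blob_0"));
    blob.upload_from_file(U("data.bin"),
                          azure::storage::access_condition(),
                          options,
                          azure::storage::operation_context());
    return 0;
}
```

This only tunes per-blob parallelism; since the blobs are uploaded sequentially, overall throughput is still bounded by one blob's transfer at a time.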

ljluestc commented 10 months ago

```python
import asyncio

from azure.storage.blob import BlobType
from azure.storage.blob.aio import BlobServiceClient


async def upload_blob(blob_service_client, container_name, blob_name, data):
    container_client = blob_service_client.get_container_client(container_name)
    blob_client = container_client.get_blob_client(blob_name)
    await blob_client.upload_blob(data, blob_type=BlobType.BlockBlob, overwrite=True)


async def main():
    connection_string = ""  # your storage connection string
    container_name = ""     # your container name
    total_size = 5 * 1024 * 1024 * 1024  # 5 GB
    blob_size = 64 * 1024 * 1024         # 64 MB

    async with BlobServiceClient.from_connection_string(connection_string) as blob_service_client:
        tasks = []
        for i in range(total_size // blob_size):
            blob_name = f"blob_{i}"
            data = b"Your 64MB data here"  # Replace with your actual data
            tasks.append(asyncio.create_task(
                upload_blob(blob_service_client, container_name, blob_name, data)))
        await asyncio.gather(*tasks)


if __name__ == "__main__":
    asyncio.run(main())
```