Azure / azure-storage-python

Microsoft Azure Storage Library for Python
https://azure-storage.readthedocs.io
MIT License

create_blob_from_bytes is very slow to upload files #533

Closed sushilkm closed 5 years ago

sushilkm commented 5 years ago

Which service(blob, file, queue) does this issue concern?

blob

Which version of the SDK was used? Please provide the output of pip freeze.

azure-storage-blob==1.4.0 azure-storage-common==1.4.0

What problem was encountered?

create_blob_from_bytes uploads files to Azure Storage very slowly. I compared it against azcopy using a very simple script that uploads a file (size: 243K). Code using the Python SDK: https://gist.github.com/sushilkm/72370e38c6dfc5e4129b77c5cdd72a26

Results for the Python SDK: the first upload takes a very long time, and subsequent uploads take less but still a lot more time than they should (~150ms). Results are at https://gist.github.com/sushilkm/947203347a3b1a8d7b131f59e6f635a7 I am using a fairly recent Python, 3.6.5.

Timings for azcopy are at https://gist.github.com/sushilkm/2a2c8922d46c2a628a960e78e296d8ec

Have you found a mitigation/solution?

No, but I found that create_blob_from_bytes uses create_blob_from_stream: create_blob_from_bytes wraps the bytes in a BytesIO stream and passes it on, then create_blob_from_stream reads the bytes back from the stream and sends them to _put_blob. So time is wasted creating a stream from bytes and then reading the bytes back out of it. Removing this round trip could help improve timings.
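To put a rough number on that round trip, the sketch below times wrapping a 243K payload in a BytesIO stream and reading it back, using only the standard library. It does not call the SDK; `round_trip` and `direct` are hypothetical helpers that only imitate the bytes-to-stream-to-bytes pattern described above, and the zero-filled payload is an assumption standing in for the real file.

```python
import io
import timeit

# 243K of zero bytes, standing in for the file from the report (assumption).
payload = bytes(243 * 1024)


def round_trip(data: bytes) -> bytes:
    """Wrap the bytes in a BytesIO stream, then read them back out."""
    stream = io.BytesIO(data)
    return stream.read()


def direct(data: bytes) -> bytes:
    """Baseline: hand the bytes over without touching a stream."""
    return data


runs = 1000
round_trip_s = timeit.timeit(lambda: round_trip(payload), number=runs) / runs
direct_s = timeit.timeit(lambda: direct(payload), number=runs) / runs

print("stream round trip: {:.3f} ms per call".format(round_trip_s * 1000))
print("direct hand-off:   {:.3f} ms per call".format(direct_s * 1000))
```

Comparing the printed per-call cost with the upload latencies in the gists shows whether the extra copy is large enough to matter, or whether the time is going somewhere else (for example, connection setup).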

zezha-msft commented 5 years ago

Hi @sushilkm, thanks for reaching out!

I'm just a bit confused, how do you know that AzCopy is faster? The result says:

Elapsed time: 00.00:00:00

sushilkm commented 5 years ago

Yeah, my bad, I did not get the milliseconds with azcopy, so I ran it again.

I collected more numbers for azcopy, this time measuring execution time from the shell. https://gist.github.com/sushilkm/7e49c87ac6bc3d9ece8abd056119b8b3 shows that when azcopy copies a single file (I think it opens a new connection every time), it takes well below the 3 seconds the Python SDK takes.

Then I did another exercise, uploading 5001 files of 243K each from a directory, and the results are fairly fast: https://gist.github.com/sushilkm/c3a6c6a384219d61f65f1f1fa07ad694

zezha-msft commented 5 years ago

Hi @sushilkm, thanks for the additional information. And we appreciate your feedback!

It's interesting that the Python SDK's latency decreased dramatically after the first run, while AzCopy's results are quite consistent. Is it possible that the first run was just an anomaly?

In terms of total time, the Python SDK took about 150ms while AzCopyV8 took ~900ms. So the Python SDK is faster in general? Am I misunderstanding something?

On a side note, my colleagues are currently working on a performance testing framework to benchmark the different SDKs and tools that support Azure Storage. We'll soon be able to compare the Python SDK against the others in a consistent way. I'll make sure to update this thread when results are available.

sushilkm commented 5 years ago

Thanks @zezha-msft for the update. I retried the Python SDK at intervals of around an hour, and the results were similar, both for the Python SDK (the first hit would take super long) and for azcopy.
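One way to keep the slow first hit separate from the later uploads when collecting these numbers is a small timing harness like the sketch below. `time_calls`, `summarize`, and `stand_in_upload` are hypothetical names introduced for illustration; the stand-in does no network I/O and would need to be replaced with the real upload call.

```python
import statistics
import time


def time_calls(func, repeats=5):
    """Call func() `repeats` times and return each call's duration in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Report the first call separately from the median of the remaining calls."""
    first, rest = timings[0], timings[1:]
    return {
        "first_s": first,
        "median_rest_s": statistics.median(rest) if rest else None,
    }


def stand_in_upload():
    """Placeholder for the real upload (assumption). With the SDK from this
    thread it would be something like
    block_blob_service.create_blob_from_bytes(container, blob_name, data),
    where container, blob_name and data are placeholders."""
    bytes(243 * 1024)


print(summarize(time_calls(stand_in_upload, repeats=5)))
```

Reporting the first call on its own makes it easier to tell a one-time cost from a steady-state per-upload cost.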