Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.
Hi @jabbera Thanks for the feature request. We will add this to the backlog.
Thanks Ashish
Bump. I've been trying to find a way for the Python API to stream, and I haven't found one yet, or maybe I missed it. The chunks() method in the API documentation doesn't have any description, so I'm thankful someone figured this out.
Thanks @jabbera for this request. I too was struggling to figure out this download_blob() functionality.
@amishra-dev Any update when this will be as part of official SDK?
Hi @jabbera and @shahdadpuri-varun, sorry on the delay on this. This feature is already available in the SDK.
Once you call download_blob() you can then call chunks() on the StorageStreamDownloader handle. So for example:
blob_stream = bc.download_blob()
for chunk in blob_stream.chunks():
    # Insert what you want to do with each chunk here.
    pass
I will close this issue for now. Let me know if you have any other questions! We intend to improve our docstrings to highlight this feature.
Hi @jabbera, thanks for the example code, it's really helped me out.
To say "This feature is already available in the SDK" is over selling it a little. So many things in Python expect a streamable or "File Like" object that they can read from. @jabbera summed it up in one of their comments in the example code above: "Treat a StorageStreamDownloader as a forward read only stream". Implementing your own chunking is great but it seems like a cruel joke on developers to not have the StorageStreamDownloader function like a stream. If StorageStreamDownloader had a read() method it could be passed straight into other python libraries far more easily.
For example at the moment if I want to copy files from my Azure blob store to a partner's AWS S3 I would expect to be able to do:
az_blob_stream = azure.download_blob() # This is a StorageStreamDownloader object
aws.upload_fileobj(az_blob_stream)
In reality this requires doing:
az_blob_stream = azure.download_blob()
aws.upload_fileobj(io.BytesIO(az_blob_stream.readall()))
Which loads the entire file into memory and then uploads it to AWS S3. After finding this issue and using the provided example code I can do:
az_blob_stream = azure.download_blob()
real_stream = AzureBlobStream(az_blob_stream)
aws.upload_fileobj(real_stream)
Which is almost (but not quite!) the first example and notably uploads and downloads at the same time, using significantly less memory for large files.
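For readers following along: the AzureBlobStream wrapper referenced above isn't shown in this thread, so here is a minimal sketch of what such a file-like adapter could look like, assuming all the consumer (e.g. boto3's upload_fileobj) needs is a read() method. Treat it as an illustration, not official SDK code.
import io
from azure.storage.blob import StorageStreamDownloader

class AzureBlobStream(io.RawIOBase):
    # Forward-only, read-only file-like wrapper over a StorageStreamDownloader.
    def __init__(self, downloader: StorageStreamDownloader):
        self._chunks = downloader.chunks()  # iterator yielding bytes chunks
        self._buffer = b""                  # leftover bytes from the last chunk

    def readable(self):
        return True

    def read(self, size=-1):
        # size == -1 means "read everything"; otherwise return at most size bytes.
        while size < 0 or len(self._buffer) < size:
            try:
                self._buffer += next(self._chunks)
            except StopIteration:
                break
        if size < 0:
            data, self._buffer = self._buffer, b""
        else:
            data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data
With that in place, aws.upload_fileobj(AzureBlobStream(azure.download_blob())) keeps roughly one chunk (plus the caller's read size) in memory at a time.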
I missed that this was closed. @ollytheninja reiterated my point. This SDK is striving to be as pythonic as possible and that chunks api is about as far from pythonic as possible. StorageStreamDownloader, despite its name, is not a python stream.
@tasherif-msft
Once you call download_blob() you can then call chunks() on the StorageStreamDownloader handle. So for example:
If you look at my sample code, I use the API you described here, so I know it exists. The issue is that StorageStreamDownloader isn't an actual Python stream, so it's useless across 99 percent of the Python I/O ecosystem unless you want to download the entire blob into memory. (Hint: we don't want one of those fancy 200TB blobs you just released sitting in RAM if we are copying it somewhere :-))
@ollytheninja & @jabbera ,
I'm facing the same issue. Function works well as long as filesize remains < App Service Plan RAM size. At ~90% memory utilization, the Function crashes.
Oddly enough, my use case is the same too: Move data from Azure Storage to AWS S3 bucket!
max_single_get_size has been a nice performance improvement; it reduced blob download times.
blob_client = BlobClient.from_blob_url(event.get_json()["blobUrl"], credentials, max_single_get_size=256*1024*1024, max_chunk_get_size=128*1024*1024)
blob_data = blob_client.download_blob().readall()
blob_bytes = io.BytesIO(blob_data)
On the S3 side, this config screams!
config = boto3.s3.transfer.TransferConfig(multipart_threshold=1024*25, max_concurrency=10, multipart_chunksize=1024*25, use_threads=True)
s3.Bucket(s3_bucket).upload_fileobj(blob_bytes, Key=aws_dir, Config=config)
I think you're proposing something like this, no? Passing ("streaming") chunks end to end across the wire?
I'm pretty new to Python so trying to determine from your sample code above whether you were successful or whether this is not possible.
Please enlighten me!
That's basically what my function does. It will basically keep at most 2x chunk size in memory at one time.
Correct, currently it's doing what you illustrated on the first line - while it does the fetching of the file in chunks, it doesn't expose those chunks as a stream, meaning that you cannot process the file in a streaming fashion; it will pull down the entire file before passing it on. The culprit is that readall() call.
This is especially confusing given the name: "StorageStreamDownloader" suggests a stream-like Python object, but what you get back does not behave like one.
Exposing a stream-like object that buffers two chunks and fetches another when the first starts being read means processing a file uses [3 * chunksize] of memory instead of [filesize].
This is not only useful for the example here of transferring files out to another provider but also when (for example) processing frames in a video, searching large log files etc.
If I get some time I'll see about making a pull request and a new issue for this, @tasherif-msft what are your views on reopening this? Or should we raise a new issue?
@jabbera and @ollytheninja ,
Thanks for the continued engagement on the topic, though I'm still a little unclear.
@jabbera , sounds like you are saying that your code does not load the entire file into memory, but rather max 2 chunks and passes them completely down the line (sounds like streaming).
Whereas @ollytheninja , you are saying ...it will pull down the entire file before passing it on.
Is it actually possible to accomplish the second illustration above?
GET chunk(s) --> | process chunk(s) in memory | --> POST chunk(s) to external provider
My code only keeps 1 chunk in a buffer, plus whatever your read size is.
Ok cool. Have you tried / are you able to POST chunks-out as part of say, an upload to S3 without holding the entire file in memory?
No. I use it to stream really large text compressed text files (30-40GB compressed), decompress, and parse into a more usable format.
Could we get some more samples in the docs for iterating over chunks()?
The section is empty at the moment:
How can I increase the buffer size for the chunks? Are there any docs?
I'm facing a similar situation. Thanks to anyone who could fix this :)
I'm shocked this is still open. Native python stream functionality should be core to this library.
Hi all, apologies that this thread has gone quiet for some time. It is true that chunks is still currently our supported way to stream data back without loading it all into memory. There is also readinto available now, which may help in certain scenarios.
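As an illustration of the readinto option (a sketch with placeholder names, not official sample code), it streams the blob's contents into any writable stream without holding the whole blob in memory:
from azure.storage.blob import BlobClient

conn_str = "<storage-connection-string>"  # placeholder
blob_client = BlobClient.from_connection_string(conn_str, container_name="my-container", blob_name="big-file.bin")

with open("big-file.bin", "wb") as local_file:
    # readinto writes the downloaded bytes into the given writable stream chunk by chunk.
    blob_client.download_blob().readinto(local_file)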
@virtualdvid David, you can control the buffer size for chunks using the max_chunk_get_size keyword argument on all client constructors.
https://github.com/Azure/azure-sdk-for-python/blob/073c3e88b679261960e6aa62123d3206524e7478/sdk/storage/azure-storage-blob/azure/storage/blob/_blob_client.py#L130-L131
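For example (a minimal sketch with placeholder names and sizes), the chunk size is set when the client is constructed, not on chunks() itself:
from azure.storage.blob import BlobClient

conn_str = "<storage-connection-string>"  # placeholder
blob_client = BlobClient.from_connection_string(
    conn_str,
    container_name="my-container",
    blob_name="big-file.bin",
    max_single_get_size=4 * 1024 * 1024,  # size of the initial GET
    max_chunk_get_size=4 * 1024 * 1024,   # size of each subsequent chunk
)

for chunk in blob_client.download_blob().chunks():
    print(len(chunk))  # each chunk is a bytes object up to the configured size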
That being said, a little while ago I did start the work to add a proper read method to StorageStreamDownloader but have been too busy to finish it up. I hope to get back to that soon and have it in the next couple of releases. Here is the PR in Draft form: #24275
@jalauzon-msft I am trying to access the blob file, divide it into chunks of 1 KB, and upload them to another folder. For that, I used the download_blob() and then chunks() methods. As you mentioned, in that case it divides the file into 4 MB chunks by default. Instead, I tried to use the "max_chunk_get_size" argument with value 1024 and received the following error:
- TypeError: chunks() got an unexpected keyword argument 'max_chunk_get_size'
Where could I find the appropriate argument for chunk size?
I resolved the issue from the previous comment by adding the "max_chunk_get_size" argument to the BlobServiceClient:
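The exact code was not included above, but a sketch of that approach could look like the following (names are placeholders; note that to get uniform 1 KB chunks, max_single_get_size generally needs to be lowered as well, since the first chunk comes from the initial GET):
from azure.storage.blob import BlobServiceClient

conn_str = "<storage-connection-string>"  # placeholder
service_client = BlobServiceClient.from_connection_string(
    conn_str,
    max_single_get_size=1024,  # size of the initial download
    max_chunk_get_size=1024,   # size of each subsequent chunk
)
source = service_client.get_blob_client("my-container", "source.txt")

for i, chunk in enumerate(source.download_blob().chunks()):
    # Upload each 1 KB piece as its own blob under a different prefix ("folder").
    piece = service_client.get_blob_client("my-container", f"pieces/part-{i:05d}")
    piece.upload_blob(chunk, overwrite=True)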
I'm shocked this is still open. Native python stream functionality should be core to this library.
Could someone update us on the status of this issue?
Hi @ericthomas1, #24275 was recently merged, which added a standard read method to the StorageStreamDownloader class. This will allow you to read an arbitrarily sized chunk of data from the downloader so the data can be streamed in a more Pythonic way. This is currently released in our latest beta version, 12.14.0b1. The plan is for this to be in our next full release, which is tentatively scheduled for early this month.
In the meantime, or as an alternative, the chunks API, which exists today, can be used to stream data back. See this sample for how that can be done. Thanks.
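To sketch what the new read method enables for the S3 scenario discussed above (placeholder names, not official sample code): boto3's upload_fileobj only needs an object with read(), which the downloader now provides.
import boto3
from azure.storage.blob import BlobClient

conn_str = "<storage-connection-string>"  # placeholder
blob_client = BlobClient.from_connection_string(conn_str, container_name="my-container", blob_name="big-file.bin")
downloader = blob_client.download_blob()  # StorageStreamDownloader exposing read(size)

s3 = boto3.resource("s3")
# boto3 reads the downloader in chunks, so the whole blob never sits in memory at once.
s3.Bucket("my-aws-bucket").upload_fileobj(downloader, Key="big-file.bin")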
Is your feature request related to a problem? Please describe.
I have large gz files I need to stream (10-50GB). I don't want to (and don't have the memory to) download the blob into memory first. gz is a streaming format, so I only need chunks at a time.
Describe the solution you'd like
Something like this. Note the AzureBlobStream implementation that only keeps 1 chunk in memory at a time. It would be nice if StorageStreamDownloader just acted like a stream and behaved this way.
Describe alternatives you've considered
Downloading tens of GB into memory.
Additional context
None.
Edit: updated code to work....
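To illustrate the gz use case under the same assumptions as the earlier sketches (a file-like wrapper such as AzureBlobStream, or a downloader that exposes read()), decompress-on-the-fly could look roughly like this; parse_line is a hypothetical placeholder for the per-line processing:
import gzip

downloader = blob_client.download_blob()  # client constructed as in the earlier sketches
# gzip can decompress incrementally from any readable file-like object.
with gzip.GzipFile(fileobj=AzureBlobStream(downloader)) as gz:
    for line in gz:
        parse_line(line)  # hypothetical per-line processing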