olegrog closed this issue 10 years ago.
:+1: I really hate that ever since I switched to the AWS CLI I have had to deal with temporary files. Using mkfifo is only a workaround; streaming files in and out should be natively supported.
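A minimal sketch of the kind of named-pipe workaround I mean, for the download direction (my-bucket, big-input.log and my_consumer are made-up names; the low-level get-object call is used because it should write the object out sequentially):
mkfifo /tmp/s3pipe
my_consumer < /tmp/s3pipe &     # start the reader first so opening the FIFO for writing does not block
aws s3api get-object --bucket my-bucket --key big-input.log /tmp/s3pipe
wait
rm /tmp/s3pipe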
Thanks for the feedback, we'll certainly be looking into this. I think this would be a great feature to have. It is going to require some internal restructuring of our code with respect to the assumptions we make when uploading multiple parts in parallel, but it should be doable.
Thank you for the detailed answer. Looking forward to this feature.
+1. Any update?
:+1: this would be a great addition to the library.
+1
+1. Any update?
I am currently working on an implementation that is able to perform streaming uploads and downloads (with multipart and multithreaded capabilities). I am hoping to get a pull request sent in 1 to 3 weeks.
Feature added via https://github.com/aws/aws-cli/pull/903. Closing the issue.
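Rough usage, assuming the - convention for stdin and stdout (bucket, key, and program names below are placeholders):
my_producer | aws s3 cp - s3://my-bucket/output.dat    # upload from a pipe
aws s3 cp s3://my-bucket/input.dat - | my_consumer     # download to a pipe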
It looks like that feature covers single files. I was hoping to have streaming for entire subdirectories. Should another request be opened?
How would you stream a directory? The only way to have multiple files in a single stream is to bundle them in something like a tarball. Otherwise how will you differentiate the end of one file from the beginning of the next?
Or what would your use case be if you do not mind me asking?
That's right. For my use case multiple computers dump output into multiple files, but the data is text with newlines -- I don't need any way of differentiating where one file ends and the next begins. Think Hive or certain types of ETL processing. Sure, they could all be part of one giant file, but kept separate, the data can be subset much more easily.
Makes sense. You are correct, it only covers single files. To get the initial S3 streaming implementation into the CLI, we decided to handle only single-file cp (both multithreaded and multipart) for now. Since it is not possible to stream multiple files into S3, this would be a feature only for downloading S3 objects as a stream, correct? I suggest you open a new issue for the feature, and we will look into it.
I know I'm nobody special here, but I would vote against such a feature. I think this is an edge case which very few people would want, and it can be easily implemented via the shell.
( for file in $(aws s3 ls s3://bucket/ | awk '{ print $NF }'); do aws s3 cp "s3://bucket/$file" -; done ) | program_that_reads_the_input
You can also add --recursive to the ls command to recurse an entire directory, which would emulate the --recursive parameter for cp.
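Concretely, the recursive variant would look something like this (it assumes keys contain no spaces, since awk splits on whitespace):
( for key in $(aws s3 ls --recursive s3://bucket/prefix/ | awk '{ print $NF }'); do aws s3 cp "s3://bucket/$key" -; done ) | program_that_reads_the_input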
I just tested this in EC2 and the upload uses a massive amount of memory. For instance, when uploading a 9 GB file it reads almost the entire stream into memory, topping out around 6.5 GB of real memory usage. My guess is that the queue limiting is not working correctly. I can open a separate issue though.
It's great to see this in the aws-cli finally, though. It's an insanely useful feature, and the lack of it is the reason I wrote https://github.com/rlmcpherson/s3gof3r in the first place.
Hello. Are you planning to provide the ability to stream I/O for S3 files? For example, I want to process a very big file and upload the result back to S3 storage. Something like this:
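(Purely illustrative; my-bucket, the key names, and my_filter stand in for the real data and processing step, and the - syntax is just the kind of stdin/stdout convention I have in mind.)
aws s3 cp s3://my-bucket/big-input.log - | my_filter | aws s3 cp - s3://my-bucket/big-output.log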
Downloading the file completely, and then processing it completely as well, takes a lot of time and disk space. As far as I know, aws s3api get-object can help provide streaming on the download side, but multipart upload accepts only blob objects, not pipes.
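Roughly, the best I can do today is sketched below (my-bucket, the key names, my_filter, and the 100 MB chunk size are placeholders; error handling is omitted). The download half can stream through a pipe, but the upload half has to detour through temporary chunk files and the low-level multipart calls:
# Download side: get-object writes the object out sequentially, so it should be able to feed a pipe.
aws s3api get-object --bucket my-bucket --key big-input.log /dev/stdout |
  my_filter |
  {
    # Upload side: upload-part only takes a blob (--body pointing at a real file),
    # so the stream is cut into 100 MB chunks on disk (any size >= 5 MB works,
    # except for the final part).
    upload_id=$(aws s3api create-multipart-upload --bucket my-bucket \
      --key big-output.log --query UploadId --output text)
    part=1
    parts=""
    while dd of=/tmp/chunk bs=1M count=100 iflag=fullblock 2>/dev/null   # GNU dd
          [ -s /tmp/chunk ]                                              # stop at end of stream
    do
      # In text output the ETag already carries its surrounding quotes,
      # so it can be dropped straight into the JSON below.
      etag=$(aws s3api upload-part --bucket my-bucket --key big-output.log \
        --upload-id "$upload_id" --part-number "$part" \
        --body /tmp/chunk --query ETag --output text)
      parts="$parts{\"ETag\":$etag,\"PartNumber\":$part},"
      part=$((part + 1))
    done
    aws s3api complete-multipart-upload --bucket my-bucket --key big-output.log \
      --upload-id "$upload_id" --multipart-upload "{\"Parts\":[${parts%,}]}"
  }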