aws / aws-cli

Universal Command Line Interface for Amazon Web Services

Streaming files in S3 #410

Closed: olegrog closed this issue 10 years ago

olegrog commented 11 years ago

Hello. Are you planning to provide the ability to stream I/O for S3 files? For example, I want to process a very big file and upload the result back to S3 storage. Something like this:

$ aws s3 get s3://bucket/foo | process.sh | aws s3 put s3://bucket/bar

It takes a lot of time and disk space to download the file completely and only then process it. As far as I know, aws s3api get-object can help provide streaming:

$ mkfifo fifo
$ aws s3api get-object --bucket bucket --key foo fifo
$ cat fifo | process.sh

But multipart upload accepts only blob arguments, not pipes.
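
For reference, a sketch of the fifo workaround run in a single shell (bucket, key, and process.sh are placeholders): get-object blocks until a reader opens the fifo, so it has to go in the background.

$ mkfifo fifo
$ aws s3api get-object --bucket bucket --key foo fifo &   # blocks until a reader attaches to the fifo
$ process.sh < fifo > result.txt                          # consumes the object body as it streams in
$ rm fifo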

nikolay commented 11 years ago

:+1: I really hate that ever since I switched to the AWS CLI I have had to deal with temporary files. Using mkfifo is only a workaround; streaming files in and out should be natively supported.

jamesls commented 11 years ago

Thanks for the feedback, we'll certainly be looking into this. I think this would be a great feature to have. It is going to require some internal restructuring of our code with respect to the assumptions we make when uploading multiple parts in parallel, but it should be doable.

olegrog commented 11 years ago

Thank you for the detailed answer. Looking forward to this feature.

lloyd commented 10 years ago

+1. Any update?

mjallday commented 10 years ago

:+1: this would be a great addition to the library.

vickrum commented 10 years ago

+1

alotia commented 10 years ago

+1. Any update?

kyleknap commented 10 years ago

I am currently working on an implementation that is able to perform streaming uploads and downloads (with multipart and multithreaded capabilities). I am hoping to get a pull request sent in 1 to 3 weeks.

kyleknap commented 10 years ago

Feature added via https://github.com/aws/aws-cli/pull/903. Closing the issue.
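
For anyone finding this issue later: if I am reading the merged change correctly, aws s3 cp treats - as the local file argument for streaming to and from standard input/output, so the pipeline from the original post becomes (bucket names are placeholders):

$ aws s3 cp s3://bucket/foo - | process.sh | aws s3 cp - s3://bucket/bar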

MegaByte commented 10 years ago

It looks like that feature covers single files. I was hoping to have streaming for entire subdirectories. Should another request be opened?

phemmer commented 10 years ago

How would you stream a directory? The only way to have multiple files in a single stream is to bundle them in something like a tarball. Otherwise how will you differentiate the end of one file from the beginning of the next?
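
For example, bundling with tar is roughly what I have in mind (directory and bucket names are placeholders):

$ tar -cf - dir/ | aws s3 cp - s3://bucket/dir.tar     # bundle the directory and stream it up
$ aws s3 cp s3://bucket/dir.tar - | tar -xf -          # stream it back down and unbundle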

kyleknap commented 10 years ago

Or what would your use case be if you do not mind me asking?

MegaByte commented 10 years ago

That's right. For my use case, multiple computers dump output into multiple files, but the data is newline-delimited text -- I don't need any way of differentiating where one file ends and the next begins. Think Hive or certain types of ETL processing. Sure, it could all be one giant file, but keeping the files separate makes it much easier to work with subsets of the data.

kyleknap commented 10 years ago

Makes sense. You are correct: it only covers single files. To get the initial S3 streaming implementation into the CLI, we decided to handle only single-file cp (both multithreaded and multipart) for now. Since it is not possible to stream multiple files into S3, this would be a feature only for downloading S3 objects as a stream, correct? I suggest you open a new issue for the feature, and we will look into it.

phemmer commented 10 years ago

I know I'm nobody special here, but I would vote against such a feature. I think this is an edge case which very few people would want, and it can be easily implemented via the shell.

( for file in $(aws s3 ls s3://bucket/ | awk '{ print $NF }'); do aws s3 cp "s3://bucket/$file" -; done ) | program_that_reads_the_input

kyleknap commented 10 years ago

You can also add --recursive to the ls command to recurse an entire directory, which would emulate the --recursive parameter for cp.
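
Combining the two, a sketch (assuming object keys contain no whitespace; the bucket, prefix, and program_that_reads_the_input are placeholders; aws s3 ls --recursive prints date, time, size, and key, so the key is the fourth field):

$ ( for key in $(aws s3 ls --recursive s3://bucket/prefix/ | awk '{ print $4 }'); do aws s3 cp "s3://bucket/$key" -; done ) | program_that_reads_the_input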

rlmcpherson commented 10 years ago

I just tested this in EC2 and the upload uses a massive amount of memory. For instance, when uploading a 9 GB file it reads almost the entire stream into memory, topping out around 6.5 GB of real memory usage. My guess is that the queue limiting is not working correctly. I can open a separate issue though.

It's great to see this in the aws-cli finally, though. It's an insanely useful feature, and the lack of it is the reason I wrote https://github.com/rlmcpherson/s3gof3r in the first place.
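
A rough way to observe this for anyone who wants to reproduce it (bucket and key are placeholders; assumes bash and GNU head for the 9G size suffix):

$ head -c 9G /dev/zero | aws s3 cp - s3://bucket/streaming-test &
$ while kill -0 $! 2>/dev/null; do ps -o rss= -p $!; sleep 5; done   # print the cp process's resident memory (KB) every 5 seconds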