aws / aws-cli

Universal Command Line Interface for Amazon Web Services

Inefficient CPU usage of `aws s3 cp` #2791

Open jklontz opened 7 years ago

jklontz commented 7 years ago

This issue arises primarily when copying a list of files from S3. My understanding is that the suggested approach is to copy the files individually by invoking aws s3 cp, for example:

cat files.txt | parallel aws s3 cp s3://remote/path/{/} /local/path/{/}

In our experience, for files under about 1MB (we work with large image datasets), copy time is CPU bound by the aws process.

By comparison, the following script, which serves as an alternative to aws s3 cp, uses about one tenth of the CPU by my measurements (and thus downloads files much faster):

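#!/bin/bash
# Arguments: ${1} = S3 input path (appended to https://s3.amazonaws.com), ${2} = local output path.
# Reads credentials from AWS_ACCESS_KEY and AWS_SECRET_KEY.
contentType="text/html; charset=UTF-8"
date="$(date -u +'%a, %d %b %Y %H:%M:%S GMT')"
# String to sign: verb, Content-MD5 (empty), Content-Type, Date (empty), x-amz-date header, resource.
string="GET\n\n${contentType}\n\nx-amz-date:${date}\n${1}"
# HMAC-SHA1 over the string to sign, base64-encoded; echo -e expands the \n escapes.
signature="$(echo -en "${string}" | openssl sha1 -hmac "${AWS_SECRET_KEY}" -binary | base64)"
curl -o "${2}" -s \
     -H "x-amz-date: ${date}" \
     -H "Content-Type: ${contentType}" \
     -H "Authorization: AWS ${AWS_ACCESS_KEY}:${signature}" \
     "https://s3.amazonaws.com${1}"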

Where ${1} is the S3 input path and ${2} is the local output path.
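
A minimal sketch of how the script above might be run over a whole list. Everything named here is an assumption rather than part of the original post: the script is saved under the hypothetical name ./s3get.sh, the list holds one file name per line as in the parallel example, and xargs stands in for parallel.

```shell
# Sketch only: s3get.sh is a hypothetical name for the curl-based script above.
# Usage: fetch_list LIST REMOTE_PREFIX LOCAL_DIR [FETCHER]
fetch_list() {
    list="$1"
    remote_prefix="$2"
    local_dir="$3"
    fetcher="${4:-./s3get.sh}"
    # -I{} passes one list line per invocation; -P 8 keeps up to eight running at once.
    xargs -P 8 -I{} "$fetcher" "${remote_prefix}/{}" "${local_dir}/{}" < "$list"
}

# Example: fetch_list files.txt /remote/path /local/path
```

Each file still costs one process launch, but the processes launched are bash, openssl, and curl rather than a Python interpreter.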

I profiled the aws s3 cp command, and most of the time appears to be spent by the Python interpreter initializing the execution environment. If nothing can be done to speed this up, it would be helpful to have an aws command that copies a list of files in parallel, so that this startup cost is not incurred again for every file. This appears possible, as aws s3 sync doesn't consume nearly as much CPU, but it doesn't offer an interface suitable for copying a specific list of files.

kyleknap commented 7 years ago

@jklontz I can see that happening, especially if parallel is being used, because by default each invocation of the CLI command spins up 10 threads. Have you tried lowering the number of threads by configuring max_concurrent_requests? That may improve CPU usage. The only other option (which is not great) is to use a single cp --recursive command with the --exclude and --include parameters to include only the files from the list. The big problem with this is that you still have to iterate over all of the keys under the prefix.
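
A rough sketch of the two suggestions above. The configure line uses the documented max_concurrent_requests setting for the default profile. The helper build_cp_command is a hypothetical name, not part of the CLI: it only prints a single cp --recursive command for a list of file names so the command can be inspected before running it.

```shell
# Lower the per-invocation thread count (shown as a comment so nothing runs on load):
#   aws configure set default.s3.max_concurrent_requests 1

# Hypothetical helper: print one "aws s3 cp --recursive" command that excludes
# everything, then re-includes each name listed (one per line) in LIST.
# Usage: build_cp_command LIST SRC DEST
build_cp_command() {
    list="$1"
    src="$2"
    dest="$3"
    printf 'aws s3 cp %s %s --recursive --exclude "*"' "$src" "$dest"
    while IFS= read -r name; do
        # Skip blank lines in the list.
        if [ -n "$name" ]; then
            printf ' --include "%s"' "$name"
        fi
    done < "$list"
    printf '\n'
}

# Example: build_cp_command files.txt s3://remote/path /local/path
```

The generated command still has to iterate over every key under the prefix, and a long enough list will run into the operating system's limit on command-line length.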

Otherwise, I am going to mark this as a feature request. It is also similar to this feature request: https://github.com/aws/aws-cli/issues/2463, which proposes using a bucket manifest as the source for objects to transfer.

jklontz commented 7 years ago

Thanks for the response @kyleknap. I looked into your suggestion on max_concurrent_requests and did not see an improvement when setting it to 1 before using parallel.

Profiling the aws s3 cp command on a single file download, I see that 95% of the CPU usage is spent in the main thread (presumably initializing the application as hypothesized). And then there is one worker thread consuming 5% of the CPU usage (presumably downloading the file).

Unfortunately, cp --recursive with --exclude and --include is not a viable solution for our use case which involves downloading random sets of 10k to 1M files from a directory of >10M files.

ASayre commented 6 years ago

Good Morning!

We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.

This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.

As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.

We’ve imported existing feature requests from GitHub - Search for this issue there!

And don't worry, this issue will still exist on GitHub for posterity's sake. As it’s a text-only import of the original post into UserVoice, we’ll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.

GitHub will remain the channel for reporting bugs.

Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface

-The AWS SDKs & Tools Team

jamesls commented 6 years ago

Based on community feedback, we have decided to return feature requests to GitHub issues.

tim-finnigan commented 2 years ago

Checking in as this issue hasn't received any community comments in a few years. There have since been a few articles published related to this topic:

@jklontz have you tried any of the suggestions recommended in those articles? Please let us know if this is still an issue you're seeing and what else you have tried.

jklontz commented 2 years ago

None of these links appear to address the use case of copying a list of files that can't be easily pattern matched with --recursive, --include, or --exclude. We continue to use curl instead for this use case.