aws / aws-cli

Universal Command Line Interface for Amazon Web Services

Tuneable "aggressive" timeout for slow S3 requests per "Performance Design Patterns" in S3 User Guide #8288

Open phs opened 11 months ago

phs commented 11 months ago

Describe the feature

Please give us a way to track and aggressively retry slower operations when using concurrent range requests to download single large objects ("multi-part download"), in the manner advised by the S3 user guide.

Use Case

In my EC2 instance, I'd like to use awscli to quickly fetch a small number (4) of objects (about 10 GiB in total size) from a single S3 bucket in the same region.

Ideally I'd like to saturate the instance's inbound network bandwidth (bursting up to 12.5 Gb/s) and get the job done in a few seconds, though a minute would do. This latency is on a critical path for bootstrapping the instance in an auto scaling scenario; other options for getting data onto the instance have been ruled out for independent reasons.

My objects have been uploaded with multipart upload, and I've experimented with setting the multipart threshold and chunk size to 16, 32, 64, or 128 MiB. On download I set the same parameters, along with a max concurrency of 16, 32, or 64, and I connect to a regional endpoint.
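For reference, these are the standard tuning knobs I'm adjusting; the values shown are just one of the combinations I tried:

    aws configure set default.s3.multipart_threshold 64MB
    aws configure set default.s3.multipart_chunksize 64MB
    aws configure set default.s3.max_concurrent_requests 32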

What I find is that the download proceeds quickly, typically reaching speeds between 150 and 250 MiB/s. That's good, but it's still nowhere near the instance's burst bandwidth limit (1600 MiB/s), so the process is not limited by instance network throughput. Downloading to /dev/null produces the same result, which also rules out disk write throughput as the bottleneck.

The bottleneck appears to be either in the S3 client, or upstream in the service. On repeated attempts in a loop we do in fact see improvements, as my object chunks make their way into hotter caches.

Looking for ideas, I went to the page linked above and realized I had not yet considered aggressively retrying laggard requests, as it suggests. If I watch the progress meter during a download, it does indeed start strong and then deteriorate over time as the client runs out of chunks to fetch while waiting for the slow ones. I suspect eagerly retrying slow connections might recoup 10-15% of the latency in my scenario. Since I don't seriously expect to ultimately saturate my instance's network, that would still be an interesting win.

Looking in the documentation for awscli, and ultimately at the source code for botocore and s3transfer, I could not find anywhere to set a "chunk request timeout" or the percentage of concurrent requests to retry.

Proposed Solution

The policy mentioned on the page seems reasonable to me:

For latency-sensitive applications, Amazon S3 advises tracking and aggressively retrying slower operations. When you retry a request, we recommend using a new connection to Amazon S3 and performing a fresh DNS lookup.

When you make large variably sized requests (for example, more than 128 MB), we advise tracking the throughput being achieved and retrying the slowest 5 percent of the requests. When you make smaller requests (for example, less than 512 KB), where median latencies are often in the tens of milliseconds range, a good guideline is to retry a GET or PUT operation after 2 seconds. If additional retries are needed, the best practice is to back off. For example, we recommend issuing one retry after 2 seconds and a second retry after an additional 4 seconds.

If your application makes fixed-size requests to Amazon S3, you should expect more consistent response times for each of these requests. In this case, a simple strategy is to identify the slowest 1 percent of requests and to retry them. Even a single retry is frequently effective at reducing latency.
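Purely to make the small-request guideline concrete in CLI terms, here is a rough sketch using the existing aws s3api get-object command and coreutils timeout; the bucket, key, byte range, and output file are placeholders, and this is not how I expect the feature to be implemented:

    # Illustration only: "retry after 2 seconds, then back off by an additional
    # 4 seconds" from the quoted guidance. Each invocation opens a fresh
    # connection, as the guidance recommends.
    get='aws s3api get-object --bucket my-bucket --key my-object --range bytes=0-524287 chunk.part'
    timeout 2 $get || timeout 4 $get || $get   # relies on word splitting of $get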

Defining "slowest" might be tricky, but I'm interested in the multipart upload/"download" scenario where all chunks have the same, known size. Projected chunk download time perhaps?

How the policy above is expressed in configuration doesn't worry me particularly, so long as it can be quickly dropped into the config file like other tuning parameters. If this behavior appeared but the parameters were hard-wired, that would probably also be fine.
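To illustrate what I mean, a hypothetical pair of settings might look like the following; these key names are invented for this example and do not exist in the CLI today:

    # Hypothetical settings, invented purely for illustration -- not real CLI options.
    aws configure set default.s3.laggard_chunk_timeout 2s
    aws configure set default.s3.laggard_retry_percent 5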

Other Information

No response

Acknowledgements

CLI version used

aws-cli/1.29.75 Python/3.10.12 Linux/6.5.4-76060504-generic botocore/1.31.75

Environment details (OS name and version, etc.)

Linux

tim-finnigan commented 11 months ago

Hi @phs thanks for reaching out. I brought up your feature request for discussion with the team, and one suggestion they had was to try setting your preferred_transfer_client to crt, and adjusting the target_bandwidth as documented here: https://awscli.amazonaws.com/v2/documentation/api/latest/topic/s3-config.html. You would need to install v2 of the CLI for access to these features.

(As noted in the documentation these features are currently considered experimental, but could be worth trying for your use case.)
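For example, something along these lines (the bandwidth value is just a starting point to experiment with):

    aws configure set default.s3.preferred_transfer_client crt
    aws configure set default.s3.target_bandwidth 100MB/s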

Please let us know if that improves the performance, and if there are any more data points you can share on the transfer speed.

github-actions[bot] commented 11 months ago

Greetings! It looks like this issue hasn’t been active in longer than five days. We encourage you to check if this is still an issue in the latest release. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or upvote with a reaction on the initial post to prevent automatic closure. If the issue is already closed, please feel free to open a new one.

phs commented 11 months ago

Hello, yes. I'm excited to try the idea; I should be able to get to it today or tomorrow.

phs commented 11 months ago

So the great news is my large objects are now downloading to /dev/null at roughly 700 MiB/s!

The bad news is that this appears to be the case regardless of the target_bandwidth setting; my 1600MB/s value is apparently being ignored.

My chunk size is currently still 128 MiB, meaning my largest file has roughly 50 chunks. I'm going to see what happens if I drop it back down to 32 MiB. EDIT: With 32 MiB chunks, we get the same result. But that's still quite good!

phs commented 11 months ago

Even at 700 MiB/s, in practice I'm struggling to match that with write throughput, so disk rather than the download is now the bottleneck (my aggregate throughput to EBS combined with instance storage caps out at around 650 MiB/s). I think we're good here.

I do have one piece of feedback for the team responsible for the crt client. One of its requirements is that it read and write files on a filesystem rather than a pipe. That requirement makes sense: presumably the client downloads chunks out of order and uses something like mmap to write them directly to their target ranges as they arrive.

Aside from technical hurdles in the implementation, I can imagine it's not obvious to them that a user may want to keep all of those chunks in memory (one risks blowing out RAM). Since the sizes involved in my use case permit it, and disk write throughput is precious, I definitely do want to write and hold all of those chunks in RAM.

That's easy for me to do: I can use a tmpfs (perhaps with a -o size=limit option) to hold downloaded files and hand them off to e.g. zstd | tar once they finish. The problem is that I then need to wait for the downloads to finish before I can get at my data. My download+unpack process, which is now down to just under a minute (thank you!), could probably drop another 30% if I didn't have to wait for the first download to completely finish before starting the decompress. I can and will twiddle the count and sizes of my downloaded files to help pipeline that, but it is getting rather fiddly.
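Concretely, the workaround looks something like this (the mount point, size, and object names are placeholders):

    # RAM-backed scratch space, so disk write throughput stays out of the picture.
    sudo mount -t tmpfs -o size=12g tmpfs /mnt/scratch

    # Download fully into RAM first...
    aws s3 cp s3://my-bucket/data.tar.zst /mnt/scratch/data.tar.zst

    # ...and only once the whole object has landed, decompress and unpack it.
    zstd -dc /mnt/scratch/data.tar.zst | tar -x -C /opt/data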

The ask for the crt team is to instead offer an option to hold downloaded chunks in memory (perhaps up to some configured limit), so the client can once again stream chunks out to a pipe, in order, as they become available.
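Concretely, the pipeline I'd like to be able to run against the crt client is something like this (bucket and paths are placeholders); this kind of streaming to stdout works with the default transfer client today, but, as noted above, the crt client wants a real file:

    # Stream chunks to stdout in order as they become available, holding them in RAM.
    aws s3 cp s3://my-bucket/data.tar.zst - | zstd -dc | tar -x -C /opt/data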