aws / aws-sdk-go-v2

AWS SDK for the Go programming language.
https://aws.github.io/aws-sdk-go-v2/docs/
Apache License 2.0
2.68k stars 651 forks source link

Upload manager: When using `io.Reader` without seeking support, the uploader over allocates memory leading to high usage #2694

Open erezrokah opened 5 months ago

erezrokah commented 5 months ago

Acknowledgements

Describe the bug

This is basically the same as https://github.com/aws/aws-sdk-go-v2/issues/1302, but I think there's a possible optimization to be done to improve memory usage.

When uploading small files that don't require multi-part uploads and using a reader that doesn't support seeking, the uploader code always allocates at least a 5MB buffer (named partSize in the code) to read data from the buffer. For example if you upload a 10KB file, 5MB will get allocated. If you upload 200 10KB files in parallel, ~1GB of memory will be allocated, instead of ~2MB that's needed.

The 5MB limitation comes from the minimal upload size for multipart files, but I don't see a reason to couple it with the buffer that's used for reading from the stream.

Expected Behavior

Only allocate the memory needed for uploading the file if it's under 5MB

Current Behavior

Memory consumption is unnecessarily high

Reproduction Steps

https://github.com/aws/aws-sdk-go-v2/issues/1302 has those in detail

Possible Solution

Solution 1. Use a LimitReader to read up to partSize for the first read instead of using the pool. If it's below 5MB don't use the pool at all, if it's above revert to old behavior, see proposal here

Solution 2. Drop the pool altogether and let GC manage memory. Always use LimitReader to read up to partSize and let it deal with allocations. The thing is with the pool is that it's not cleared until the upload finishes, so let's say you have concurrency 5 and part size 100MB, that memory will not be cleared until the end of the upload (assuming you have a file bigger than 500MB). Let's say one of the last parts of the upload is slow for some reason, the memory for the already uploaded parts will still be held. The GC should be smart enough (I hope) to re-use buffers and clear them if another part of the process needs them more.

Additional Information/Context

Our current workaround is to do:

// Non seeker reader
allData, err := io.ReadAll(r)
if err != nil {
    return err
}
// Pass this to uploader
readerSeeker := bytes.NewReader(allData)

But that has the downside of allocating a large byte array for large files

AWS Go SDK V2 Module Versions Used

github.com/aws/aws-sdk-go-v2/feature/s3/manager v1.16.24

Compiler and Version used

go version go1.22.2 darwin/arm64

Operating System and version

MacOS 14.5 (23F79)

lucix-aws commented 5 months ago

Filing as a feature request, will likely be OBE as we move towards a redo of the transfer manager #2649.