Describe the bug
This is basically the same as https://github.com/aws/aws-sdk-go-v2/issues/1302, but I think there's a possible optimization to be done to improve memory usage.
When uploading small files that don't require multipart uploads, using a reader that doesn't support seeking, the uploader code always allocates at least a 5MB buffer (named partSize in the code) to read the data into. For example, if you upload a 10KB file, 5MB will be allocated. If you upload 200 10KB files in parallel, ~1GB of memory will be allocated instead of the ~2MB that's actually needed.

The 5MB value comes from the minimum part size for multipart uploads, but I don't see a reason to couple it with the buffer that's used for reading from the stream.
Expected Behavior
Only allocate the memory needed to upload the file when it's under 5MB
Current Behavior
Memory consumption is unnecessarily high
Reproduction Steps
https://github.com/aws/aws-sdk-go-v2/issues/1302 has those in detail
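Roughly, the pattern that triggers this looks like the sketch below (bucket, key and the 10KB payload are placeholders; any Body the uploader can't seek on takes the buffered path):

package main

import (
	"bytes"
	"context"
	"io"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/s3/manager"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	uploader := manager.NewUploader(s3.NewFromConfig(cfg))

	payload := make([]byte, 10*1024) // 10KB object
	// io.NopCloser hides Seek/ReadAt from the uploader, so it reads the body
	// into a pooled buffer of partSize (5MB by default) even for 10KB of data.
	body := io.NopCloser(bytes.NewReader(payload))

	if _, err := uploader.Upload(context.TODO(), &s3.PutObjectInput{
		Bucket: aws.String("example-bucket"), // placeholder
		Key:    aws.String("example-key"),    // placeholder
		Body:   body,
	}); err != nil {
		log.Fatal(err)
	}
}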
Possible Solution
Solution 1: Use a LimitReader to read up to partSize for the first read instead of using the pool. If the data is below 5MB, don't use the pool at all; if it's above, fall back to the old behavior (see proposal here).

Solution 2: Drop the pool altogether and let the GC manage memory. Always use a LimitReader to read up to partSize and let the GC deal with allocations. The problem with the pool is that it isn't cleared until the upload finishes: with concurrency 5 and a part size of 100MB, that memory is not released until the end of the upload (assuming the file is bigger than 500MB). If one of the last parts of the upload is slow for some reason, the memory for the already-uploaded parts is still held. The GC should be smart enough (I hope) to reuse those buffers, and to reclaim them if another part of the process needs the memory more.
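A rough sketch of the LimitReader idea behind both solutions (illustrative only, not the actual uploader code; probeFirstPart is a made-up name):

import "io"

// probeFirstPart reads at most partSize+1 bytes from r without touching the
// part-buffer pool. singlePart reports whether everything fit in one part, in
// which case the pooled 5MB buffer is never needed; otherwise the caller falls
// back to the existing pooled multipart path, re-prepending head with
// io.MultiReader(bytes.NewReader(head), r).
func probeFirstPart(r io.Reader, partSize int64) (head []byte, singlePart bool, err error) {
	head, err = io.ReadAll(io.LimitReader(r, partSize+1))
	if err != nil {
		return nil, false, err
	}
	return head, int64(len(head)) <= partSize, nil
}

For a 10KB body this allocates roughly the size of the body (io.ReadAll grows its buffer as needed) instead of a fixed 5MB part buffer; Solution 1 would keep the pool for the multipart fallback, Solution 2 would drop it entirely.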
Additional Information/Context
Our current workaround is to do:

// Non seeker reader
allData, err := io.ReadAll(r)
if err != nil {
	return err
}
// Pass this to uploader
readerSeeker := bytes.NewReader(allData)

But that has the downside of allocating a large byte array for large files.
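A sketch of what we may switch to instead, applying the same LimitReader idea on the caller side (uploadBody is a made-up helper; the 5MB constant mirrors the uploader's default/minimum part size): buffer at most one part in memory and stream anything bigger, so large files aren't fully materialized:

import (
	"bytes"
	"context"
	"io"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/feature/s3/manager"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// uploadBody buffers at most one part of r. Small bodies are handed to the
// uploader as a bytes.Reader (seekable, so the pooled 5MB buffer is skipped);
// larger bodies are streamed via io.MultiReader, which still takes the pooled
// path but avoids holding the whole file in memory like io.ReadAll does.
func uploadBody(ctx context.Context, uploader *manager.Uploader, bucket, key string, r io.Reader) error {
	const partSize = 5 * 1024 * 1024 // mirrors the default/minimum part size

	head, err := io.ReadAll(io.LimitReader(r, partSize+1))
	if err != nil {
		return err
	}

	var body io.Reader = bytes.NewReader(head)
	if len(head) > partSize {
		body = io.MultiReader(bytes.NewReader(head), r)
	}

	_, err = uploader.Upload(ctx, &s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
		Body:   body,
	})
	return err
}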
AWS Go SDK V2 Module Versions Used
go get -u github.com/aws/aws-sdk-go-v2/...
Compiler and Version used
go version go1.22.2 darwin/arm64
Operating System and version
MacOS 14.5 (23F79)