durch / rust-s3

Rust library for interfacing with S3 API compatible services
MIT License

New `_put_object_stream_with_content_type` doesn't work for larger files. #346

Closed: robinfriedli closed this issue 4 weeks ago

robinfriedli commented 1 year ago

Describe the bug: The new implementation of `_put_object_stream_with_content_type` introduced in 0.33.0 creates various issues for larger files. The implementation now loads all chunks and then tries to upload them in parallel, which means:

- the entire file ends up buffered in memory, and
- uploads of larger files fail.

I rolled back to 0.32 for now, which immediately fixes both of these issues.

To Reproduce: Upload a large file (2 GB or larger).
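
A minimal reproduction sketch, assuming the usual tokio-backed `put_object_stream` usage; bucket name, region, credentials, and file path are placeholders:

```rust
use s3::{creds::Credentials, Bucket, Region};
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let bucket = Bucket::new(
        "my-bucket",             // placeholder bucket
        Region::EuCentral1,      // placeholder region
        Credentials::default()?, // credentials from the environment
    )?;

    // Any file of 2 GB or more triggers the problem: with 0.33 the whole
    // file ends up buffered in memory and the upload eventually fails.
    let mut file = File::open("/path/to/large-file.bin").await?;
    bucket.put_object_stream(&mut file, "/large-file.bin").await?;
    Ok(())
}
```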

Expected behavior: `put_object_stream` should not load the entire file into memory (or it should at least offer a separate function or parameter that doesn't), and it should not fail for larger files.


durch commented 4 weeks ago

Should be fixed in 0.34

lackner-codefy commented 3 weeks ago

@durch I'm not OP, but also affected by this issue. Are you sure that it's fixed? From a first look, the code still seems to load all chunks into memory and send unlimited requests in parallel to S3. Will do some more thorough testing when upgrading the dependency.

robinfriedli commented 3 weeks ago

Haven't tried it yet, but the linked commit does change the behaviour: once three or more upload handles are in flight, it waits for one upload to complete before starting the next one (and before reading the next chunk). It should kinda fix this, but it seems like a weird solution. If the reader is very slow, it would probably still result in the same problem where the second upload doesn't receive any data for too long and times out.
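
For illustration, a minimal sketch of that pattern, not the code from the linked commit; `upload_part`, the chunk iterator, and the limit of three are placeholders:

```rust
use futures::stream::{FuturesUnordered, StreamExt};

const MAX_IN_FLIGHT: usize = 3; // the limit described above

// Placeholder for the real multipart "upload part" request.
async fn upload_part(part_number: u32, chunk: Vec<u8>) -> std::io::Result<()> {
    let _ = (part_number, chunk);
    Ok(())
}

async fn upload_chunks(chunks: impl Iterator<Item = Vec<u8>>) -> std::io::Result<()> {
    let mut in_flight = FuturesUnordered::new();
    for (i, chunk) in chunks.enumerate() {
        // Once the limit is reached, wait for one upload to finish before
        // starting the next one (and, in the streaming version, before
        // reading the next chunk).
        if in_flight.len() >= MAX_IN_FLIGHT {
            in_flight.next().await.transpose()?;
        }
        in_flight.push(upload_part(i as u32 + 1, chunk));
    }
    // Drain the remaining uploads.
    while let Some(res) = in_flight.next().await {
        res?;
    }
    Ok(())
}
```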

I don't really understand the point of collecting multiple chunks to upload them in parallel to begin with. In almost all cases, the bottleneck is upload bandwidth. Uploading multiple chunks at once doesn't make it faster, just like downloading all files at once isn't faster than downloading one after the other, provided that the bandwidth is bottlenecked on your end. You'll just split bandwidth between chunks instead of uploading each one at full bandwidth.

Why not just upload each chunk as soon as it is read, just like before? When uploading a file from disk, reading each chunk is much faster than uploading it, so I can kinda see the idea behind reading several chunks quickly and uploading them at the same time (though again, I doubt there is much benefit, because it just makes each individual upload slower). But in my use case, the reader is a file upload to my server which is then streamed to S3. That uploader might be on terrible internet, in which case reading the chunks would be much slower than uploading them, and this approach makes even less sense. In my mind, the only way uploading chunks in parallel makes sense is if they are also read in parallel, which they are not, and in my case can't be.

If there is a use case for uploading chunks in parallel that I'm not seeing, I at least think there should be a separate function that goes back to the original behaviour and simply uploads each chunk after reading it, to avoid complexity that isn't needed and may even break other use cases.

Another thing I noticed: what happens if the file size is an exact multiple of `CHUNK_SIZE`? `chunk.len() < CHUNK_SIZE` would be false for the last chunk, so `done` stays false and the loop would not terminate there. Would the next iteration read and upload 0 bytes, and would that cause an error?
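
To make both points concrete, here is a minimal sketch of the sequential behaviour I mean, with a termination check that also handles the exact-multiple case; this is not the library's code, and `upload_part` and `CHUNK_SIZE` are placeholders:

```rust
use tokio::io::{AsyncRead, AsyncReadExt};

const CHUNK_SIZE: usize = 8 * 1024 * 1024; // assumed part size

// Placeholder for the real multipart "upload part" request.
async fn upload_part(part_number: u32, chunk: &[u8]) -> std::io::Result<()> {
    let _ = (part_number, chunk);
    Ok(())
}

async fn put_stream_sequential<R: AsyncRead + Unpin>(reader: &mut R) -> std::io::Result<()> {
    let mut part_number = 0u32;
    loop {
        // Fill a buffer up to CHUNK_SIZE, stopping early on EOF.
        let mut chunk = vec![0u8; CHUNK_SIZE];
        let mut filled = 0;
        while filled < CHUNK_SIZE {
            let n = reader.read(&mut chunk[filled..]).await?;
            if n == 0 {
                break; // EOF
            }
            filled += n;
        }
        chunk.truncate(filled);

        // An empty chunk means the previous chunk ended exactly at EOF
        // (i.e. the stream length is a multiple of CHUNK_SIZE); finish
        // without uploading a zero-byte part.
        if chunk.is_empty() {
            break;
        }

        part_number += 1;
        upload_part(part_number, &chunk).await?;

        // A short chunk means EOF was hit inside this chunk.
        if chunk.len() < CHUNK_SIZE {
            break;
        }
    }
    Ok(())
}
```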

lackner-codefy commented 3 weeks ago

@robinfriedli Note that the linked commit isn't in master; it's my attempt to fix this bug and still lives in a separate repo.

Re uploading sequentially or in parallel: for a sufficiently fast connection, there is probably no benefit in uploading in parallel. I'm not sure if there is a benefit for slow connections (e.g., when TCP ACKs are delayed it might impact speed?), that's why I kept the logic in there for now. But you're right, if it isn't actually necessary for performance reasons, the whole function can be simplified quite a bit.

You're also right that the logic regarding `chunk.len() < CHUNK_SIZE` is probably incorrect. I'm not sure if it will fail or just do an unnecessary 0-byte upload, but I'll check that and update my branch if necessary.

robinfriedli commented 3 weeks ago

@lackner-codefy oh true, I didn't even check whether it was merged. That makes no sense then; I see no change in master that would fix this.