aws / aws-sdk-java-v2

The official AWS SDK for Java - Version 2
Apache License 2.0

Multi-part parallel upload via InputStreams #3960

Open vikasvb90 opened 1 year ago

vikasvb90 commented 1 year ago

Describe the feature

Currently, TransferManager supports multi-part parallel upload only when a file is used as the input. If an InputStream is provided, the multi-part upload becomes sequential: parts are lined up in order and uploaded one after another. This is a proposal to accept multiple InputStreams, each emitted for a specific portion of the content, and to perform concurrent uploads in which each stream is responsible for transferring a specific part. At the end of the upload, a complete-multipart-upload request is triggered to merge the uploaded parts.

Use Case

There are scenarios where content must be pre-processed before it is transferred. Suppose this pre-processing can be carried out independently for different parts of the content. To perform both the pre-processing and the transfer, we have two options: either the content is read, pre-processed, and transferred serially through a single InputStream, or the content is first fully pre-processed (processing different parts concurrently) and then transferred as a separate operation. The first approach avoids duplicate reads but makes the whole process slow; the second loses the streaming nature of the content and ends up reading it twice.

Proposed Solution

What we need here is the ability to apply the whole pipeline of streaming operations to each unit of content, rather than applying each individual operation to the whole content. Multiple InputStreams emitted for different parts of the content can solve this: each stream applies all streaming operations (pre-processing, transfer, post-processing) to its unit of content. This can be achieved by assigning an InputStream to a PutObjectRequest in the multi-part upload process and carrying out the transfer in parallel.
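A rough sketch of the idea using the low-level multipart APIs that exist today. The per-part stream supplier and part-size function are assumptions for illustration only; the request/response types are the SDK's:

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.IntFunction;
import java.util.stream.Collectors;

import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.CompletedMultipartUpload;
import software.amazon.awssdk.services.s3.model.CompletedPart;

public class PerPartStreamUpload {

    // partStreamProvider and partSize are hypothetical: each call yields a fresh,
    // already pre-processed InputStream for that part, plus its length in bytes.
    public static void upload(S3AsyncClient s3, String bucket, String key,
                              int partCount, IntFunction<InputStream> partStreamProvider,
                              IntFunction<Long> partSize) {
        ExecutorService executor = Executors.newFixedThreadPool(partCount);

        String uploadId = s3.createMultipartUpload(b -> b.bucket(bucket).key(key))
                            .join()
                            .uploadId();

        // Upload all parts concurrently; each part reads from its own InputStream.
        List<CompletableFuture<CompletedPart>> partFutures = new ArrayList<>();
        for (int i = 0; i < partCount; i++) {
            int partNumber = i + 1;
            AsyncRequestBody body =
                AsyncRequestBody.fromInputStream(partStreamProvider.apply(i), partSize.apply(i), executor);
            CompletableFuture<CompletedPart> future =
                s3.uploadPart(b -> b.bucket(bucket).key(key).uploadId(uploadId).partNumber(partNumber), body)
                  .thenApply(resp -> CompletedPart.builder().partNumber(partNumber).eTag(resp.eTag()).build());
            partFutures.add(future);
        }

        List<CompletedPart> completedParts =
            partFutures.stream().map(CompletableFuture::join).collect(Collectors.toList());

        // Merge the uploaded parts.
        s3.completeMultipartUpload(b -> b.bucket(bucket).key(key).uploadId(uploadId)
                .multipartUpload(CompletedMultipartUpload.builder().parts(completedParts).build()))
          .join();

        executor.shutdown();
    }
}
```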

Other Information

We were able to achieve multi-part parallel upload using streams for async uploads with S3AsyncClient. We were also able to customize the v1 TransferManager and achieve the same there. We haven't had a chance to accommodate these changes in the v2 TransferManager yet, though. Following is a reference PR for the async upload: https://github.com/opensearch-project/OpenSearch/pull/7217/files#diff-a472eda8bf5f051224172440b745603502d3545f8326b28d6014a79d5a833cd5

An iterable pattern is used in the referenced class to create a stream for each part: Stream stream = streamContext.getStreamProvider().provideStream(partIdx)
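In rough terms, the pattern looks like the following. The type names mirror that snippet and are assumptions about the PR's code (simplified to plain InputStream here); they are not SDK types:

```java
import java.io.InputStream;

// Hypothetical shape of the iterable pattern referenced above: the caller hands the
// uploader a provider, and the uploader asks for one stream per part index.
interface StreamProvider {
    InputStream provideStream(int partIdx);
}

interface StreamContext {
    StreamProvider getStreamProvider();
    int getNumberOfParts();
}
```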

Acknowledgements

AWS Java SDK version used

2.20.26

JDK version used

11

Operating System and version

macOS

zoewangg commented 1 year ago

Apologies for the delayed response.

If an InputStream is provided, the multi-part upload becomes sequential: parts are lined up in order and uploaded one after another.

Reading from the input stream is sequential, but uploading can be concurrent. Under the hood, the AWS CRT-based S3 client buffers each part, and whenever a part is available it sends it, so multiple parts may be uploaded at the same time.
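For reference, a minimal sketch of that setup. The bucket, key, and part size are placeholders; the stream is still read sequentially, but buffered parts are uploaded concurrently by the CRT-based client:

```java
import java.io.InputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.transfer.s3.S3TransferManager;
import software.amazon.awssdk.transfer.s3.model.Upload;

public class CrtStreamUpload {
    // contentLength must be known up front when uploading from an InputStream.
    public static void upload(InputStream content, long contentLength) {
        // CRT-based client: parts are buffered as the stream is read and sent as they fill up.
        S3AsyncClient crtClient = S3AsyncClient.crtBuilder()
                                               .minimumPartSizeInBytes(8 * 1024 * 1024L)
                                               .build();
        S3TransferManager tm = S3TransferManager.builder().s3Client(crtClient).build();
        ExecutorService executor = Executors.newFixedThreadPool(1);

        Upload upload = tm.upload(b -> b
                .putObjectRequest(r -> r.bucket("my-bucket").key("my-key"))
                .requestBody(AsyncRequestBody.fromInputStream(content, contentLength, executor)));
        upload.completionFuture().join();
        executor.shutdown();
    }
}
```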

This can be achieved by assigning an InputStream to a PutObjectRequest in the multi-part upload process and carrying out the transfer in parallel.

This is an interesting idea, but how would the SDK know how many parts are available? I guess the SDK could take a list of InputStreams and infer the number of parts from the size of the list. Alternatively, we may be able to provide better support for a seekable AsyncRequestBody, say SeekableAsyncRequestBody, that allows the SDK to read data from any offset, just like reading from a file. This way, we may be able to achieve high throughput.
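Purely to illustrate the shape of that alternative; SeekableAsyncRequestBody does not exist in the SDK, and the method below is made up:

```java
import java.nio.ByteBuffer;

import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.core.async.SdkPublisher;

// Hypothetical: a request body the SDK could read from at arbitrary offsets,
// the way FileAsyncRequestBody reads a file, so parts can be fetched in parallel.
interface SeekableAsyncRequestBody extends AsyncRequestBody {
    // Publish `length` bytes starting at `offset`; callable concurrently for different ranges.
    SdkPublisher<ByteBuffer> readRange(long offset, long length);
}
```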

vikasvb90 commented 1 year ago

@zoewangg Can you please explain a bit more about how seekable content would provide higher throughput? You mentioned it as an alternative idea, so are you proposing to build on top of a single seekable InputStream? This might not work in all cases, because the underlying decorators of the provided stream might depend on the specific offsets from which content should be read. One example is frame encryption, where the smallest unit of the content is a frame; if a frame is processed partially, it produces corrupted content. I believe that control over providing the (sequential) streams should still lie with the callers.

zoewangg commented 8 months ago

This is possible now with the multipart S3 client, i.e., S3AsyncClient.builder().multipartEnabled(true).build(), but you'd need to implement your own AsyncRequestBody and override the split method so that it returns the sub-streams in the form of AsyncRequestBody instances.

https://github.com/aws/aws-sdk-java-v2/blob/65f7554f829c85c3ec26425d79beb263ce275cc0/core/sdk-core/src/main/java/software/amazon/awssdk/core/async/AsyncRequestBody.java#L472

You can check out the example we have in FileAsyncRequestBody#split, which enables the SDK to read at different offsets at the same time.

https://github.com/aws/aws-sdk-java-v2/blob/master/core/sdk-core/src/main/java/software/amazon/awssdk/core/internal/async/FileAsyncRequestBody.java#L84
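A minimal sketch of what such a custom AsyncRequestBody might look like, assuming the caller can supply one InputStream (plus its length) per part. The class name and constructor are illustrative, not SDK API:

```java
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ExecutorService;

import org.reactivestreams.Subscriber;

import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.core.async.AsyncRequestBodySplitConfiguration;
import software.amazon.awssdk.core.async.SdkPublisher;
import software.amazon.awssdk.utils.async.SimplePublisher;

// Illustrative only: adapts a list of caller-provided InputStreams (one per part) so the
// multipart-enabled client uploads each part from its own stream.
final class PerPartAsyncRequestBody implements AsyncRequestBody {
    private final List<InputStream> partStreams; // one pre-processed stream per part
    private final List<Long> partSizes;          // content length of each part
    private final long totalSize;
    private final ExecutorService executor;

    PerPartAsyncRequestBody(List<InputStream> partStreams, List<Long> partSizes, ExecutorService executor) {
        this.partStreams = partStreams;
        this.partSizes = partSizes;
        this.totalSize = partSizes.stream().mapToLong(Long::longValue).sum();
        this.executor = executor;
    }

    @Override
    public Optional<Long> contentLength() {
        return Optional.of(totalSize);
    }

    @Override
    public SdkPublisher<AsyncRequestBody> split(AsyncRequestBodySplitConfiguration config) {
        // Ignore the SDK's suggested part sizing and emit one sub-body per caller-provided stream.
        SimplePublisher<AsyncRequestBody> parts = new SimplePublisher<>();
        for (int i = 0; i < partStreams.size(); i++) {
            parts.send(AsyncRequestBody.fromInputStream(partStreams.get(i), partSizes.get(i), executor));
        }
        parts.complete();
        return SdkPublisher.adapt(parts);
    }

    @Override
    public void subscribe(Subscriber<? super ByteBuffer> subscriber) {
        // Fallback for non-multipart callers: concatenate the parts and stream them sequentially.
        InputStream whole = new SequenceInputStream(Collections.enumeration(partStreams));
        AsyncRequestBody.fromInputStream(whole, totalSize, executor).subscribe(subscriber);
    }
}
```

Passing an instance of this body to putObject on a multipart-enabled S3AsyncClient should then result in one UploadPart call per provided stream.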

denizk commented 8 months ago

Sorry for chiming in. @zoewangg, can you clarify whether the note below (from the Amazon S3 user guide) applies here or not?

When you're using a stream for the source of data, the TransferManager class does not do concurrent uploads. https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingTheMPJavaAPI.html

I assume this is only true if not using the CRT client.