alexmojaki / s3-stream-upload

Manages streaming of data to AWS S3 without knowing the size beforehand and without keeping it all in memory or writing to disk.
MIT License

S3 stream function #13

Closed sivagiri closed 5 years ago

sivagiri commented 5 years ago

Hi,

I tried your example, but I'm not able to pass any file or stream. Could you help with this, or share a good example if you have one?


AmazonS3 s3Client = amazonS3Client();
int numStreams = 4;
final StreamTransferManager manager = new StreamTransferManager(bucketName, fileName, s3Client)
        .numStreams(numStreams)
        .numUploadThreads(2)
        .queueCapacity(2)
        .partSize(100);

manager.getMultiPartOutputStreams();

// Finishing off
manager.complete();

Thanks...
alexmojaki commented 5 years ago

You get the streams from manager.getMultiPartOutputStreams(); you can't pass your own stream.

sivagiri commented 5 years ago

How could I pass my file or stream? With the AWS SDK I used the following (the request contains a File, as with PutObject):

TransferManager tm = getTransferManager();
Upload upload = tm.upload(request);
UploadResult ur = upload.waitForUploadResult();

In your case, how do I use StreamTransferManager to upload a big file or stream to AWS?

alexmojaki commented 5 years ago
List<MultiPartOutputStream> streams = manager.getMultiPartOutputStreams();
streams.get(0).write("stuff".getBytes());

You cannot pass your own stream, you must write to the stream from the manager.

If your data is already in a file, there is no point in using this library. This is for avoiding files and doing everything in memory.
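For reference, a rough sketch of that flow, based on the snippets in this thread (bucketName, keyName and s3Client are placeholders, and the single-stream case is assumed):

AmazonS3 s3Client = ...;
StreamTransferManager manager = new StreamTransferManager(bucketName, keyName, s3Client)
        .numStreams(1)
        .numUploadThreads(2)
        .queueCapacity(2);

// The manager hands you the stream; you write your data into it.
MultiPartOutputStream out = manager.getMultiPartOutputStreams().get(0);
out.write("stuff".getBytes());   // typically called repeatedly as data is produced
out.close();                     // signal that no more data is coming for this stream

// Finishing off
manager.complete();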

sivagiri commented 5 years ago

Hi,

I have a 5 GB stream and I'm not able to use your library. Where can I send the stream?


alexmojaki commented 5 years ago

Provide a lot more code and context. The sample you gave is not very informative. You mentioned a file, but apparently you don't want to use one? Why are you unable to use the streams from manager.getMultiPartOutputStreams()? How is the stream you have being created?

sivagiri commented 5 years ago

I am working on it. Currently only one part is uploaded; I will share the code ASAP.


iceagebuck commented 5 years ago

I am facing the same problem. Requirement: upload multiple large files (up to 10 GB) to AWS S3 without loading them into memory or saving them to disk.

Current setup: a Spring Boot based API which accepts the file as multipart. The application uses Apache commons-fileupload to extract the request content as a stream along with the form fields. Now, how can I write this stream with MultiPartOutputStream.write()? Converting it into a byte[] will load the whole stream into memory.


@RequestMapping(value = "/api/upload", method = RequestMethod.POST)
public String handleUploadWithoutSize(HttpServletRequest request) {
    ServletFileUpload upload = new ServletFileUpload();
    FileItemIterator iterStream = upload.getItemIterator(request);
    while (iterStream.hasNext()) {
        FileItemStream item = iterStream.next();
        if (!item.isFormField()) {
            InputStream stream = item.openStream();
            StreamTransferManagerService.write(Streams.asString(stream).getBytes());
        } else {
            // process form fields
        }
    }

StreamTransferManagerService
//StreamTransferManager configuration for s3 and others

final List<MultiPartOutputStream> streams = manager.getMultiPartOutputStreams();
//List<StringBuilder> builders = new ArrayList<StringBuilder>(numStreams);
ExecutorService pool = Executors.newFixedThreadPool(numStreams);
for (int i = 0; i < numStreams; i++) {
    final int streamIndex = i;
    final StringBuilder builder = new StringBuilder();
    //builders.add(builder);
    Runnable task = new Runnable() {
        @Override
        public void run() {
            MultiPartOutputStream outputStream = streams.get(streamIndex);
            for (int lineNum = 0; lineNum < 1000000; lineNum++) {
                //String line = String.format("Stream %d, line %d\n", streamIndex, lineNum);
                try {
                    outputStream.write(streamBytes);
                } catch (IOException e) {
                    e.printStackTrace();
                }
                //builder.append(line);
            }
            outputStream.close();
        }
    };
    pool.submit(task);
}
pool.shutdown();
pool.awaitTermination(5, TimeUnit.SECONDS);
manager.complete();

When I tried with a 1 GB file, memory utilization jumped to ~1 GB. The process did not complete; I killed the server after waiting for 10 minutes.

alexmojaki commented 5 years ago

First result of googling "java copy from inputstream to outputstream": https://stackoverflow.com/a/39440936/2482744

In Java 9:

input.transferTo(output);
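Applied to the servlet example above, that copy would look roughly like this (Java 9+; manager and item refer to the earlier snippets):

MultiPartOutputStream out = manager.getMultiPartOutputStreams().get(0);
try (InputStream in = item.openStream()) {
    in.transferTo(out);   // copies in small chunks instead of buffering the whole upload
}
out.close();
manager.complete();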
iceagebuck commented 5 years ago

We are using JDK 8. However, in this case the write method just accepts a byte[].

I really want to use this library, but I need a way to copy an input stream to the output stream without loading it all into memory.

It would be great if you can share sample code on the same.

alexmojaki commented 5 years ago

Here's the source of transferTo, you can use the idea: https://github.com/netroby/jdk9-dev/blob/master/jdk/src/java.base/share/classes/java/io/InputStream.java#L518

Basically you read a few bytes from the input stream and write them to the output stream, and repeat.
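On JDK 8 the same idea is a plain copy loop. A sketch for the servlet case above (manager and item come from the earlier snippets; the 8 KB buffer size is arbitrary):

MultiPartOutputStream out = manager.getMultiPartOutputStreams().get(0);
try (InputStream in = item.openStream()) {
    byte[] buffer = new byte[8192];
    int bytesRead;
    while ((bytesRead = in.read(buffer)) != -1) {
        out.write(buffer, 0, bytesRead);   // only one small buffer is held in memory at a time
    }
}
out.close();
manager.complete();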

iceagebuck commented 5 years ago

Thanks for the quick reply!

It looks like it's working now. Here are my observations:

  1. JProfiler shows a blocked thread, and used memory is always above 500 MB.
  2. This library took almost 50% more time to upload a file (200 MB) to AWS S3 compared to the AWS SDK TransferManager with the same setup.

Configuration: numStreams=1, numUploadThreads=10, queueCapacity=2, partSize=20

Any way to optimize this?

alexmojaki commented 5 years ago

I suspect that the 500MB memory usage is just the JVM preallocating that much memory for the heap, see #2.

If your data is already in a file on disk then there is no point in using this library, use the AWS SDK. This library is for avoiding the file system and keeping everything in memory, which is only sometimes useful.

If the thread that writes to the stream is getting blocked, that means it's producing data faster than it's being uploaded. Try increasing the number of upload threads.

The upload threads are bound to get blocked at the beginning while they wait for initial data. With your current configuration uploading a 200MB (10 * 20MB) file, the only way all the threads will be used is if they all upload exactly one part, which is likely not to happen. If you upload something bigger and the writing thread writes data fast enough, you're more likely to see all the threads get used.

Unless you're uploading something bigger than 50GB, I think you can leave the part size at the default of 5MB. Then the threads can more quickly pick up parts and start uploading them so they'll block less. It'll also reduce memory usage if that's actually a problem.
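Concretely, something along these lines (the numbers are illustrative, not prescriptive):

// Leaving partSize at the default of 5MB; raise numUploadThreads if the
// writing thread still blocks waiting for uploads to catch up.
StreamTransferManager manager = new StreamTransferManager(bucketName, keyName, s3Client)
        .numStreams(1)
        .numUploadThreads(10)
        .queueCapacity(2);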

nditur commented 5 years ago

I'm trying to use this nice library. From time to time the file I'm uploading to S3 is empty. Is there a plan to support such a case?

alexmojaki commented 5 years ago

I've opened #21 for empty files. There's no reason to have more discussion in this issue.

alexmojaki commented 5 years ago

@nditur empty files are now supported in version 2.1.0. If you were overriding any of the customise*Request methods, you may want to override customisePutEmptyObjectRequest.
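For example, roughly like this (a sketch only; the parameter type is assumed by analogy with the other customise*Request methods, so check the 2.1.0 source before relying on it):

StreamTransferManager manager = new StreamTransferManager(bucketName, keyName, s3Client) {
    @Override
    public void customisePutEmptyObjectRequest(PutObjectRequest request) {
        // applied to the PutObject call used when no data was written at all;
        // assumed signature, verify against the 2.1.0 source
        request.setStorageClass(StorageClass.ReducedRedundancy);
    }
};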