alexmojaki / s3-stream-upload

Manages streaming of data to AWS S3 without knowing the size beforehand and without keeping it all in memory or writing to disk.
MIT License

StreamTransferManager got stuck when part upload fails #27

Closed · gkolakowski-ias closed this 4 years ago

gkolakowski-ias commented 4 years ago

I used StreamTransferManager with the following configuration:

new StreamTransferManager(s3Config.getBucket(), key, client)
                .numStreams(1)
                .numUploadThreads(1)
                .queueCapacity(2)
                .partSize(10);

If a part upload fails, STM may block forever putting an element into the queue. In the log below, the single upload thread dies on part 1, parts 2 and 3 fill the queue (capacity 2), and the put() for part 4 then blocks forever:

34526 [main] INFO  a.m.s3upload.StreamTransferManager - Initiated multipart upload to test-bucket/filename.gz with full ID bcf38beb-3f7c-4107-9a2b-c39211853fc0 
34528 [main] DEBUG a.m.s3upload.MultiPartOutputStream - Creating [MultipartOutputStream for parts 1 - 10000] 
34553 [main] DEBUG a.m.s.ConvertibleOutputStream - Splitting stream of size 15735064 into parts with sizes 10492184 and 5242880 
34564 [main] DEBUG a.m.s3upload.MultiPartOutputStream - Putting [Part number 1 containing 10.01 MB] on queue 
34565 [pool-1-thread-1] DEBUG a.m.s3upload.StreamTransferManager - [Manager uploading to test-bucket/filename.gz with id bcf38beb-...211853fc0]: Uploading [Part number 1 containing 10.01 MB] 
34570 [main] DEBUG a.m.s.ConvertibleOutputStream - Splitting stream of size 15730248 into parts with sizes 10487368 and 5242880 
34933 [main] DEBUG a.m.s3upload.MultiPartOutputStream - Putting [Part number 2 containing 10.00 MB] on queue 
34973 [main] DEBUG a.m.s.ConvertibleOutputStream - Splitting stream of size 15730248 into parts with sizes 10487368 and 5242880 
34976 [main] DEBUG a.m.s3upload.MultiPartOutputStream - Putting [Part number 3 containing 10.00 MB] on queue 
34992 [pool-1-thread-1] ERROR a.m.s3upload.StreamTransferManager - Aborting [Manager uploading to test-bucket/filename.gz with id bcf38beb-...211853fc0] due to error: com.amazonaws.services.s3.model.AmazonS3Exception: Some exception (Service: null; Status Code: 0; Error Code: null; Request ID: null; S3 Extended Request ID: null), S3 Extended Request ID: null 
34992 [pool-1-thread-1] DEBUG a.m.s3upload.StreamTransferManager - [Manager uploading to test-bucket/filename.gz with id bcf38beb-...211853fc0]: Aborting 
35002 [main] DEBUG a.m.s.ConvertibleOutputStream - Splitting stream of size 15730248 into parts with sizes 10487368 and 5242880 
35004 [main] DEBUG a.m.s3upload.MultiPartOutputStream - Putting [Part number 4 containing 10.00 MB] on queue 

I also encountered a similar problem with the following settings:

new StreamTransferManager(s3Config.getBucket(), key, client)
                .numStreams(1)
                .numUploadThreads(1)
                .queueCapacity(1)
                .partSize(10);

STM blocked in the close() method while putting the poison pill onto the queue:

77002 [main] INFO  a.m.s3upload.StreamTransferManager - Initiated multipart upload to test-bucket/filename.gz with full ID c891eae4-51d4-4061-a65d-70f499ff3b45 
77004 [main] DEBUG a.m.s3upload.MultiPartOutputStream - Creating [MultipartOutputStream for parts 1 - 10000] 
77043 [main] DEBUG a.m.s.ConvertibleOutputStream - Splitting stream of size 15735064 into parts with sizes 10492184 and 5242880 
77051 [main] DEBUG a.m.s3upload.MultiPartOutputStream - Putting [Part number 1 containing 10.01 MB] on queue 
77053 [pool-1-thread-1] DEBUG a.m.s3upload.StreamTransferManager - [Manager uploading to test-bucket/filename.gz with id c891eae4-...499ff3b45]: Uploading [Part number 1 containing 10.01 MB] 
77056 [main] INFO  a.m.s3upload.MultiPartOutputStream - Called close() on [MultipartOutputStream for parts 1 - 10000] 
77056 [main] DEBUG a.m.s3upload.MultiPartOutputStream - Putting [Part number 2 containing 9.07 MB] on queue 
77057 [main] DEBUG a.m.s3upload.MultiPartOutputStream - Placing poison pill on queue for [MultipartOutputStream for parts 1 - 10000] 
77145 [pool-1-thread-1] ERROR a.m.s3upload.StreamTransferManager - Aborting [Manager uploading to test-bucket/filename.gz with id c891eae4-...499ff3b45] due to error: com.amazonaws.services.s3.model.AmazonS3Exception: Some exception (Service: null; Status Code: 0; Error Code: null; Request ID: null; S3 Extended Request ID: null), S3 Extended Request ID: null 
77145 [pool-1-thread-1] DEBUG a.m.s3upload.StreamTransferManager - [Manager uploading to test-bucket/filename.gz with id c891eae4-...499ff3b45]: Aborting 

I recreated the issue by adding:

        if (part.getPartNumber() == 1) {  // part number is not crucial to recreate the issue.
            throw new AmazonS3Exception("Some exception");
        }

in alex/mojaki/s3upload/StreamTransferManager.java:552
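
For context, the hang itself needs nothing from the library; this standalone sketch (not the library's code) shows the same pattern, where the consumer dies and a later put() on the bounded queue parks forever:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class StuckProducerDemo {
    public static void main(String[] args) throws InterruptedException {
        // Capacity 2, like queueCapacity(2) above.
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(2);

        // A single consumer that fails on its first element,
        // like the upload thread failing on part 1.
        Thread consumer = new Thread(() -> {
            try {
                queue.take();
                throw new RuntimeException("Some exception");
            } catch (InterruptedException ignored) {
            }
        });
        consumer.start();

        // The producer can place at most three elements (one consumed,
        // two buffered); the fourth put() parks forever.
        for (int part = 1; part <= 4; part++) {
            System.out.println("Putting part " + part + " on queue");
            queue.put(part);
        }
    }
}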

gkolakowski-ias commented 4 years ago

This part of the jstack output might be useful as well:

"main" #1 prio=5 os_prio=31 tid=0x00007f9296809000 nid=0xf03 waiting on condition [0x00007000074c8000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000078d0bb158> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:353)
        at alex.mojaki.s3upload.MultiPartOutputStream.putCurrentStream(MultiPartOutputStream.java:129)
        at alex.mojaki.s3upload.MultiPartOutputStream.checkSize(MultiPartOutputStream.java:110)
        at alex.mojaki.s3upload.MultiPartOutputStream.write(MultiPartOutputStream.java:145)

alexmojaki commented 4 years ago

I followed your steps but I can't reproduce this. I just get an exception and a traceback as expected. Can you make a branch that reproduces the problem when running tests? Or does it only happen with real S3? Does it happen consistently or only here and there?

Make sure your version is up to date; the line numbers in your jstack don't match the code in master.

gkolakowski-ias commented 4 years ago

I guess the reason for the problem is that the data producer does not run in a separate thread. I had something like this:

final StreamTransferManager manager = getStreamTransferManager(outputKey);
try (final FileInputStream fis = new FileInputStream(inputPath.toFile());
     final MultiPartOutputStream mpos = manager.getMultiPartOutputStreams().get(0)) {
    IOUtils.copy(fis, mpos);
}

When I moved the copying to a Runnable submitted to an ExecutorService, it worked fine.
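
Roughly, the working version looks like this sketch (same helper names as in the snippet above; complete() and abort(e) follow the library README's usage pattern):

// Run the copy on a separate thread so the main thread is never the one
// parked on the part queue; the fail path aborts the upload.
final StreamTransferManager manager = getStreamTransferManager(outputKey);
final ExecutorService executor = Executors.newSingleThreadExecutor();
final Future<Void> producer = executor.submit(() -> {
    try (final FileInputStream fis = new FileInputStream(inputPath.toFile());
         final MultiPartOutputStream mpos = manager.getMultiPartOutputStreams().get(0)) {
        IOUtils.copy(fis, mpos);
    }
    return null;
});
try {
    producer.get();       // rethrows anything thrown during the copy
    manager.complete();
} catch (Exception e) {
    manager.abort(e);     // don't leave the multipart upload dangling
    throw e;
} finally {
    executor.shutdownNow();
}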

> Make sure your version is up to date; the line numbers in your jstack don't match the code in master.

I used the latest version. The code lines do not match since I modified the code a bit; that is, I added the exception in uploadStreamPart().

alexmojaki commented 4 years ago

Here is a branch in which I have tried to follow your steps: https://github.com/alexmojaki/s3-stream-upload/compare/stuck-test?expand=1

I put the exception where you said. No matter what else I do, it just results in an ugly traceback. It never gets stuck. It's all running in one thread.

uploadStreamPart is in StreamTransferManager.java. I'm referring to this in your stack:

    at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:353)
    at alex.mojaki.s3upload.MultiPartOutputStream.putCurrentStream(MultiPartOutputStream.java:129)

which is off by one because queue.put is called on line 128.

gkolakowski-ias commented 4 years ago

@alexmojaki Not enough parts are created in the test. testTransferManager(1000000); generates exactly 2 parts, which equals the queue size. Please use testTransferManager(10000000); (×10, i.e. 20 parts are generated) and it gets blocked.

"main" #1 prio=5 os_prio=31 tid=0x00007fb7e4808800 nid=0x1003 waiting on condition [0x00007000043c9000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000006c07e3cc8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:353)
        at alex.mojaki.s3upload.MultiPartOutputStream.putCurrentStream(MultiPartOutputStream.java:128)
        at alex.mojaki.s3upload.MultiPartOutputStream.checkSize(MultiPartOutputStream.java:110)
        at alex.mojaki.s3upload.MultiPartOutputStream.write(MultiPartOutputStream.java:143)
        at alex.mojaki.s3upload.MultiPartOutputStream.write(MultiPartOutputStream.java:148)
        at alex.mojaki.s3upload.test.StreamTransferManagerTest.testTransferManager(StreamTransferManagerTest.java:197)
        at alex.mojaki.s3upload.test.StreamTransferManagerTest.testTransferManager(StreamTransferManagerTest.java:156)

alexmojaki commented 4 years ago

Thanks, that did the trick.

I've updated the branch - please take a look and tell me if it seems like a reasonable solution. Basically the queue is 'closed' when the upload is aborted, to prevent producers from trying to write to the stream. My Java is fading away and my concurrency has never been very good, so I might be missing something obvious.
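
In outline, the idea is something like this hypothetical sketch (the names here are made up; the real change is on the branch):

// 'Closing' the queue on abort: a volatile flag, set by the abort path
// and checked by producers, so new puts fail fast instead of blocking.
// (It does not wake a producer already parked inside put(), which is
// the gap pointed out below.)
private final BlockingQueue<StreamPart> queue = new ArrayBlockingQueue<>(queueCapacity);
private volatile boolean queueClosed = false;

void abortUpload() {
    queueClosed = true;   // an upload thread failed; stop accepting parts
    // ... abort the multipart upload on S3 ...
}

void putPart(StreamPart part) throws InterruptedException {
    if (queueClosed) {
        throw new IllegalStateException("Upload aborted, not accepting more parts");
    }
    queue.put(part);      // can still park if abort happens after the check
}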

gkolakowski-ias commented 4 years ago

What if a producer is much faster than a consumer? The producer may already be blocked on queue.put() when the consumer throws AmazonS3Exception.

I think the producer needs to periodically call queue.offer() with a small but reasonable timeout until offer() succeeds or the closed flag is true, as in the sketch below. WDYT?
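
A rough sketch of that loop (hypothetical names, not an actual patch):

// Producer side with bounded waits: retry offer() with a short timeout
// and re-check the closed flag between attempts, so the producer can
// never park forever after an abort.
void putPart(StreamPart part) throws InterruptedException {
    while (!queue.offer(part, 1, TimeUnit.SECONDS)) {
        if (queueClosed) {
            throw new IllegalStateException("Upload aborted while waiting to enqueue part");
        }
    }
}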

alexmojaki commented 4 years ago

Great idea. I've opened a proper PR in #28; take a look.

alexmojaki commented 4 years ago

Fix released in 2.2.1; upgrade once it becomes available.