Closed: cesar-revert closed this issue 7 years ago
Could you provide a simplified implementation with the full source that we could use to reproduce the problem? It's tricky to reproduce without the full source (e.g. what is the implementation of getObjectInputStream)?
Here's a working implementation with 2 objects:
String bucketName = "foo" + UUID.randomUUID().toString().toLowerCase();
s3.createBucket(bucketName);
try {
    s3.putObject(bucketName, "file1", "Text");
    s3.putObject(bucketName, "file2", "Text");

    S3Object file1 = s3.getObject(bucketName, "file1");
    S3Object file2 = s3.getObject(bucketName, "file2");

    try (InputStream file3Content = new SequenceInputStream(file1.getObjectContent(), file2.getObjectContent())) {
        ObjectMetadata file3Metadata = new ObjectMetadata();
        file3Metadata.setContentLength(file1.getObjectMetadata().getContentLength() +
                                       file2.getObjectMetadata().getContentLength());
        s3.putObject(bucketName, "file3", file3Content, file3Metadata);
    }

    IOUtils.copy(s3.getObject(bucketName, "file3").getObjectContent(), System.out);
} finally {
    deleteBucketAndAllContents(bucketName);
}
Example with an arbitrary number of files (lazy load the input streams so that they don't get closed before we're done):
String bucketName = "foo" + UUID.randomUUID().toString().toLowerCase();
s3.createBucket(bucketName);
try {
    // Create test data
    s3.putObject(bucketName, "file1", "Text");
    s3.putObject(bucketName, "file2", "Text");
    s3.putObject(bucketName, "file3", "Text");

    // Create lazy Enumeration<InputStream>
    List<String> filesToConcatenate = Arrays.asList("file1", "file2", "file3");
    long newFileSize = filesToConcatenate.stream()
            .map(file -> s3.getObjectMetadata(bucketName, file).getContentLength())
            .reduce(0L, (l, r) -> l + r);
    Enumeration<InputStream> fileStreams = new Enumeration<InputStream>() {
        private Iterator<String> files = filesToConcatenate.iterator();

        @Override
        public boolean hasMoreElements() {
            return files.hasNext();
        }

        @Override
        public InputStream nextElement() {
            String file = files.next();
            return s3.getObject(bucketName, file).getObjectContent();
        }
    };

    // Concatenate data
    try (InputStream concatenatedFileContent = new SequenceInputStream(fileStreams)) {
        ObjectMetadata concatenatedFileMetadata = new ObjectMetadata();
        concatenatedFileMetadata.setContentLength(newFileSize);
        s3.putObject(bucketName, "result", concatenatedFileContent, concatenatedFileMetadata);
    }

    // Print result
    IOUtils.copy(s3.getObject(bucketName, "result").getObjectContent(), System.out);
} finally {
    deleteBucketAndAllContents(bucketName);
}
Hello millems,
Thank you very much for your suggestion. We adapted our code to use the lazy loading, and it's working fine. For us, this issue is solved and can be closed.
Thanks for the update! Glad to hear you got it working.
Yes, it's working. But the fact that it also worked without the lazy loading when the contentLength was not provided, and then failed when the contentLength was specified, suggests that there could be a bug in the S3 client when using a SequenceInputStream.
Sure, the lazy loading is an excellent solution (even more so when dealing with big files). But if there is no issue with the streams in the SequenceInputStream, I believe the lazy loading shouldn't be a mandatory workaround to make it work. If it works without the lazy loading when the contentLength is omitted, it should also work when the contentLength is provided. In this case, it's just the contentLength check in the S3 client that's failing.
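Concretely, the variant that fails on our side looks roughly like this (a sketch only; it assumes an existing AmazonS3 s3Client, a target bucketName and destKey, a Vector<InputStream> inputs already filled with the opened object streams, and contentLength equal to the summed sizes of those objects):

    // Sketch of the failing variant: eagerly opened streams plus an explicit content length.
    SequenceInputStream sis = new SequenceInputStream(inputs.elements());

    ObjectMetadata metadata = new ObjectMetadata();
    metadata.setContentLength(contentLength);   // declared length of all streams together
    // With the length declared, only the first underlying stream gets uploaded and
    // the client complains that not all of the declared data was read.
    s3Client.putObject(new PutObjectRequest(bucketName, destKey, sis, metadata));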
I'm not able to reproduce this issue, even without lazy loading (for a small number of small files; it's likely to fail for large files or many files). Do you have a self-contained reproduction case for it?
I'm getting a bug using the Java SDK AmazonS3 client, and maybe you can provide a fix. This is what happens:
Every minute I generate several CSV files on an S3 bucket (in order to send data to a reporting data warehouse). After several minutes, I need to concatenate these files into a bigger one and place it in a shared folder in the bucket, where it then gets loaded by the data warehouse. To perform the concatenation I don't want to download the files, concatenate them externally, and upload them again, because that would waste a lot of resources and bandwidth. So I use S3ObjectInputStreams instead.
On my first approach, I tried the multipart upload (using each single file as a part), but that doesn't work when the parts are too small. The AWS SDK documentation clearly states that parts have to be over 5 MB in size to be able to use multipart upload. So this approach is discarded in this case (a pity, because it could be as easy as that). So instead I use a SequenceInputStream to combine the files, and try to provide the contentLength using the following code (with Java SDK 1.11.164 to 1.11.172):
1  public boolean concatenateFiles(String bucketName, String preffix, String destKey, boolean deleteOrig) {
2
3      ObjectListing objectListing = s3Client.listObjects(bucketName, preffix);
4
5      Vector<InputStream> inputs = new Vector<>();
6      ArrayList<String> objects = new ArrayList<>();
7
8      long contentLength = 0;
9
10     while (true) {
11         for (Iterator<?> iterator =
12                  objectListing.getObjectSummaries().iterator();
13              iterator.hasNext();) {
14             S3ObjectSummary summary = (S3ObjectSummary) iterator.next();
15             inputs.add(getObjectInputStream(bucketName, summary.getKey()));
16             objects.add(summary.getKey());
17             contentLength += summary.getSize();
18             System.out.println("adding " + summary.getKey());
19         }
20         // more object listings to retrieve?
21         if (objectListing.isTruncated()) {
22             objectListing = s3Client.listNextBatchOfObjects(objectListing);
23         } else {
24             break;
25         }
26     }
27
28     Enumeration<InputStream> enu = inputs.elements();
29     SequenceInputStream sis = new SequenceInputStream(enu);
30
31     ObjectMetadata metadata = new ObjectMetadata();
32     metadata.setContentLength(contentLength);
33     s3Client.putObject(new PutObjectRequest(bucketName, destKey, sis, metadata));
34
35     if (deleteOrig) {
36         for (String object : objects) {
37             s3Client.deleteObject(bucketName, object);
38         }
39     }
40
41     return true;
42 }
The fact is this code doesn't work properly... because of the contentLength! When I execute the code above, it only copies the first file (the first InputStream) in the SequenceInputStream and then raises an error saying that the request may not have been executed successfully and that some data may not have been read.
If I remove lines #8, #17 and #32 and don't provide the contentLength, the code works fine (a new S3 object is correctly created with the concatenated content of all the others), but it raises a warning similar to "No content length specified for stream data. Stream contents will be buffered in memory and could result in out of memory errors." Of course I want to avoid memory errors in case the files are too big. But the fact is that, by removing the contentLength, the S3 client seems to be able to do the job efficiently and reliably. This approach is currently working fine in our production environment, with the only issue that I cannot provide a contentLength, thus forcing the client to buffer all the file contents in memory... which could result in out of memory errors when the files grow in size.
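For clarity, the working variant is essentially the same upload with the length left out (a sketch reusing the same s3Client, bucketName, destKey and sis from the code above):

    // Same upload as line #33 above, but with lines #8, #17 and #32 removed, so no
    // setContentLength() call is made. The upload succeeds, at the cost of the SDK
    // buffering the whole concatenated stream in memory and logging the warning quoted above.
    ObjectMetadata metadata = new ObjectMetadata();
    s3Client.putObject(new PutObjectRequest(bucketName, destKey, sis, metadata));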
To sum it up, even though I'm able to provide the contentLength (which is the desired behaviour for the AmazonS3 client), if I then try to use a SequenceInputStream the client won't work correctly, reading only the first InputStream and ignoring all the rest. I believe this issue arises because every time a single InputStream in the SequenceInputStream is read, the AmazonS3 client assumes that the operation is completed and checks the metadata contentLength against the data read. In this case, obviously, when only the first InputStream has been read they won't match, as there are other InputStreams in the sequence.
Probably you could fix that by just checking whether the InputStream in the PutObjectRequest (line #33) is an instance of SequenceInputStream, and in that case check the contentLength only once all InputStreams in the sequence have been read.
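Illustratively, the kind of check I mean is something like this (a conceptual sketch only, not the SDK's actual validation code, using a hypothetical copyAndValidate helper): keep counting bytes until the stream as a whole reports end-of-stream, and only then compare against the declared length.

    // Conceptual sketch only (not the SDK's real code). SequenceInputStream.read()
    // returns -1 only after every underlying stream has been exhausted, so checking
    // the length here would not trip on the boundary between the first and second stream.
    static void copyAndValidate(InputStream in, OutputStream dest, long declaredLength) throws IOException {
        byte[] buffer = new byte[8 * 1024];
        long totalRead = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            dest.write(buffer, 0, n);
            totalRead += n;
        }
        if (totalRead != declaredLength) {
            throw new IOException("Read " + totalRead + " bytes but the declared Content-Length was " + declaredLength);
        }
    }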
In any case, if you want to suggest a better way to concatenate files on S3, that would be fine too.
Thank you very much for your time.