eclipse-edc / Technology-Aws


AWS S3 folder copy not working #384

Closed lholthof closed 2 weeks ago

lholthof commented 1 month ago

Bug Report

The folder copy feature for the AmazonS3-PUSH scenario, controlled via the objectPrefix property, throws an exception in our dataplane (running Tractus-X version 0.7.2).

Describe the Bug

The upload into the consumer bucket fails for folder copies, while single files (specified by objectName) work fine for the same bucket.

The scenario is set up as follows and can be reproduced.

Asset:

{
  "@context" : {
    "@vocab" : "https://w3id.org/edc/v0.0.1/ns/"
  },
  "@type" : "https://w3id.org/edc/v0.0.1/ns/Asset",
  "@id" : "5db242a2-ee41-4c97-ae2d-f5944b123931",
  "https://w3id.org/edc/v0.0.1/ns/properties" : {
    "name" : "DSI-Demo-Asset",
    "description" : "Asset for demonstration purposes",
    "version" : "0.6.0",
    "contenttype" : "application/json",
    "additionalDescription" : "DSI Demo Asset"
  },
  "https://w3id.org/edc/v0.0.1/ns/privateProperties" : {
    "privateKey" : "privateValue - This field is used to simulate a private property"
  },
  "https://w3id.org/edc/v0.0.1/ns/dataAddress" : {
    "@type" : "DataAddress",
    "type" : "AmazonS3",
    "name" : "DataAddressS3",
    "region" : "eu-central-1",
    "bucketName" : "dsibucket-dev-provider-001",
    "accessKeyId" : "AKIA47CRXF32HAGFF35F",
    "secretAccessKey" : "<<masked>>",
    "objectPrefix" : "testFolder"
  }
}

Transfer process:

{
    "@context": {
        "@vocab": "https://w3id.org/edc/v0.0.1/ns/"
    },
    "@type": "TransferRequest",
    "assetId": "5db242a2-ee41-4c97-ae2d-f5944b123931",
    "contractId": "388e8680-32a8-40c6-9c91-6d67401a62d8",
    "counterPartyAddress": "https://foss-edc-test.c-139b975.stage.kyma.ondemand.com/api/v1/dsp",
    "dataDestination": {
        "accessKeyId": "AKIA47CRXF32JPT5VTNI",
        "bucketName": "dsibucket-dev-consumer-001",
        "folderName": "testFolder",
        "privateKey": "privateValue - This field is used to simulate a private property",
        "region": "eu-central-1",
        "secretAccessKey": "<<masked>>",
        "type": "AmazonS3"
    },
    "privateProperties": {
        "privateKey": "privateValue - This field is used to simulate a private property"
    },
    "protocol": "dataspace-protocol-http",
    "transferType": "AmazonS3-PUSH"
}

Expected Behavior

On the consumer bucket, I would expect the contents of the provider folder testFolder to be copied into testFolder/testFolder.

Observed Behavior

The upload does not happen at all. In the dataplane logs I can see the failure listed below.

Debugging the issue in the provider side's dataplane within S3DataSink, I see that there is an individual part for the provider folder testFolder which has bytesChunk.length = 0 (see attached screenshot).

The S3DataSink code generates an uploadId and, without transferring a single chunk, tries to complete the multipart upload. This fails with the following error message:

SEVERE 2024-07-24T14:17:52.37640665 Failed to upload the testFolder/testFolder/ object: The XML you provided was not well-formed or did not validate against our published schema (Service: S3, Status Code: 400, Request ID: YGYWJBB3Z951R41R, Extended Request ID: P2KQ+nsZtCSZFZ+2OYUGSRTGefMbWP6xQ5j7LiXYVU4fS0z0kfZK2IhnDUGjz1QsfZh8WA3M5RWrdLmLP4b0kA==)
software.amazon.awssdk.services.s3.model.S3Exception: The XML you provided was not well-formed or did not validate against our published schema (Service: S3, Status Code: 400, Request ID: YGYWJBB3Z951R41R, Extended Request ID: P2KQ+nsZtCSZFZ+2OYUGSRTGefMbWP6xQ5j7LiXYVU4fS0z0kfZK2IhnDUGjz1QsfZh8WA3M5RWrdLmLP4b0kA==)
    at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleErrorResponse(AwsXmlPredicatedResponseHandler.java:156)
    at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleResponse(AwsXmlPredicatedResponseHandler.java:108)
    at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:85)
    at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:43)
    at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler$Crc32ValidationResponseHandler.handle(AwsSyncClientHandler.java:93)
    at software.amazon.awssdk.core.internal.handler.BaseClientHandler.lambda$successTransformationResponseHandler$7(BaseClientHandler.java:279)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:50)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:38)
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:72)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:55)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:39)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
    at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
    at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
    at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
    at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
    at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
    at software.amazon.awssdk.services.s3.DefaultS3Client.completeMultipartUpload(DefaultS3Client.java:731)
    at org.eclipse.edc.connector.dataplane.aws.s3.S3DataSink.transferParts(S3DataSink.java:75)
    at org.eclipse.edc.connector.dataplane.util.sink.ParallelSink.lambda$transfer$3(ParallelSink.java:82)
    at org.eclipse.edc.spi.telemetry.Telemetry.lambda$contextPropagationMiddleware$2(Telemetry.java:102)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
DEBUG 2024-07-24T14:17:52.377924963 [DataPlaneManagerImpl] DataFlow 1a3e9729-8d19-43ed-b9e3-5dd0c2917b11 is now in state FAILED

I guess this happens because completedParts does not contain a single entry, which AWS considers an invalid request ("The XML you provided was not well-formed or did not validate against our published schema (Service: S3, Status Code: 400...)"). Since the folder part always comes as the initial part, the full copy process is aborted.
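
For illustration, the failing call amounts to completing a multipart upload with no parts. A minimal standalone sketch against the AWS SDK v2 client from the stack trace (the bucket name is a placeholder, not from the actual S3DataSink code):

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CompleteMultipartUploadRequest;
import software.amazon.awssdk.services.s3.model.CompletedMultipartUpload;
import software.amazon.awssdk.services.s3.model.CreateMultipartUploadRequest;

public class EmptyMultipartComplete {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            var uploadId = s3.createMultipartUpload(CreateMultipartUploadRequest.builder()
                    .bucket("my-bucket").key("testFolder/testFolder/").build()).uploadId();

            // No parts were uploaded, so the request XML contains a
            // CompleteMultipartUpload element with no Part children; S3 rejects
            // it with the 400 "XML ... not well-formed" error shown above.
            s3.completeMultipartUpload(CompleteMultipartUploadRequest.builder()
                    .bucket("my-bucket").key("testFolder/testFolder/").uploadId(uploadId)
                    .multipartUpload(CompletedMultipartUpload.builder().build())
                    .build());
        }
    }
}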

github-actions[bot] commented 1 month ago

Thanks for your contribution :fire: We will take a look asap :rocket:

hemantxpatel commented 1 month ago

This usually happens when someone creates an empty folder/directory in S3 and then uploads files into the same folder/directory. In that case the list-objects-v2 API returns two objects: one folder object with size 0 and one file with its actual size. When S3DataSink#transferParts() tries to upload an object with size 0, it fails with the above exception.

How to reproduce via AWS CLI

  1. Upload an empty folder inside a bucket.
    aws s3api put-object --bucket dsibucket-dev-consumer-001 --key testfolder1/
  2. Upload a file into same folder.
    aws s3api put-object --bucket dsibucket-dev-consumer-001 --key testfolder1/10mb.txt --body ./10mb.txt
  3. List Objects in the folder.
    aws s3api list-objects-v2 --bucket dsibucket-dev-consumer-001 --prefix testfolder1/

    Response:

    {
        "Contents": [
            {
                "Key": "testfolder1/",
                "LastModified": "2024-07-25T06:14:42+00:00",
                "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
                "Size": 0,
                "StorageClass": "STANDARD"
            },
            {
                "Key": "testfolder1/10mb.txt",
                "LastModified": "2024-07-25T06:16:34+00:00",
                "ETag": "\"2d94c9f3cbfa5fbc410a7a8b72f8cee1\"",
                "Size": 10485772,
                "StorageClass": "STANDARD"
            }
        ],
        "RequestCharged": null
    }

If the file is uploaded directly, without creating the folder first, the folder does not appear in the list of objects.

aws s3api put-object --bucket dsibucket-dev-consumer-001 --key testfolder2/10mb.txt --body ./10mb.txt
aws s3api list-objects-v2 --bucket dsibucket-dev-consumer-001 --prefix testfolder2/

Response:

{
    "Contents": [
        {
            "Key": "testfolder2/10mb.txt",
            "LastModified": "2024-07-25T06:17:37+00:00",
            "ETag": "\"2d94c9f3cbfa5fbc410a7a8b72f8cee1\"",
            "Size": 10485772,
            "StorageClass": "STANDARD"
        }
    ],
    "RequestCharged": null
}

Proposed Solution

In the S3DataSource#openPartStream() method, filter out S3 objects based on either of the criteria below (a sketch of criterion 2 follows the list).

  1. Filter out the object if it has size 0.
  2. Filter out the object if its Key equals the prefix, i.e. skip the empty folder object that was created before the files were uploaded into it. https://github.com/eclipse-edc/Technology-Aws/blob/562c56859089e6522c396acf6011d6197760768d/extensions/data-plane/data-plane-aws-s3/src/main/java/org/eclipse/edc/connector/dataplane/aws/s3/S3DataSource.java#L65-L76
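
A minimal sketch of criterion 2, using the AWS SDK v2 model types from the stack trace (the helper name and surrounding pipeline are illustrative, not the actual S3DataSource code):

import java.util.List;

import software.amazon.awssdk.services.s3.model.S3Object;

class FolderMarkerFilter {
    // Drops the zero-byte "folder marker" object whose key equals the prefix
    // (e.g. "testfolder1/"), keeping only the real file objects as parts.
    static List<S3Object> withoutFolderMarker(List<S3Object> objects, String keyPrefix) {
        return objects.stream()
                .filter(object -> !object.key().equals(keyPrefix))
                .toList();
    }
}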

Reference

https://stackoverflow.com/a/75620490

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 14 days with no activity.

rafaelmag110 commented 4 weeks ago

Even if the multipart upload worked for a 0-byte object, the key name for the folder part is wrongly resolved as testFolder/testFolder/ by the getDestinationObjectName() function (you can see the key variable's value in the debug output).

As for the solution proposed by @hemantxpatel, I agree that filtering out the empty folder object part is a good way to avoid it being wrongly resolved by getDestinationObjectName().

However, I do not agree that we should filter out objects of size 0, as that would prevent transferring empty files, which can be a real transfer scenario.

For empty files, the AWS documentation does not explain whether an empty completedParts list is valid for completing a multipart upload. It says here that a Part cannot be invalid, but says nothing about empty lists. Nevertheless, multipart upload might not be the best approach for the empty-file case: when bytesChunk comes back empty and completedParts is also empty, the multipart upload should be aborted and a PutObject request used instead. This way we can guarantee that empty files are also a valid transfer scenario.
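
A minimal sketch of that fallback, assuming the AWS SDK v2 (the method and variable names are illustrative, not the actual S3DataSink API):

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.AbortMultipartUploadRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

class EmptyFileFallback {
    // If no part was ever uploaded (empty file), abort the multipart upload
    // and write the empty object with a plain PutObject instead.
    static void completeEmptyObject(S3Client s3, String bucket, String key, String uploadId) {
        s3.abortMultipartUpload(AbortMultipartUploadRequest.builder()
                .bucket(bucket).key(key).uploadId(uploadId).build());
        s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                RequestBody.empty());
    }
}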

Also, I quickly glanced at the tests, and it seems no case exists for an empty file transfer. Something to be improved.

@hemantxpatel Since you have the initial solution proposal, would you like to come forward and bring in a PR for this?

bmg13 commented 3 weeks ago

hey everyone :) can I be assigned to this issue, please?

paullatzelsperger commented 3 weeks ago

> hey everyone :) can I be assigned to this issue, please?

For some reason GH doesn't let me assign this to you. I used @rafaelmag110 as a stand-in, so we know someone's working on it.

[edit] assigned you

hemantxpatel commented 3 weeks ago

Hi all, @rafaelmag110 asked me to work on it, so I had already started. I verified my code by doing an S3-to-S3 transfer, and it works well.

@bmg13 Let me know if you haven't already started and I can open the PR; otherwise it's a small fix. We just need to convert the while loop into a do-while loop, so that a part is uploaded even if it has size zero. https://github.com/eclipse-edc/Technology-Aws/blob/e6e78a3cb1dbcecd29ad6b7a1ea93e6ac609f9b0/extensions/data-plane/data-plane-aws-s3/src/main/java/org/eclipse/edc/connector/dataplane/aws/s3/S3DataSink.java#L63-L73
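
To illustrate the effect of the do-while (a standalone sketch, not the actual S3DataSink code): with a plain while loop, a zero-byte object never enters the loop body, so no part is uploaded and the multipart upload is completed with an empty part list; a do-while guarantees at least one, possibly empty, part.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

class DoWhileChunking {
    // Chunk an input stream into parts, guaranteeing at least one (possibly
    // empty) part so completeMultipartUpload never gets an empty part list.
    static List<byte[]> chunk(InputStream input, int chunkSize) throws IOException {
        var parts = new ArrayList<byte[]>();
        byte[] buffer;
        do {
            buffer = input.readNBytes(chunkSize);
            if (buffer.length > 0 || parts.isEmpty()) {
                parts.add(buffer); // an empty object still yields one empty part
            }
        } while (buffer.length == chunkSize);
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // A zero-byte "file" still produces exactly one part:
        System.out.println(chunk(new ByteArrayInputStream(new byte[0]), 5 * 1024 * 1024).size()); // 1
    }
}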

rafaelmag110 commented 2 weeks ago

Thanks @hemantxpatel. I contacted you directly because in this case we wanted to move the fix along a bit faster, so we could have it in time for a downstream bugfix release. We have indeed started working on this and should have the PR ready today.

Sorry for the confusion.