ASFHyP3 / OpenData

Supporting our datasets available via AWS OpenData

Preparing & uploading HAND data #9

Closed jhkennedy closed 10 months ago

jhkennedy commented 10 months ago

See this issue for a description of the work here: https://github.com/ASFHyP3/OpenData/issues/10#issue-2007381175

Importantly, this bucket is tiny compared to the ITS_LIVE bucket, so any transfer option will be just fine. We're going to use this dataset to try out what we expect to be the optimal way to transfer ITS_LIVE data.

jtherrmann commented 10 months ago

We did the following:

  1. Followed Section 5 (p. 9) of https://assets.opendata.aws/aws-onboarding-handbook-for-data-providers-en-US.pdf to create the temporary bucket in the destination account
  2. Started following Option 1 from https://aws.amazon.com/blogs/storage/cross-account-bulk-transfer-of-files-using-amazon-s3-batch-operations/ and stopped at "After the inventory report is available, create an Amazon S3 Batch Operations [...]" while we wait for the inventory report to be created (the inventory configuration is sketched below).
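
For reference, setting up the inventory report on the source bucket with the AWS CLI looks roughly like this (a sketch only; the inventory ID, the destination account ID placeholder, and the CSV format/daily schedule are assumptions, and the blog post walks through the console equivalent):

# configure a daily S3 Inventory report of the source bucket,
# delivered to the inventory-report bucket referenced in the policy below
aws s3api put-bucket-inventory-configuration \
    --bucket glo-30-hand \
    --id hand-transfer-inventory \
    --inventory-configuration '{
        "Id": "hand-transfer-inventory",
        "IsEnabled": true,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        "Destination": {
            "S3BucketDestination": {
                "AccountId": "<destination-account-id>",
                "Bucket": "arn:aws:s3:::opendata-hand-temporary-inventory-report",
                "Format": "CSV"
            }
        }
    }'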

While following the blog tutorial, we edited the policy attached to the BatchOperationsDestinationRoleCOPY role to use more specific permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:GetObjectAcl",
                "s3:GetObjectTagging",
                "s3:GetObjectVersionAcl",
                "s3:GetObjectVersionTagging"
            ],
            "Resource": [
                "arn:aws:s3:::opendata-hand-temporary/*",
                "arn:aws:s3:::glo-30-hand/*",
                "arn:aws:s3:::opendata-hand-temporary-inventory-report/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:PutObjectVersionAcl",
                "s3:PutObjectAcl",
                "s3:PutObjectVersionTagging",
                "s3:PutObjectTagging"
            ],
            "Resource": [
                "arn:aws:s3:::opendata-hand-temporary-inventory-report/*",
                "arn:aws:s3:::opendata-hand-temporary/*"
            ]
        }
    ]
}

Edit: The AWS Batch Operations job will write its transfer report to the inventory report bucket, so the role needs write permission there too.
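
For reference, re-applying an edited policy document to the role can be done with something like this (a sketch; it assumes the policy is an inline policy on the role, and the policy name and file name are placeholders rather than what we actually used):

# attach the edited policy document to the Batch Operations role as an inline policy
aws iam put-role-policy \
    --role-name BatchOperationsDestinationRoleCOPY \
    --policy-name <policy-name> \
    --policy-document file://batch-operations-policy.json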

jhkennedy commented 10 months ago

Because the above tutorial is written for creating the AWS Batch Operations job in the destination account, I reached out to AWS OpenData and received this reply:

We do not typically cover costs for S3 Batch Operations (S3BO), and since batch is a pull operation, the charges would incur on the Open Data paid account which we should not do. Check the FAQ in the Handbook, section B.4 regarding large bulk transfers and see if that will work for you instead.

B.4 suggests using the AWS CLI to copy the objects from "a computer in the same region as the S3 bucket", and adjusting the multipart_chunksize and max_concurrent_requests parameters.

However, it looks like it's possible to run the S3BO job from the source account, which I believe would keep any charges other than S3 PUT requests from being incurred on the OpenData account side.

We'll need to either re-work the roles/policies/permissions we set up for S3BO so we can run the job in the source account, or just go ahead and follow their recommendation and use the AWS CLI.
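
If we do re-work things to run S3BO from the source account, creating the copy job with the AWS CLI would look roughly like this (a sketch only; the account ID, role name, manifest key, and ETag are placeholders/assumptions, not values we've set up):

# create an S3 Batch Operations copy job in the source account,
# driven by the inventory manifest and reporting to the inventory-report bucket
aws s3control create-job \
    --account-id <source-account-id> \
    --role-arn arn:aws:iam::<source-account-id>:role/<batch-operations-role> \
    --priority 10 \
    --operation '{"S3PutObjectCopy": {"TargetResource": "arn:aws:s3:::opendata-hand-temporary"}}' \
    --manifest '{
        "Spec": {"Format": "S3InventoryReport_CSV_20161130"},
        "Location": {
            "ObjectArn": "arn:aws:s3:::opendata-hand-temporary-inventory-report/<path-to-manifest.json>",
            "ETag": "<manifest-etag>"
        }
    }' \
    --report '{
        "Bucket": "arn:aws:s3:::opendata-hand-temporary-inventory-report",
        "Format": "Report_CSV_20180820",
        "Enabled": true,
        "ReportScope": "AllTasks"
    }' \
    --no-confirmation-required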

jhkennedy commented 10 months ago

We transferred everything via the AWS CLI on a laptop with these config settings:

[default]
region = us-west-2
role_arn = arn:aws:iam::879002409890:role/OrganizationAccountAccessRole
source_profile = default
s3 =
  max_concurrent_requests = 1011
  max_queue_size = 10000
  multipart_threshold = 100MB
  multipart_chunksize = 100MB

using this command and its inverse after we emptied the original bucket and reclaimed the name in the OpenData account:

aws s3 sync s3://glo-30-hand/ s3://opendata-hand-temporary/
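
For anyone repeating this, a couple of cheap sanity checks (illustrative only; not necessarily the exact commands we ran):

# preview what would be copied without transferring anything
aws s3 sync s3://glo-30-hand/ s3://opendata-hand-temporary/ --dryrun

# compare total object counts and sizes between the two buckets
aws s3 ls s3://glo-30-hand/ --recursive --summarize | tail -2
aws s3 ls s3://opendata-hand-temporary/ --recursive --summarize | tail -2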

Each transfer took ~3 minutes, but after deleting s3://glo-30-hand/ it took ~1 hour for the bucket name to be released so we could reclaim it in the OpenData account; overall, the process took a couple of hours.

For ITS_LIVE, we're planning to do the transfer on instances deployed in us-west-2, and we'll explore CLI config settings, instance sizes, and the number of concurrent instances to optimize the transfer time.
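
One likely starting point is splitting the sync by prefix so each instance handles a disjoint chunk of the bucket, roughly like this (a sketch; the bucket names and prefixes are placeholders, not the actual ITS_LIVE layout):

# each instance syncs a disjoint prefix so the copies run in parallel
aws s3 sync s3://<its-live-source>/<prefix-a>/ s3://<its-live-destination>/<prefix-a>/   # instance 1
aws s3 sync s3://<its-live-source>/<prefix-b>/ s3://<its-live-destination>/<prefix-b>/   # instance 2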