Closed: jhkennedy closed this 10 months ago
We did the following:
While following the blog tutorial, we edited the policy attached to the BatchOperationsDestinationRoleCOPY
role to be as follows (more specific permissions):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:GetObjectAcl",
                "s3:GetObjectTagging",
                "s3:GetObjectVersionAcl",
                "s3:GetObjectVersionTagging"
            ],
            "Resource": [
                "arn:aws:s3:::opendata-hand-temporary/*",
                "arn:aws:s3:::glo-30-hand/*",
                "arn:aws:s3:::opendata-hand-temporary-inventory-report/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:PutObjectVersionAcl",
                "s3:PutObjectAcl",
                "s3:PutObjectVersionTagging",
                "s3:PutObjectTagging"
            ],
            "Resource": [
                "arn:aws:s3:::opendata-hand-temporary-inventory-report/*",
                "arn:aws:s3:::opendata-hand-temporary/*"
            ]
        }
    ]
}
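IAM rejects policy documents on small syntax slips (like a missing comma between Resource entries), so it's worth linting the document locally before attaching it. A minimal sketch, assuming the policy above is saved as policy.json and the role keeps the tutorial's name; the inline policy name is our assumption:

```shell
# Write the policy document locally (same content as above).
cat > policy.json <<'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:GetObjectVersion", "s3:GetObjectAcl",
                       "s3:GetObjectTagging", "s3:GetObjectVersionAcl",
                       "s3:GetObjectVersionTagging"],
            "Resource": [
                "arn:aws:s3:::opendata-hand-temporary/*",
                "arn:aws:s3:::glo-30-hand/*",
                "arn:aws:s3:::opendata-hand-temporary-inventory-report/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:PutObjectVersionAcl", "s3:PutObjectAcl",
                       "s3:PutObjectVersionTagging", "s3:PutObjectTagging"],
            "Resource": [
                "arn:aws:s3:::opendata-hand-temporary-inventory-report/*",
                "arn:aws:s3:::opendata-hand-temporary/*"
            ]
        }
    ]
}
EOF

# Lint it; python3 -m json.tool exits non-zero on invalid JSON.
python3 -m json.tool policy.json > /dev/null && echo "policy.json is valid JSON"

# Then push it onto the role (requires IAM permissions in the destination
# account; the --policy-name value here is made up):
# aws iam put-role-policy \
#     --role-name BatchOperationsDestinationRoleCOPY \
#     --policy-name S3BatchOperationsCopyPolicy \
#     --policy-document file://policy.json
```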
Edit: The AWS Batch Operations job will write its transfer report to the inventory report bucket, so the role needs write permission there too.
Because the above tutorial is written for creating the AWS Batch Operations job in the destination account, I reached out to AWS OpenData and received this reply:
We do not typically cover costs for S3 Batch Operations (S3BO), and since batch is a pull operation, the charges would incur on the Open Data paid account which we should not do. Check the FAQ in the Handbook, section B.4 regarding large bulk transfers and see if that will work for you instead.
B.4 suggests using the AWS CLI to copy the objects from "a computer in the same region as the S3 bucket", adjusting the multipart_chunksize and max_concurrent_requests parameters.
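Those two settings can be set from the command line rather than by hand-editing the config file; the `aws configure set` lines below are a sketch using the values we landed on. Note that the chunk size also bounds the largest object you can transfer, since S3 caps a multipart upload at 10,000 parts:

```shell
# Set the B.4 tuning knobs on the default profile (values are the ones we
# ended up using; tune for your workload):
# aws configure set default.s3.multipart_chunksize 100MB
# aws configure set default.s3.max_concurrent_requests 1011

# S3 multipart uploads are limited to 10,000 parts, so multipart_chunksize
# bounds the largest transferable object:
chunk_mb=100
max_parts=10000
echo "largest object at ${chunk_mb}MB chunks: $(( chunk_mb * max_parts / 1024 )) GB"
```

With 100 MB chunks that works out to roughly 976 GB per object, comfortably above anything in these buckets.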
However, it looks like it's possible to run the S3BO job from the source account, which I believe would keep any non-S3 "put" charges from accruing on the OpenData account.
We'll need to either rework the roles/policies/permissions we set up for S3BO so we can run in the source account, or just follow their recommendation and use the AWS CLI.
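If we do move the S3BO job into the source account, the destination bucket would also need a bucket policy granting the source-account job role permission to write into it. Everything in this fragment — the account ID, the role name, and the exact action list — is a hypothetical sketch we have not tested, not a worked-out configuration:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSourceAccountBatchCopy",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::111122223333:role/SourceAccountS3BORole"
            },
            "Action": [
                "s3:PutObject",
                "s3:PutObjectAcl",
                "s3:PutObjectTagging"
            ],
            "Resource": "arn:aws:s3:::opendata-hand-temporary/*"
        }
    ]
}
```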
We transferred everything via the AWS CLI on a laptop with these config settings:
[default]
region = us-west-2
role_arn = arn:aws:iam::879002409890:role/OrganizationAccountAccessRole
source_profile = default
s3 =
    max_concurrent_requests = 1011
    max_queue_size = 10000
    multipart_threshold = 100MB
    multipart_chunksize = 100MB
using this command and its inverse after we emptied the original bucket and reclaimed the name in the OpenData account:
aws s3 sync s3://glo-30-hand/ s3://opendata-hand-temporary/
Each transfer took ~3 minutes, but after we deleted the s3://glo-30-hand/ bucket, it took ~1 hour for the name to be released so we could reclaim it in the OpenData account, so the whole process took a couple of hours.
For ITS_LIVE, we're planning on doing the transfer on instances deployed to us-west-2, and we'll explore CLI config settings, instance sizes, and the number of concurrent instances to optimize the transfer time.
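One simple way to spread the transfer across multiple instances is to give each instance a disjoint set of key prefixes and run one `aws s3 sync` per prefix. A sketch of the partitioning — the prefixes, bucket names, and instance count are all made up, since we haven't settled on an approach yet:

```shell
# Round-robin a list of top-level key prefixes across 3 hypothetical
# instances; each instance would then run something like:
#   aws s3 sync "s3://its-live-source/${PREFIX}/" "s3://its-live-dest/${PREFIX}/"
prefixes="velocity_mosaics datacubes annual composites"
n_instances=3
i=0
for p in $prefixes; do
    echo "instance $(( i % n_instances )): $p"
    i=$(( i + 1 ))
done
```

Because `aws s3 sync` is idempotent, a prefix assigned to a failed instance can simply be re-run elsewhere.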
See this issue for a description of the work here: https://github.com/ASFHyP3/OpenData/issues/10#issue-2007381175
Importantly, this bucket is tiny compared to the ITS_LIVE bucket, so any transfer option will be just fine. We're going to use this dataset to try out what we expect to be the optimal way to transfer ITS_LIVE data.