NASA-IMPACT / veda-data


Copy Assets from MCP staging to production buckets #100

Closed · smohiudd closed this 3 months ago

smohiudd commented 5 months ago

What

To support a production instance, STAC assets that are currently in veda-data-store-staging must be copied to veda-data-store-production.

Airflow DAG to copy assets - confirm whether it's operational in dev or staging

PI Objective

Objective 4: Publish production data

Acceptance Criteria

smohiudd commented 3 months ago

Merged a PR to fix the transfer DAG: https://github.com/NASA-IMPACT/veda-data-airflow/pull/121

smohiudd commented 3 months ago

Tested the following transfer in dev MWAA:

{
    "origin_bucket": "veda-data-store-staging",
    "origin_prefix": "geoglam/",
    "filename_regex": "^(.*).tif$",
    "target_bucket": "veda-data-store",
    "collection": "geoglam",
    "cogify": "false",
    "dry_run": "false"
}

I didn't get any errors in Airflow. @anayeaye or @botanical, when you get a chance, can you check if this worked in MCP?
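For reference, here is a rough sketch of what a transfer with this config amounts to (not the actual DAG code; the bucket, prefix, and regex values come from the config above, everything else is illustrative):

```python
# Sketch: list objects under origin_prefix in origin_bucket, keep keys matching
# filename_regex, and copy the matches into target_bucket. dry_run only prints.
import re
import boto3

config = {
    "origin_bucket": "veda-data-store-staging",
    "origin_prefix": "geoglam/",
    "filename_regex": r"^(.*).tif$",
    "target_bucket": "veda-data-store",
    "dry_run": True,
}

s3 = boto3.client("s3")
pattern = re.compile(config["filename_regex"])
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(
    Bucket=config["origin_bucket"], Prefix=config["origin_prefix"]
):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not pattern.match(key):
            continue
        if config["dry_run"]:
            print(f"[dry run] would copy s3://{config['origin_bucket']}/{key}")
            continue
        s3.copy_object(
            Bucket=config["target_bucket"],
            Key=key,
            CopySource={"Bucket": config["origin_bucket"], "Key": key},
        )
```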

smohiudd commented 3 months ago

I did some testing, and the Airflow DAG can't work without the appropriate PUT permission on veda-data-store. I know that vedaDataAccessRole has PUT permissions to:

            "Resource": [
                "arn:aws:s3:::veda-data-store-staging",
                "arn:aws:s3:::veda-data-store-staging/*"
            ]
        },

But do we know if there's a similar policy in MCP for veda-data-store?
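One way to check this without trial-and-error uploads is the IAM policy simulator. A minimal sketch, assuming credentials that can call IAM in the MCP account and using a placeholder role ARN:

```python
# Simulate whether a given role is allowed to PutObject into veda-data-store.
import boto3

iam = boto3.client("iam")
response = iam.simulate_principal_policy(
    # Hypothetical ARN; substitute the actual MCP account ID and role name.
    PolicySourceArn="arn:aws:iam::123456789012:role/veda-data-store-access",
    ActionNames=["s3:PutObject"],
    ResourceArns=["arn:aws:s3:::veda-data-store/*"],
)
for result in response["EvaluationResults"]:
    print(result["EvalActionName"], result["EvalDecision"])
```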

botanical commented 3 months ago

@smohiudd how would I check the dev MWAA transfer in MCP?

I see a role in MCP called veda-data-store-access that has:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BucketPermissions",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::veda-data-store"
            ]
        },
        {
            "Sid": "ObjectPermissions",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:PutObjectVersionTagging",
                "s3:PutObjectTagging"
            ],
            "Resource": [
                "arn:aws:s3:::veda-data-store/*"
            ]
        }
    ]
}

and another role, veda-data-store-access-staging, that has:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BucketPermissions",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::veda-data-store-staging"
            ]
        },
        {
            "Sid": "ObjectPermissions",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:PutObjectVersionTagging",
                "s3:PutObjectTagging",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::veda-data-store-staging/*"
            ]
        }
    ]
}

smohiudd commented 3 months ago

@botanical the transfer I ran last night didn't work (you can check by seeing if there are files in the bucket). The DAG failed without raising an error; the handler needs some rework.

I ran another test locally today using a fixed handler, and it did work for s3://veda-data-store/geoglam/.

@anayeaye created a new role for us to use in the Airflow transfer handler that should allow PUT operations to the veda-data-store bucket. The new role is arn:aws:iam::114506680961:role/veda-data-manager.
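A minimal sketch of how the transfer handler could use that role (the session name is a placeholder; how the real handler manages credentials may differ):

```python
# Assume the veda-data-manager role, then build an S3 client with the
# temporary credentials so PUTs to veda-data-store are allowed.
import boto3

sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::114506680961:role/veda-data-manager",
    RoleSessionName="veda-data-transfer",  # placeholder session name
)["Credentials"]

s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
# This client can now copy/put objects into veda-data-store.
```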

botanical commented 3 months ago

I see 45 objects in veda-data-store/geoglam/ in MCP which were created around March 13, 2024, 10:34:43 (UTC-07:00) @smohiudd
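For reference, a quick way to repeat that count from a notebook (a sketch, assuming list access to the bucket):

```python
# Count objects under the geoglam/ prefix in the production bucket to confirm
# the transfer landed.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
count = sum(
    len(page.get("Contents", []))
    for page in paginator.paginate(Bucket="veda-data-store", Prefix="geoglam/")
)
print(f"{count} objects under s3://veda-data-store/geoglam/")
```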

smohiudd commented 3 months ago

Another PR to fix the transfer util: https://github.com/NASA-IMPACT/veda-data-airflow/pull/122

The transfer DAG is working in dev Airflow and is ready to start moving assets. To do this programmatically, the next step could be to create a script or notebook that runs the transfer DAG on each collection. The configs would be similar to the discovery items configs, with a couple of slight modifications.
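A rough sketch of such a driver script, triggering the DAG once per collection through the MWAA CLI endpoint. The environment name, DAG id, and collection list are placeholders; the conf keys mirror the example earlier in this issue:

```python
# Trigger the transfer DAG for each collection via the MWAA CLI token endpoint.
import json

import boto3
import requests

MWAA_ENV = "veda-dev-mwaa"          # placeholder environment name
TRANSFER_DAG_ID = "veda_transfer"   # placeholder DAG id
COLLECTIONS = ["geoglam"]           # illustrative collection list

mwaa = boto3.client("mwaa")
token = mwaa.create_cli_token(Name=MWAA_ENV)

for collection in COLLECTIONS:
    conf = {
        "origin_bucket": "veda-data-store-staging",
        "origin_prefix": f"{collection}/",
        "filename_regex": "^(.*).tif$",
        "target_bucket": "veda-data-store",
        "collection": collection,
        "cogify": "false",
        "dry_run": "false",
    }
    resp = requests.post(
        f"https://{token['WebServerHostname']}/aws_mwaa/cli",
        headers={
            "Authorization": f"Bearer {token['CliToken']}",
            "Content-Type": "text/plain",
        },
        data=f"dags trigger {TRANSFER_DAG_ID} --conf '{json.dumps(conf)}'",
    )
    print(collection, resp.status_code)
```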

smohiudd commented 3 months ago

I ran a transfer on Friday and it went OK. There are a few collections I need to rerun, but I would say we're most of the way there.

These collections failed because of errors or incorrect config files and need to be run again:

Also, below are special-case collections that weren't part of the batch and will require manual transfers.

These datasets will be transferred at a later time: