MAAP-Project / Community

Issue for MAAP (Zenhub)

DAAC direct Bucket Access #793

Open wildintellect opened 1 year ago

wildintellect commented 1 year ago

Is your feature request related to a problem? Please describe.
Not all DAACs have Federated Tokens working with S3 temporary credential endpoints and with managing temporary sessions. As part of the next generation of solutions from EOSDIS, we, along with VEDA, are piloting direct bucket access within the same AWS region as Earthdata Cloud (us-west-2).

Several DAACs have granted us read access:

Describe the solution you'd like
Users of the ADE and DPS need a way to make use of these credentials. Possible options:

Describe alternatives you've considered
Users could manually code role switching themselves.

Additional context
For testing in MCP, the MAAP-ADE-K8S role was given trust to assume maap-data-reader. Switching back has not been tested yet and probably requires trust in the other direction.

The following roles have permissions:

MAAP Prod account, on MCP:
arn:aws:iam::8_7:role/maap-data-reader
arn:aws:iam::8_7:role/tiler-lambda-role
arn:aws:iam::8_7:role/maap-data-manager

MAAP Dev account, on SMCE:
arn:aws:iam::9_4:role/maap-data-reader-dev
arn:aws:iam::9_4:role/maap-data-manager-dev

To test, you currently have to run:

# Assume the maap-data-reader role and export its temporary credentials
export $(printf "AWS_ACCESS_KEY_ID=%s AWS_SECRET_ACCESS_KEY=%s AWS_SESSION_TOKEN=%s" \
  $(aws sts assume-role \
    --role-arn arn:aws:iam::884094767067:role/maap-data-reader \
    --role-session-name TestSessionName \
    --query "Credentials.[AccessKeyId,SecretAccessKey,SessionToken]" \
    --output text))

Then you can try to access the data:

aws s3 ls s3://nsidc-cumulus-prod-protected/ATLAS/ATL08/006/2023/04/16/ATL08_20230416235213_04061911_006_02.h5
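
The same exchange can be sketched in Python. This assumes boto3 is installed; `credentials_to_env` and `assume_role_env` are illustrative names, not part of maap-py.

```python
def credentials_to_env(credentials):
    """Map the Credentials block of an STS assume-role response to AWS env vars."""
    return {
        "AWS_ACCESS_KEY_ID": credentials["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": credentials["SecretAccessKey"],
        "AWS_SESSION_TOKEN": credentials["SessionToken"],
    }


def assume_role_env(role_arn, session_name="TestSessionName"):
    """Assume role_arn and return its temporary credentials as env vars."""
    import boto3  # imported lazily so the mapping above works without boto3

    response = boto3.client("sts").assume_role(
        RoleArn=role_arn, RoleSessionName=session_name
    )
    return credentials_to_env(response["Credentials"])


# Usage, with the same role ARN as the CLI command:
# import os
# os.environ.update(
#     assume_role_env("arn:aws:iam::884094767067:role/maap-data-reader")
# )
```

As with the shell version, anything that reads the standard AWS environment variables afterwards runs as the assumed role until the credentials expire.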

wildintellect commented 1 year ago

I have a working example with Python code now. It should be simple enough to supply the ARNs via the maap-py library (possibly as SSM parameters)

https://gist.github.com/wildintellect/e561eccdddee851a571004cf1fbe83b8

Funny story, GEDI L4B seems to have more open permissions, or ORNL granted permission to the ADE role too. So I switched to testing with GES DISC data.

chuckwondo commented 7 months ago

@wildintellect, here's another possible approach that works within the ADE:

aws configure --profile maap-data-reader set role_arn arn:aws:iam::884094767067:role/maap-data-reader
aws configure --profile maap-data-reader set credential_source Ec2InstanceMetadata
aws configure --profile maap-data-reader set role_session_name DAAC_Direct  # optional

Now, the AWS CLI and AWS SDK will automatically obtain the necessary credentials when using the maap-data-reader profile (or whichever profile name you choose to use above). This also means that the credentials are not only cached (under ~/.aws/cli/cache/), but also that they are automatically refreshed when they expire.
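
For reference, those `aws configure set` commands write a named profile section to the AWS config file (`~/.aws/config` by default). A minimal sketch that renders the equivalent section with Python's `configparser`, without touching any file; the `render_profile` helper is hypothetical:

```python
import configparser
import io


def render_profile(name, role_arn, session_name=None):
    """Return the text of an AWS config section for an assume-role profile."""
    config = configparser.ConfigParser()
    section = f"profile {name}"
    config[section] = {
        "role_arn": role_arn,
        # Use the instance's own role credentials as the source identity.
        "credential_source": "Ec2InstanceMetadata",
    }
    if session_name:
        config[section]["role_session_name"] = session_name
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


text = render_profile(
    "maap-data-reader",
    "arn:aws:iam::884094767067:role/maap-data-reader",
    session_name="DAAC_Direct",
)
print(text)
```

Seeing the section written out makes it easier to bake the profile into an image or a DPS job environment instead of running the commands interactively.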

For example, using the CLI:

$ AWS_PROFILE=maap-data-reader aws s3 ls s3://nsidc-cumulus-prod-protected/ATLAS/ATL08/006/2023/04/16/ATL08_20230416235213_04061911_006_03.h5
2023-08-31 12:29:38  308854414 ATL08_20230416235213_04061911_006_03.h5
2023-08-31 12:30:22    5902731 ATL08_20230416235213_04061911_006_03.h5.dmrpp

Using Python:

$ AWS_PROFILE=maap-data-reader python
>>> import boto3
>>> s3 = boto3.client("s3")
>>> response = s3.list_objects_v2(Bucket="nsidc-cumulus-prod-protected", Prefix="ATLAS/ATL08/006/2023/04/16/ATL08_20230416235213_04061911_006_03.h5")
>>> contents = response["Contents"]
>>> for item in contents: print(item["Key"])
... 
ATLAS/ATL08/006/2023/04/16/ATL08_20230416235213_04061911_006_03.h5
ATLAS/ATL08/006/2023/04/16/ATL08_20230416235213_04061911_006_03.h5.dmrpp
>>> 

wildintellect commented 7 months ago

That means everything will operate under that profile? Will this cause issues for bucket permissions inside MAAP? When running DPS jobs, will this cause problems interacting with DPS (writing outputs, etc.)? I think part of the reason to use the SSM approach in Python was that you can apply it to a context as needed without having to revert your role afterwards (unlike assuming a role in the CLI).

We should also confirm that awscli made it into all images from 3.1.4 onward.

chuckwondo commented 7 months ago

I'm not following what you mean about the SSM approach. How do you envision SSM parameters being used?

Regarding use of an AWS profile, we can scope the profile to a particular session, like so:

$ python
>>> import boto3
>>> session = boto3.Session(profile_name="maap-data-reader")
>>> s3 = session.client("s3")
>>> response = s3.list_objects_v2(Bucket="nsidc-cumulus-prod-protected", Prefix="ATLAS/ATL08/006/2023/04/16/ATL08_20230416235213_04061911_006_03.h5")
>>> contents = response["Contents"]
>>> for item in contents: print(item["Key"])
... 
ATLAS/ATL08/006/2023/04/16/ATL08_20230416235213_04061911_006_03.h5
ATLAS/ATL08/006/2023/04/16/ATL08_20230416235213_04061911_006_03.h5.dmrpp
>>> 
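
Because the profile is scoped to a session, DAAC reads and MAAP-internal operations can coexist in one process. A rough sketch, assuming an illustrative set of DAAC bucket names (only `nsidc-cumulus-prod-protected` comes from this thread) and hypothetical `profile_for_bucket` and `client_for_bucket` helpers:

```python
# Buckets that should be read through the maap-data-reader profile.
# Illustrative: only this bucket name appears in the thread.
DAAC_BUCKETS = {"nsidc-cumulus-prod-protected"}


def profile_for_bucket(bucket):
    """Return the profile to use for a bucket, or None for the default identity."""
    return "maap-data-reader" if bucket in DAAC_BUCKETS else None


def client_for_bucket(bucket, session_factory=None):
    """Build an S3 client whose credentials suit the given bucket."""
    if session_factory is None:
        import boto3  # lazy import: only needed when no factory is injected

        session_factory = boto3.Session
    # profile_name=None makes boto3 fall back to its default credential chain,
    # so access to other buckets keeps using the ADE/DPS role.
    session = session_factory(profile_name=profile_for_bucket(bucket))
    return session.client("s3")
```

Nothing here changes environment variables, so there is no role to switch back from afterwards.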

Regarding maap-py, I'm going to create an issue for enhancement, where the MAAP initializer can accept both a boto3.Session instance and a requests.Session instance. In both cases, if no session object is supplied, a default session object would be created, and the various places where requests are currently made would be updated to use the relevant session to make requests.

Thus, to leverage the idea shown above, we might then be able to do something like the following (simplified) to download granules from S3. The maap instance would pass its boto3_session through to each Result object returned from search_granules, so that Result.getData can make use of the session.

maap = MAAP("api.maap-project.org", boto3_session=boto3.Session(profile_name="maap-data-reader"))
granules = maap.search_granules(...)
granules[0].getData()

This can all be done without introducing breaking changes.

wildintellect commented 7 months ago

@chuckwondo we actually already have docs on this. Though we don't show how to pass it to maap-py. https://docs.maap-project.org/en/latest/technical_tutorials/access/direct_access.html

chuckwondo commented 7 months ago

@wildintellect, thanks for the link to the docs. The only downside to that approach is that credentials are not automatically refreshed, so long-running programs might run into errors due to expired credentials.

As part of the research I did several months ago into configuring custom boto3/botocore credentials refreshers, which I will be incorporating into maap-py, I'll work on showing how we can tweak that example in the docs such that we get automatically refreshed credentials (based on what I'll do for maap-py, but not depending on those maap-py changes for the doc example).
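
One shape such a refresher could take, sketched with botocore's `RefreshableCredentials`. This is not the maap-py implementation: the function names are hypothetical, and assigning to the private `_credentials` attribute is a commonly used workaround rather than a documented API.

```python
def make_refresher(sts_client, role_arn, session_name="DAAC_Direct"):
    """Return a callable that fetches fresh credentials in botocore's format."""

    def refresh():
        creds = sts_client.assume_role(
            RoleArn=role_arn, RoleSessionName=session_name
        )["Credentials"]
        return {
            "access_key": creds["AccessKeyId"],
            "secret_key": creds["SecretAccessKey"],
            "token": creds["SessionToken"],
            # boto3 returns a datetime; botocore wants an ISO 8601 string.
            "expiry_time": creds["Expiration"].isoformat(),
        }

    return refresh


def refreshable_session(role_arn):
    """Build a boto3 session that re-assumes role_arn as credentials near expiry."""
    import boto3
    from botocore.credentials import RefreshableCredentials
    from botocore.session import get_session

    refresh = make_refresher(boto3.client("sts"), role_arn)
    credentials = RefreshableCredentials.create_from_metadata(
        metadata=refresh(), refresh_using=refresh, method="sts-assume-role"
    )
    botocore_session = get_session()
    # Assumption: _credentials is private and may change between releases.
    botocore_session._credentials = credentials
    return boto3.Session(botocore_session=botocore_session)
```

Clients created from the returned session call `refresh` again when the current credentials are close to expiring, so long-running programs do not need to track the one-hour window themselves.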

See https://github.com/MAAP-Project/maap-py/issues/83

wildintellect commented 7 months ago

Hmm, how long does our current method work; 1 hour, 12 hours? Would be good to document.

chuckwondo commented 7 months ago

It's 1 hour. Here's a response (partially redacted):

{
  "Credentials": {
    "AccessKeyId": "***",
    "SecretAccessKey": "***",
    "SessionToken": "***",
    "Expiration": "2024-03-04T23:40:40+00:00"
  },
  "AssumedRoleUser": {
    "AssumedRoleId": "***:botocore-session-1709592040",
    "Arn": "arn:aws:sts::***:assumed-role/maap-data-reader/botocore-session-1709592040"
  },
  "ResponseMetadata": {
    "RequestId": "f86a9f44-5fd3-4771-9b2f-28cfae125d62",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amzn-requestid": "f86a9f44-5fd3-4771-9b2f-28cfae125d62",
      "content-type": "text/xml",
      "content-length": "1513",
      "date": "Mon, 04 Mar 2024 22:40:40 GMT"
    },
    "RetryAttempts": 0
  }
}

Specifically, we can see that the expiration (2024-03-04T23:40:40+00:00) is 1 hour after the request time (Mon, 04 Mar 2024 22:40:40 GMT).
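
The lifetime can be checked by subtracting the response's `date` header from `Expiration`. A small sketch using only the standard library, with both values copied from the response above:

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

# Values copied from the response above.
expiration = datetime.fromisoformat("2024-03-04T23:40:40+00:00")
requested = parsedate_to_datetime("Mon, 04 Mar 2024 22:40:40 GMT")

lifetime = expiration - requested
print(lifetime)  # 1:00:00
```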

wildintellect commented 7 months ago

I looked into this a little, we can increase the duration of these keys up to 12 hours. Would that be helpful to simply reduce the frequency of refreshes needed?

Once the role's properties are changed to allow a longer maximum session duration, passing DurationSeconds increases how long the session credentials remain valid.

# sts is a boto3 STS client; parameter_value holds the role ARN.
assumed_role_object = sts.assume_role(
    RoleArn=parameter_value,
    RoleSessionName='TutorialSession',
    DurationSeconds=43200,  # 12 hours, in seconds
)
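
A small guard around that call could catch out-of-range values before they reach STS. This is a sketch with a hypothetical helper name; STS accepts between 900 and 43200 seconds here, and still rejects any value above the role's own configured maximum.

```python
MIN_DURATION = 900    # STS minimum for AssumeRole, in seconds
MAX_DURATION = 43200  # 12 hours, the largest maximum a role can be given


def assume_role_kwargs(role_arn, session_name, duration_seconds=3600):
    """Build keyword arguments for sts.assume_role with a checked duration."""
    if not MIN_DURATION <= duration_seconds <= MAX_DURATION:
        raise ValueError(
            f"DurationSeconds must be between {MIN_DURATION} and {MAX_DURATION}"
        )
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "DurationSeconds": duration_seconds,
    }


# Usage: sts.assume_role(**assume_role_kwargs(parameter_value, "TutorialSession", 43200))
```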

chuckwondo commented 7 months ago

> I looked into this a little, we can increase the duration of these keys up to 12 hours. Would that be helpful to simply reduce the frequency of refreshes needed?

That's an option that might suffice, but I still view it as a bit of a band-aid. Ideally, we want auto-refresh to occur so we don't even care (nor worry about) how long individual creds last.

wildintellect commented 6 months ago

New task: find a way in code to apply the 12-hour limit to the policy. Then we'll open a new ticket about dealing with refreshing of tokens as needed.
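
One possible way to do this in code, assuming the limit in question is the role's maximum session duration, is IAM's UpdateRole operation. A sketch with a hypothetical helper; the caller needs permission to update the role:

```python
TWELVE_HOURS = 12 * 60 * 60  # 43200 seconds


def raise_max_session_duration(iam_client, role_name, seconds=TWELVE_HOURS):
    """Set the role's maximum session duration via IAM UpdateRole."""
    iam_client.update_role(RoleName=role_name, MaxSessionDuration=seconds)
    return seconds


# Usage (role name taken from the ARNs earlier in this issue):
# import boto3
# raise_max_session_duration(boto3.client("iam"), "maap-data-reader")
```

The same setting could equally live in whatever infrastructure-as-code definition manages the role; the boto3 call is only the most direct illustration.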

chuckwondo commented 6 months ago

See https://github.com/NASA-IMPACT/active-maap-sprint/issues/884#issuecomment-2004077098