NASA-IMPACT / veda-backend

Backend services for VEDA

Add LPDAAC S3 credential rotation dynamic tiler lambda (for HLS) #25

Open abarciauskas-bgse opened 2 years ago

abarciauskas-bgse commented 2 years ago

The dynamic tiler may be requesting data from Earthdata Cloud buckets, such as the HLS data provided by LP DAAC. The tiler needs some sort of credentials to request those files. This could be done by storing URS credentials in a .netrc file, but @sharkinsspatial has created EDL credential rotation for direct S3 access, which should be faster than authenticating through URS for each request: https://github.com/NASA-IMPACT/edl-credential-rotation. We should probably re-use this approach in our backend API.
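For context, the core step of that rotation is fetching short-lived S3 credentials from a DAAC's EDL-protected s3credentials endpoint and handing them to the tiler. A minimal sketch, assuming such an endpoint exists for the provider (the URL and response shape here are illustrative, not taken from the edl-credential-rotation repo):

```python
import json
from urllib import request

# Illustrative endpoint; each DAAC hosting data in Earthdata Cloud
# exposes its own s3credentials URL behind EDL.
EDL_S3_CREDENTIALS_URL = "https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials"

def fetch_temporary_credentials(opener=request.urlopen):
    """Fetch short-lived S3 credentials from the provider's EDL-protected
    s3credentials endpoint. The opener is assumed to already carry an
    authenticated EDL session; a rotation lambda would then write the
    returned keys onto the tiler lambda's environment."""
    with opener(EDL_S3_CREDENTIALS_URL) as resp:
        return json.load(resp)
```

A scheduled rotation lambda would call this on a timer (the credentials expire after roughly an hour) and push the result into the tiler's configuration.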

anayeaye commented 2 years ago

I deployed a separate edl-credential-rotation stack for delta-backend-dev. A small change is needed in the delta-backend-dev raster-api handler to use the EDL AWS session credentials from the Lambda environment.

With these changes the raster-api is able to pick up and use the credentials; however, the API is currently deployed in an isolated subnet, which prevents us from accessing the external S3 files. Unfortunately, the CDK VPC configuration is blocking the deployment of private-with-NAT lambdas due to poor CIDR block planning. This change management plan includes steps to resolve the VPC issue.

feature/edl-4-rasterapi contains the lambda changes as well as some minor changes to GDAL environment variables.
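The GDAL environment variable changes mentioned are typically tuning options for reading COGs over /vsis3/. A sketch of common settings (these are typical choices for this kind of deployment, not necessarily the exact values in the feature branch):

```python
def gdal_env_for_s3():
    """Common GDAL configuration for efficient COG reads from S3.
    Values here are illustrative tuning choices, not the exact ones
    in feature/edl-4-rasterapi."""
    return {
        # Skip directory listing on open; saves a round trip per file.
        "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
        # Only attempt HTTP range reads against raster extensions.
        "CPL_VSIL_CURL_ALLOWED_EXTENSIONS": ".tif,.TIF,.tiff",
        # Merge adjacent range requests into one HTTP request.
        "GDAL_HTTP_MERGE_CONSECUTIVE_RANGES": "YES",
        # Enable the VSI block cache for repeated tile reads.
        "VSI_CACHE": "TRUE",
    }
```

In a Lambda handler these would be exported into `os.environ` (or passed to `rasterio.Env`) before any dataset is opened.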

sharkinsspatial commented 2 years ago

@anayeaye @vincentsarago is helping investigate some GDAL optimizations for our use cases that will most likely affect https://github.com/NASA-IMPACT/delta-backend/compare/feature/edl-4-rasterapi#diff-08a35aa423ced1c2c9aeb17d6a439c22744578a3b1cbdfee77f2f26be39554c1. Is this a good location to ping you with updates as we learn more?

anayeaye commented 2 years ago

@sharkinsspatial thanks--this is a great place for updates!

abarciauskas-bgse commented 2 years ago

@anayeaye change management plan looks great, we should add it to a VEDA project folder so we can re-use it or reference it in the future. Thanks for writing it up.

A few questions below, but I think we want to send this to the front end developers (Daniel, Ricardo, Hanbyul), the data publishers (Iksha, Slesa), and the ESA development team (which has been using the staging API) ASAP so they are aware staging may go down for 1-2 days next week. Do you agree?

Questions about the change management plan:

Two resource changes are needed for the delta backend stack that cannot be implemented with a simple CDK deployment.

Can we make it clear here that the plan is to deploy a new stack and, once we have verified it is operational, update the domain name servers to point to the new stack endpoints?

The pgstac database needs to be upgraded to a new schema that will allow us to ingest temporally dense data like CMIP6.

Can we make it clear that we are upgrading pgstac, which is a schema for the PostgreSQL database in RDS (as opposed to the version of PostgreSQL itself), from version XX to XX, with a link to https://github.com/stac-utils/pgstac? Also add that we will also be creating a snapshot of the existing database and using it to restore the existing datasets to the new database and schema.

Confirm database snapshot retention period is adequate for this transition work

I think adequate here just means that there is no risk of changes having been made to the database between the date of the most recent snapshot and when we use it to populate the new database. Is that your definition as well?

Test

What types of tests will you run?

anayeaye commented 2 years ago

@abarciauskas-bgse Thank you for your change management review comments! I have updated the document and agree that we need to share it with the wider VEDA team ASAP. As far as staging going down, I think this plan ensures that staging will not be down for more than an hour or two, but we will have a window during which new data ingests would be lost; the dev stack work should give us a good estimate of how long that will be.

I am not sure that the RDS restore plan is even viable (I hope it is!). I think that I can test it tomorrow and then tighten up the dates for sharing.

sharkinsspatial commented 2 years ago

@manilmaskey brought up a valid question in today's IMPACT meeting that made me consider the fact that we should have a broader strategy for cross account bucket access with the DAACs. I adopted the temporary S3 credential rotation strategy for the HLS tiler because our delivery timelines for integration with the FIRMS application were extremely short, and this didn't leave an adequate chance to coordinate with LPDAAC on a large administrative change.

@tracetechnical and I chatted a bit about this today and given the frequent maintenance windows and periodic instability of EDL it would be a good idea to have someone from IMPACT engage directly with the relevant DAACs and check if cross account policies with read access can be enabled for all roles in our accounts. There are several approaches for tackling this but it would be good to first determine if this is feasible from a policy perspective. cc @abarciauskas-bgse @anayeaye

anayeaye commented 2 years ago

Still pushing this EDL service forward as a temporary solution until cross account policies are established. PR #50 handles the VPC CIDR range limitations that were preventing us from adding the private-with-nat subnets needed to render HLS data on the map.

Currently there is not an edl-login-service deployed for the delta-backend (I took it down while navigating the VPC changes). The feature branch for the delta-backend raster-api changes needed for EDL is still open but will need a catch up when we come back to this issue.

anayeaye commented 2 years ago

This work is on hold; we should consider an alternate tiler for HLS data. PR #56 documents how credential rotation was added to a test delta-backend stack and why it cannot be used as-is (tl;dr: a single tiler can only tile either HLS or our own hosted COGs, not both).

anayeaye commented 2 years ago

Noting a possible solution to the issue raised in PR #56, from @abarciauskas-bgse @vincentsarago @sharkinsspatial:

Add an additional tiler to the delta-backend deployment that will receive EDL tokens, and use the dataset configuration or collection metadata to choose which tiler is used.
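That routing step could be sketched like this, assuming STAC collection metadata with a standard `providers` list (the endpoint URLs and provider names here are hypothetical):

```python
# Hypothetical endpoint map: one tiler per data provider, each deployed
# with that provider's rotated credentials.
TILER_ENDPOINTS = {
    "lpdaac": "https://lpdaac-tiler.example.com/cog",
    "veda": "https://tiler.example.com/cog",
}

def select_tiler(collection: dict) -> str:
    """Pick a tiler endpoint from STAC collection metadata, falling
    back to the default VEDA tiler when no provider matches."""
    for provider in collection.get("providers", []):
        name = provider.get("name", "").lower()
        if name in TILER_ENDPOINTS:
            return TILER_ENDPOINTS[name]
    return TILER_ENDPOINTS["veda"]
```

The same lookup could live in the frontend dataset configuration instead; the key design choice is that the client never has to know which tiler holds which credentials.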

anayeaye commented 2 years ago

@abarciauskas-bgse Here are some notes about what I think the delta-backend can do to support HLS for the trilateral release. I think that the second scenario is what you are proposing and I can get started on it if I have the right idea...

Short term trilateral release commitment

Two possible short term solutions exist in which we provide a us-west delta-backend stack deployed with a snapshot of the staging database (and redeploy as needed to add the latest staging-stack ingests). In both, the CloudFormation stack will have a new name (like delta-backend-west), with the possibility of moving custom domain API users over to this new us-west backend in the future (i.e. cutting traffic over from https://staging-stac.delta-backend.xyz to this new backend).

Scenario 1 (single delta backend in us-west only supporting LPDAAC-CLD)

  1. Deploy the latest delta-backend to us-west-2 with an LPDAAC credential rotation service, using a snapshot of the latest staging pgstac database.
  2. Provide the tiler base url and obtain a contact list for events that cause this url to change (the custom domain is fixed, but some VPN users will need the raw API Gateway url). This tiler will work for LPDAAC map layers and fail for other STAC collections.

Scenario 2 (multiple tilers one delta backend deployed in us-west)

Deploy latest delta backend with 3 provider-dedicated tilers

Work required

abarciauskas-bgse commented 2 years ago

To summarize my conversation with @anayeaye yesterday, I believe we want to deliver a parameterized endpoint so that clients can still use the same API endpoint for doing visualization but pass a parameter identifying the data provider. The reasoning behind this is that, while many datasets will live in the "VEDA data store bucket", other datasets in our API will be maintained by other "data providers" - most likely to be DAACs. While we will probably need some things to be true for all VEDA data providers (in that we have some way of accessing the data from our systems), I think it's the case that we will have different backend implementations to make requests of these providers, such as different S3 credentials.

In order to make this work we need to:

What do you think about this approach, @anayeaye @vincentsarago @sharkinsspatial?

vincentsarago commented 2 years ago

Our endpoint for /cog/tiles should take a parameter (?provider=lpdaac) and then route that request to a specific tiler endpoint which has credentials for that provider

@abarciauskas-bgse the problem with this approach is that it assumes we will fetch credentials on each tile request, which might not be possible (throttling). cc @sharkinsspatial

abarciauskas-bgse commented 2 years ago

When you say we will get credentials, do you mean the AWS credentials? I was still anticipating we use the AWS EDL credential rotation lambda, which I think can include an AWS session key; not sure if that helps with throttling.

vincentsarago commented 2 years ago

@abarciauskas-bgse oh, so every 30 min or so we get credentials for multiple providers, then on each user request we use one of the available credentials?

abarciauskas-bgse commented 2 years ago

There are multiple lambdas, one for each provider, and each gets new credentials every 30 minutes.
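The same 30-minute cadence could also be expressed as a per-provider cache inside a single process, so no tile request ever blocks on EDL. A sketch, where `refresh_fn` stands in for whatever actually calls the rotation service:

```python
import time

class CredentialCache:
    """Cache rotated credentials per provider so individual tile
    requests never trigger a fresh EDL login (avoiding throttling).
    refresh_fn(provider) is assumed to return a credentials dict."""

    def __init__(self, refresh_fn, ttl_seconds=1800):
        self._refresh = refresh_fn
        self._ttl = ttl_seconds
        self._store = {}  # provider -> (expires_at, credentials)

    def get(self, provider, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(provider)
        if entry is None or entry[0] <= now:
            # Expired or missing: call the rotation service once and
            # reuse the result until the TTL elapses.
            creds = self._refresh(provider)
            self._store[provider] = (now + self._ttl, creds)
            return creds
        return entry[1]
```

Whether the refresh lives in separate scheduled lambdas (as described above) or in-process like this is mostly an operational choice; the throttling constraint only requires that requests hit a cache, not EDL.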

sharkinsspatial commented 2 years ago

@abarciauskas-bgse We have a few options here. Due to restrictions on reserved Lambda environment variable keys (https://docs.aws.amazon.com/lambda/latest/dg/configuration-envvars.html), the credential environment variables AWS_ACCESS_KEY, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY cannot be set at the Lambda environment variable level. Instead we set ACCESS_KEY, ACCESS_KEY_ID, and SECRET_ACCESS_KEY, which then get mapped to the correct environment variable keys at handler instantiation: https://github.com/NASA-IMPACT/delta-backend/pull/56/files#diff-c6579356c48fc61c45cac3e22a45ce276b7dcf42ebe1cf4c0a5417fc22fca4ccR6-R11. We'll have to confirm the Lambda context caching mechanics with @vincentsarago, but you could also theoretically have a single Lambda whose handler sets these based on a request query parameter, such as ?provider=lpdaac -> os.environ["AWS_SECRET_ACCESS_KEY"] = os.environ["LPDAAC_SECRET_ACCESS_KEY"].
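A minimal version of that remapping might look like the following; the per-provider prefixed names (e.g. LPDAAC_SECRET_ACCESS_KEY) follow the pattern described above, but are otherwise illustrative:

```python
import os

def remap_credentials(provider=None):
    """Copy the non-reserved credential variables set on the Lambda
    (ACCESS_KEY, ACCESS_KEY_ID, SECRET_ACCESS_KEY) onto the reserved
    AWS_* names that GDAL and boto expect. When a provider query
    parameter is given (e.g. ?provider=lpdaac), the provider-prefixed
    variants are used instead."""
    prefix = f"{provider.upper()}_" if provider else ""
    for key in ("ACCESS_KEY", "ACCESS_KEY_ID", "SECRET_ACCESS_KEY"):
        source = f"{prefix}{key}"
        if source in os.environ:
            os.environ[f"AWS_{key}"] = os.environ[source]
```

Calling this at handler instantiation mirrors the single-provider case in PR #56; calling it per request with a provider argument is the single-Lambda, multi-provider variant, subject to the context caching caveat noted above.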

Additionally, all of these environment settings can be injected more explicitly in the mangum application via rasterio.Env(session=session) which could also be modified to use an explicit provider query parameter on a per request basis.

sharkinsspatial commented 2 years ago

Also linking to Patrick's document here for reference which outlines potential longer term strategies around this issue https://docs.google.com/document/d/18GyoMZj0I2HKAXwqyeziO0ISbOwHxo1TN4eAlR4mH3U/view.

vincentsarago commented 2 years ago

@sharkinsspatial FYI, we don't use os.environ in impact-tiler; instead we create an AWSSession which we forward to the rasterio Env: https://github.com/NASA-IMPACT/impact-tiler/blob/master/infrastructure/lambda/cog_application.py#L43-L50

This is done at the app creation level but could in theory also be done at the request level.