NASA-IMPACT / veda-backend

Backend services for VEDA

Add AWS provisioning to raster api cdk workflow for ingestion pipelines #22

Open jvntf opened 2 years ago

jvntf commented 2 years ago

For the cloud-optimized data pipelines, certain resources need to be created within the VPC in order for the ingestion Lambdas to work properly (until now this has been done manually).

At least one Lambda task needs to write to the database, while also maintaining access to the internet. Following this guide, we should create the following when deploying the raster api cdk construct:

The lambda that needs to acc

cc @anayeaye @abarciauskas-bgse
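
For reference, a rough CDK (Python) sketch of the kind of provisioning described above — `vpc`, `db_security_group`, and the handler asset are placeholder names rather than resources from this repo, and the snippet assumes it runs inside the raster api stack's scope:

```python
from aws_cdk import Duration, aws_ec2 as ec2, aws_lambda as lambda_

# Ingestion Lambda placed in the same VPC as the pgstac database. A private
# subnet with a NAT gateway gives it outbound internet access while keeping
# the database unreachable from the public internet.
ingest_lambda = lambda_.Function(
    self,  # the raster api stack / construct scope
    "IngestLambda",
    runtime=lambda_.Runtime.PYTHON_3_9,
    handler="handler.handler",
    code=lambda_.Code.from_asset("ingestion"),
    timeout=Duration.minutes(5),
    vpc=vpc,
    vpc_subnets=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_NAT),
)

# Allow the Lambda's security group to reach the database on the Postgres port.
db_security_group.add_ingress_rule(
    peer=ingest_lambda.connections.security_groups[0],
    connection=ec2.Port.tcp(5432),
    description="Ingestion Lambda -> pgstac",
)
```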

leothomas commented 2 years ago

@jvntf Which resources need to be created in the VPC?

Check out this CDK code in the APT stack. It does much of what you're describing above:

jvntf commented 2 years ago

thanks @leothomas! i will check out this code

sharkinsspatial commented 2 years ago

@jvntf @leothomas Are there defined estimates for:

  1. Which targets these ingestion Lambdas will be downloading data from?
  2. What target they will be moving data to (I am assuming an S3 bucket)?
  3. The potential volume of data they will be transferring?

The architectural requirements you have here (a Lambda which needs public internet access and simultaneously needs to communicate with a db on a private subnet) have been widely discussed across projects (I believe @leothomas and I had some discussions surrounding this issue for APT). I'm not very experienced with AWS networking intricacies, but one concern with the approach used in APT here is that, given the potentially large volume of data being transferred, there is the possibility of substantial NAT Gateway data processing and data transfer costs. @abarciauskas-bgse @wildintellect and Phil Varner prepared a very detailed document discussing some of this case https://docs.google.com/document/d/1uYr6XnEQY9Bx7_uamia9aGimQW99L52IXg6LIrsdH2A/edit#

I'd be interested in us landing on a canonical decision for how we handle this case. In previous projects we have taken the low-security approach of placing the RDS instance in public subnets to avoid NAT Gateway charges for the massive data transfers involved in downloading Sentinel-2 data from ESA (https://github.com/NASA-IMPACT/hls-sentinel2-downloader-serverless). That database only stores non-sensitive, ephemeral log data, so the tradeoff was acceptable, but I'm unsure what the optimal approach is here.

If the answer to question 1 above is always S3, then I believe this is a moot point and an S3 VPC Endpoint should eliminate the NAT Gateway overhead costs, but someone with more experience in this area might have better details.
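
If the sources do end up being S3, the gateway endpoint is a one-liner in CDK — a minimal sketch, assuming the stack already exposes a `vpc` object:

```python
from aws_cdk import aws_ec2 as ec2

# Gateway endpoint so S3 traffic from the private subnets stays on the AWS
# network instead of passing through the NAT gateway's data processing.
vpc.add_gateway_endpoint(
    "S3Endpoint",
    service=ec2.GatewayVpcEndpointAwsService.S3,
)
```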

sharkinsspatial commented 2 years ago

cc @edkeeble for reference as we investigate architecture options.

edkeeble commented 2 years ago

@sharkinsspatial I did some digging this afternoon on the feasibility of using a single lambda function associated with both the private and public subnets in the VPC (to attempt to bypass the NAT gateway charges when downloading large amounts of data). As @wildintellect already stated, this is not possible (never should have doubted you!).

While you can assign a lambda to multiple subnets, this is essentially defining a pool of subnets to which the lambda might connect. Each time the lambda is invoked, a random subnet is selected from that pool and a network interface is created within that subnet. Hence, all subnets in the pool must be functionally identical.
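
To put that in CDK terms (names here are hypothetical), `vpc_subnets` is just a selection the service draws from — and CDK will refuse public subnets for a Lambda unless `allow_public_subnet=True` is set, which still would not give the function a public IP:

```python
from aws_cdk import aws_ec2 as ec2, aws_lambda as lambda_

fn = lambda_.Function(
    self,
    "IngestFn",
    runtime=lambda_.Runtime.PYTHON_3_9,
    handler="handler.handler",
    code=lambda_.Code.from_asset("lambdas/ingest"),
    vpc=vpc,
    # An ENI is created in one of the selected subnets, chosen by the service,
    # so every subnet in the selection must provide identical routing.
    vpc_subnets=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_NAT),
)
```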

If we expect the ingestion process to download large amounts of data from the Internet, we could use a two-step approach:

  1. A lambda outside the VPC downloads the data, converts it to a suitable format (e.g. ndjson for loading into pgstac) and stores it in an intermediate location easily and cheaply accessible from the VPC (e.g. S3)
  2. A second lambda inside the private subnet of the VPC fetches that file and loads it into the DB

There are any number of ways to implement the above approach: we could use Step Functions, an S3 trigger for the second function, or the first function could invoke the second directly.
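
A hedged sketch of that wiring using the S3-trigger option — the bucket, function names, and asset paths are all placeholders:

```python
from aws_cdk import aws_ec2 as ec2, aws_lambda as lambda_, aws_s3 as s3
from aws_cdk import aws_s3_notifications as s3n

staging_bucket = s3.Bucket(self, "IngestStagingBucket")

# Step 1: no VPC attachment, so the function has normal internet access.
download_fn = lambda_.Function(
    self,
    "DownloadAndTransform",
    runtime=lambda_.Runtime.PYTHON_3_9,
    handler="download.handler",
    code=lambda_.Code.from_asset("lambdas/download"),
)
staging_bucket.grant_write(download_fn)

# Step 2: runs in a private subnet next to the database.
load_fn = lambda_.Function(
    self,
    "LoadToPgstac",
    runtime=lambda_.Runtime.PYTHON_3_9,
    handler="load.handler",
    code=lambda_.Code.from_asset("lambdas/load"),
    vpc=vpc,
    vpc_subnets=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_NAT),
)
staging_bucket.grant_read(load_fn)

# Trigger the loader whenever a new ndjson file lands in the staging bucket.
staging_bucket.add_event_notification(
    s3.EventType.OBJECT_CREATED,
    s3n.LambdaDestination(load_fn),
    s3.NotificationKeyFilter(suffix=".ndjson"),
)
```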

sharkinsspatial commented 2 years ago

@edkeeble @wildintellect Thanks so much for investigating this. It is great to have some clarity around this Lambda limitation. I fully agree with a two-step approach. I don't know what the scale / rate of ingestion would be like for this service, but my guess is that @bitner would recommend periodic batch loads of large ndjson files to pgstac over frequent individual inserts via the transactions endpoint. I'd imagine a two-step process potentially like:

  1. A Lambda in a public subnet copies a file to the target S3 location and pushes a message with that location to SQS.
  2. A Lambda in a private subnet polls SQS via a VPC Endpoint for items to load in a chunk size we can configure, streams them to an ndjson file, loads it to pgstac and marks the messages as processed.

This might be slight overkill, but it gives us a nice throttle to control how we interact with the database. I've been considering refactoring HLS to use a similar approach, as our error reprocessing logic sometimes results in thousands of near-simultaneous db inserts (which would make @bitner 😢).
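
A rough sketch of how the SQS variant could be wired in CDK, reusing the hypothetical `vpc`, `download_fn`, and `load_fn` names from the sketch above (the batch size and batching window are just illustrative):

```python
from aws_cdk import Duration, aws_ec2 as ec2, aws_sqs as sqs
from aws_cdk import aws_lambda_event_sources as eventsources

# Queue of "file is staged at this S3 location" messages.
ingest_queue = sqs.Queue(
    self,
    "IngestQueue",
    visibility_timeout=Duration.minutes(15),  # at least the loader's timeout
)
ingest_queue.grant_send_messages(download_fn)

# Interface endpoint so the private-subnet loader can call SQS APIs
# (e.g. deleting processed messages) without leaving the VPC.
vpc.add_interface_endpoint(
    "SqsEndpoint",
    service=ec2.InterfaceVpcEndpointAwsService.SQS,
)

# Deliver messages to the loader in configurable chunks; the loader streams
# them into an ndjson file and loads it into pgstac.
load_fn.add_event_source(
    eventsources.SqsEventSource(
        ingest_queue,
        batch_size=100,
        max_batching_window=Duration.minutes(5),
    )
)
```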

abarciauskas-bgse commented 2 years ago

As a reminder to myself + the group, I think if we successfully use this multi-lambda, multi-subnet approach we should make sure to demo it to both the Development Seed Earthdata team and the NASA IMPACT team as a proposed best practice for cost-effective handling of external data transfer in Lambdas while maintaining an RDS instance in a private VPC subnet.