leothomas opened this issue 3 years ago
Some thoughts on the ingestion pipeline:
Scientists contact the developers to determine the correct display options for the dataset (eg: colour swatches, legend stops, rescaling factor, etc.). A dataset metadata JSON file gets added to the API's codebase and the API gets deployed. Every 24hrs, the dataset metadata generator lambda reads through all of the dataset metadata JSON files to find each dataset's S3 folder, searches those folders for the dates available for each dataset, and writes the results to a JSON file in S3. The API then reads from this JSON file to display the dates available for each dataset.
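For reference, a rough sketch of what that 24-hour generator does (the bucket name, metadata file layout, date pattern, and the `s3_prefix`/`id` fields are assumptions for illustration, not the actual implementation):

```python
import glob
import json
import re
import boto3

s3 = boto3.client("s3")
BUCKET = "example-dataset-bucket"                 # placeholder bucket name
DATE_RE = re.compile(r"\d{4}[_-]\d{2}[_-]\d{2}")  # assumed date pattern in object keys


def handler(event, context):
    available_dates = {}
    # the dataset metadata JSON files are bundled with the API codebase
    for metadata_file in glob.glob("dataset-metadata/*.json"):
        with open(metadata_file) as f:
            dataset = json.load(f)
        dates = set()
        paginator = s3.get_paginator("list_objects_v2")
        # "s3_prefix" and "id" are hypothetical field names in the metadata file
        for page in paginator.paginate(Bucket=BUCKET, Prefix=dataset["s3_prefix"]):
            for obj in page.get("Contents", []):
                match = DATE_RE.search(obj["Key"])
                if match:
                    dates.add(match.group(0))
        available_dates[dataset["id"]] = sorted(dates)

    # the API reads this file to know which dates to display for each dataset
    s3.put_object(
        Bucket=BUCKET,
        Key="dataset-metadata.json",
        Body=json.dumps(available_dates).encode("utf-8"),
    )
```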
Most datasets are not regularly updated at this point (eg: `xo2-*`, `pzd-anomaly`, `OMNO2d_*`). For some datasets, scientists contact either @olafveerman or myself directly by email and we take care of processing (if needed) and uploading the data to S3 (eg: `agriculture-cropmonitor` just needs to be uploaded to S3, while `nightlights-viirs` is delivered as web-mercator tiles that need to be "stitched" together into COGs before being uploaded; this processing is done from my laptop, using a custom script).
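A minimal sketch of what such a stitching script might look like, assuming GDAL and rio-cogeo are available (the tile paths and output filename are placeholders, not the actual script):

```python
from osgeo import gdal
from rio_cogeo.cogeo import cog_translate
from rio_cogeo.profiles import cog_profiles

# placeholder paths for the downloaded web-mercator tiles
tile_paths = ["tiles/tile_0_0.tif", "tiles/tile_0_1.tif", "tiles/tile_1_0.tif"]

# build a virtual mosaic of all the tiles, then convert it to a single COG
vrt = gdal.BuildVRT("/tmp/mosaic.vrt", tile_paths)
vrt = None  # close/flush the VRT to disk before handing it to rio-cogeo

cog_translate("/tmp/mosaic.vrt", "nightlights_mosaic_cog.tif", cog_profiles.get("deflate"))
```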
The goal of this ingest pipeline is to minimize as much as possible the manual steps needed when ingesting data during initial and recurring deliveries.
Open questions: should scientists be able to deliver data by means other than the AWS CLI (eg: `cURL`)? Should the pipeline accept only GeoTIFFs (`.tif`), or also zip archives (`.zip`) and compressed tarballs (`.tar.gz`)?

The proposed base flow:
1. Scientists use the AWS CLI to copy datafiles (`.tif`) to an S3 bucket
2. An S3 Lambda trigger runs on `create:*` actions in the `/delivery` subfolder
3. The S3 trigger lambda function executes the following steps (the lambda function is packaged with GDAL and rasterio using the lambgeo/docker-lambda docker image, and is either deployed as a zip archive or as a docker image directly; a rough sketch of this handler follows the list):
   3.1. Validate whether or not the tiles are valid COGs, using rio-cogeo's `rio_cogeo.cogeo.cog_validate()`
   3.2. If a file is not a valid COG, attempt to convert it to a COG using `rio_cogeo.cogeo.cog_translate()`
   3.3. If the COG generation fails, abort the ingestion
   3.4. Generate a `.vrt` file
   3.5. Move the data files to their final location
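A hedged sketch of what that S3-triggered handler could look like (step 3.4, the `.vrt` generation, is omitted; the `datasets/` prefix and the handling of `cog_validate`'s return value across rio-cogeo versions are assumptions):

```python
import os
import boto3
from rio_cogeo.cogeo import cog_translate, cog_validate
from rio_cogeo.profiles import cog_profiles

s3 = boto3.client("s3")
FINAL_PREFIX = "datasets/"  # hypothetical final location for validated COGs


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]  # e.g. delivery/<dataset>/<file>.tif
        local_path = f"/tmp/{os.path.basename(key)}"
        s3.download_file(bucket, key, local_path)

        # 3.1: validate. cog_validate returns a bool in older rio-cogeo releases
        # and an (is_valid, errors, warnings) tuple in newer ones, so handle both.
        result = cog_validate(local_path)
        is_valid = result[0] if isinstance(result, tuple) else result

        if not is_valid:
            # 3.2: attempt conversion to a COG; 3.3: any exception here aborts the ingestion
            cog_path = f"/tmp/cog_{os.path.basename(key)}"
            cog_translate(local_path, cog_path, cog_profiles.get("deflate"))
            local_path = cog_path

        # 3.5: move the (now valid) COG to its final location
        final_key = key.replace("delivery/", FINAL_PREFIX, 1)
        s3.upload_file(local_path, bucket, final_key)
```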
Possible extensions to this base flow:
- Accept `.zip` and/or `.tar.gz` archives
  This will likely be impossible to do in a lambda function for many of the datasets, given the size of the data files (especially for global layers updated daily), and would require a non-lambda processing step (AWS Batch, an ECS task, etc.). These can be orchestrated using a Step Function (a sketch of the trigger hand-off is included after this list). The overall flow would look like:
  S3 upload event → `StartExecution` of the Step Function → Batch/ECS task to unpack and convert to COGs → API and `.vrt` create operations (in a lambda function)
- Accept data uploads other than directly uploading to S3:
  eg: using `cURL` to execute a PUT request (see the presigned-URL sketch after this list)
- Notify the scientists if their upload has been rejected
  This will require a way for scientists to indicate a "contact" email, either at the time of upload or by adding a `contact` field to the dataset's metadata file. It will also require a sending address or domain to be verified in SES in whichever AWS account the ingestion pipeline runs in.
- Store/Log items in a STAC API
  This will require standing up a STAC API (eg: the CDK construct and `sat-api-pg` instance from the CSDAP orders API).
- Generate MosaicJSONs from the dataset tiles (see the sketch after this list)
  This will require having ingested the data items as STAC items. The code required to create a mosaic can be copied from the latest versions of titiler. The COVID API was originally a fork of titiler, so we may also decide to rebase the fork off of the latest version (this would likely be a lot of work for not necessarily a huge advantage).
- A staging "sandbox" where scientists can first deliver their datasets, view them in the dashboard, and then release them publicly
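For the archive bullet above, a rough sketch of how the S3 trigger could hand large deliveries off to a Step Function (the state machine ARN and the file-extension check are assumptions; the state machine itself would run the Batch/ECS unpack-and-convert steps followed by the API/`.vrt` lambda step):

```python
import json
import os
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = os.environ.get("INGEST_STATE_MACHINE_ARN", "")  # placeholder


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        if key.endswith((".zip", ".tar.gz")):
            # archives are too large to process in-lambda, so hand off to the Step Function
            sfn.start_execution(
                stateMachineArn=STATE_MACHINE_ARN,
                input=json.dumps({"bucket": bucket, "key": key}),
            )
```

For the cURL PUT bullet, one possible approach (an assumption, not a decided design) is for the API to issue presigned S3 URLs so scientists can upload without AWS credentials:

```python
import boto3

s3 = boto3.client("s3")


def presigned_upload_url(bucket: str, key: str, expires: int = 3600) -> str:
    """Return a URL that allows a single PUT upload without AWS credentials."""
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires,
    )


# bucket/key are placeholders; the scientist would then upload with something like:
#   curl -X PUT --upload-file mydata.tif "<presigned url>"
url = presigned_upload_url("example-delivery-bucket", "delivery/my-dataset/mydata.tif")
print(url)
```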
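And for the MosaicJSON bullet, a small sketch using cogeo-mosaic (the library titiler builds on); the S3 URLs are placeholders and in practice the tile list would come from the ingested STAC items:

```python
from cogeo_mosaic.mosaic import MosaicJSON

# placeholder S3 URLs for one day of a dataset's COG tiles
tile_urls = [
    "s3://example-dataset-bucket/my-dataset/2021_01_01/tile_0_0.tif",
    "s3://example-dataset-bucket/my-dataset/2021_01_01/tile_0_1.tif",
]

# cogeo-mosaic reads each COG's footprint and builds the MosaicJSON document
mosaic = MosaicJSON.from_urls(tile_urls)
print(mosaic.json())  # pydantic model; this is the document to store alongside the dataset
```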
Goal: put together an ADR describing the ingestion pipeline. This includes (in order of complexity to implement) the extensions listed above, as well as handling each dataset's `nodata` value.
Related to #37