NASA-IMPACT / covid-api

MIT License

EPIC: Formalize ingestion process #136

Open leothomas opened 3 years ago

leothomas commented 3 years ago

Related to #37

This includes (in order of complexity to implement):

leothomas commented 3 years ago

Some thoughts on the ingestion pipeline:

Current process:

Contributing new datasets:

Scientists contact the developers to determine the correct display options for the dataset (eg: colour swatches, legend stops, rescaling factor, etc). A dataset metadata json file is added to the API's codebase and the API is deployed. Every 24hrs, the dataset metadata generator lambda reads the metadata json file of each dataset, searches that dataset's S3 folder for available dates, and writes the results to a json file in S3. The API then reads from this JSON file to report the dates available for each dataset.
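The date-scanning part of the metadata generator could be sketched roughly as below. The key layout, dataset name, and the date regex are illustrative assumptions, not the actual covid-api implementation:

```python
import re

# Assumed filename convention: a YYYY_MM_DD or YYYY-MM-DD date embedded in each key.
DATE_RE = re.compile(r"(\d{4})[_-](\d{2})[_-](\d{2})")

def extract_dates(keys):
    """Collect the distinct ISO dates found in a list of S3 object keys."""
    dates = set()
    for key in keys:
        match = DATE_RE.search(key)
        if match:
            dates.add("-".join(match.groups()))
    return sorted(dates)

def build_metadata(datasets):
    """Map each dataset id to its available dates (datasets: id -> list of keys)."""
    return {ds_id: {"dates": extract_dates(keys)} for ds_id, keys in datasets.items()}
```

In the real lambda the key lists would come from an S3 `ListObjectsV2` call per dataset folder, and the resulting dict would be written back to S3 as the JSON file the API reads.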

Contributing new data for existing datasets:

Most datasets are not regularly updated at this point (eg: xo2-*, pzd-anomaly, OMNO2d_*). For some datasets, scientists contact either @olafveerman or myself directly by email and we take care of processing (if needed) and uploading the data to S3. For example, agriculture-cropmonitor just needs to be uploaded to S3, while nightlights-viirs is delivered as web-mercator tiles that need to be "stitched" together into COGs before being uploaded; this processing is done from my laptop, using a custom script.

Goal:

The goal of this ingest pipeline is to minimize the manual steps needed when ingesting data, both for initial and for recurring deliveries.

Unknowns:

Lowest complexity/ 1st iteration implementation of the ingestion pipeline:

1. Delivering the data:

Scientists use the AWS CLI to copy datafiles (.tif) to an S3 bucket

2. Triggering the ingest:

An S3 Lambda trigger runs on create:* actions in the /delivery subfolder
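A minimal sketch of the trigger's entry point, assuming the standard S3 event notification shape; the `delivery/` prefix check mirrors the subfolder above, and the bucket/key names are illustrative:

```python
def extract_delivery_objects(event):
    """Return (bucket, key) pairs for objects created under delivery/."""
    objects = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key", "")
        if bucket and key.startswith("delivery/"):
            objects.append((bucket, key))
    return objects

def handler(event, context):
    """Lambda entry point: hand each delivered file to the ingest step."""
    for bucket, key in extract_delivery_objects(event):
        # Placeholder: this is where the ingest script (step 3) would be invoked.
        print(f"ingesting s3://{bucket}/{key}")
```

In practice the prefix filter can also be configured on the S3 event notification itself, so the lambda only ever receives /delivery objects; the in-code check is then just a safety net.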

3. Ingest script:

The S3-triggered lambda function executes the following steps (the lambda function is packaged with GDAL and rasterio using the lambgeo/docker-lambda docker image, and deployed either as a zip archive or directly as a docker image):

3.1. Validate whether or not the files are valid COGs, using rio_cogeo.cogeo.cog_validate()
3.2. If a file is not a valid COG, attempt to convert it to a COG using rio_cogeo.cogeo.cog_translate()
3.3. If the COG generation fails, abort the ingestion
3.4. Generate a .vrt file
3.5. Move the data files to their final location
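The control flow of steps 3.1–3.5 could look like the sketch below. The COG helpers are passed in as callables so the flow can be shown without bundling GDAL/rasterio here; in the actual lambda, `validate` would wrap `rio_cogeo.cogeo.cog_validate` and `translate` would wrap `rio_cogeo.cogeo.cog_translate`, and the vrt/move steps are placeholders:

```python
def ingest(path, validate, translate, build_vrt, move):
    """Run ingest steps 3.1-3.5; return the final location, or None on abort."""
    if not validate(path):            # 3.1 check whether the file is a valid COG
        try:
            path = translate(path)    # 3.2 attempt conversion to COG
        except Exception:
            return None               # 3.3 abort the ingestion on failure
    build_vrt(path)                   # 3.4 generate the .vrt file
    return move(path)                 # 3.5 move the file to its final location
```

Keeping the steps behind injected callables also makes the pipeline easy to unit-test with stubs before wiring it to real S3 objects.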

Potential improvements:

olafveerman commented 3 years ago

Goal: put together an ADR describing the ingestion pipeline