leothomas opened this issue 3 years ago
Some thoughts on the ingestion pipeline:
Scientists contact the developers to determine the correct display options for the dataset (eg: colour swatches, legend stops, rescaling factor, etc.). A dataset metadata JSON file gets added to the API's codebase and the API gets deployed. Every 24hrs, the dataset metadata generator lambda reads through all of the dataset metadata JSON files to find each dataset's S3 folder, searches those folders for the dates available for each dataset, and writes the results to a JSON file in S3. The API then reads from this JSON file to display the dates available for each dataset.
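For reference, a rough sketch of what that 24-hour generator does (the bucket name, metadata file layout, date pattern, and the `s3_prefix`/`id` fields are assumptions for illustration, not the actual implementation):

```python
import glob
import json
import re
import boto3

s3 = boto3.client("s3")
BUCKET = "example-dataset-bucket"                 # placeholder bucket name
DATE_RE = re.compile(r"\d{4}[_-]\d{2}[_-]\d{2}")  # assumed date pattern in object keys


def handler(event, context):
    available_dates = {}
    # the dataset metadata JSON files are bundled with the API codebase
    for metadata_file in glob.glob("dataset-metadata/*.json"):
        with open(metadata_file) as f:
            dataset = json.load(f)
        dates = set()
        paginator = s3.get_paginator("list_objects_v2")
        # "s3_prefix" and "id" are hypothetical field names in the metadata file
        for page in paginator.paginate(Bucket=BUCKET, Prefix=dataset["s3_prefix"]):
            for obj in page.get("Contents", []):
                match = DATE_RE.search(obj["Key"])
                if match:
                    dates.add(match.group(0))
        available_dates[dataset["id"]] = sorted(dates)

    # the API reads this file to know which dates to display for each dataset
    s3.put_object(
        Bucket=BUCKET,
        Key="dataset-metadata.json",
        Body=json.dumps(available_dates).encode("utf-8"),
    )
```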
Most datasets are not regularly updated at this point (eg: `xo2-*`, `pzd-anomaly`, `OMNO2d_*`). For some datasets, scientists contact either @olafveerman or myself directly by email and we take care of processing (if needed) and uploading the data to S3 (eg: `agriculture-cropmonitor` just needs to be uploaded to S3, while `nightlights-viirs` is delivered as web-mercator tiles that need to be "stitched" together into COGs before being uploaded; this processing is done from my laptop, using a custom script).
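A minimal sketch of what such a stitching script might look like, assuming GDAL and rio-cogeo are available (the tile paths and output filename are placeholders, not the actual script):

```python
from osgeo import gdal
from rio_cogeo.cogeo import cog_translate
from rio_cogeo.profiles import cog_profiles

# placeholder paths for the downloaded web-mercator tiles
tile_paths = ["tiles/tile_0_0.tif", "tiles/tile_0_1.tif", "tiles/tile_1_0.tif"]

# build a virtual mosaic of all the tiles, then convert it to a single COG
vrt = gdal.BuildVRT("/tmp/mosaic.vrt", tile_paths)
vrt = None  # close/flush the VRT to disk before handing it to rio-cogeo

cog_translate("/tmp/mosaic.vrt", "nightlights_mosaic_cog.tif", cog_profiles.get("deflate"))
```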
The goal of this ingest pipeline is to minimize as much as possible the manual steps needed when ingesting data during initial and recurring deliveries.
Open questions: should scientists be able to deliver data by means other than the AWS CLI (eg: `cURL`)? Should the pipeline accept only GeoTIFFs (`.tif`), or also zip archives (`.zip`) and compressed tarballs (`.tar.gz`)?

The proposed base flow:
1. Scientists use the AWS CLI to copy datafiles (`.tif`) to an S3 bucket
2. An S3 Lambda trigger runs on `create:*` actions in the `/delivery` subfolder
3. The S3 trigger lambda function executes the following steps (the lambda function is packaged with GDAL and rasterio using the lambgeo/docker-lambda docker image, and is either deployed as a zip archive or as a docker image directly; a rough sketch of this handler follows the list):
   3.1. Validate whether or not the tiles are valid COGs, using rio-cogeo's `rio_cogeo.cogeo.cog_validate()`
   3.2. If a file is not a valid COG, attempt to convert it to a COG using `rio_cogeo.cogeo.cog_translate()`
   3.3. If the COG generation fails, abort the ingestion
   3.4. Generate a `.vrt` file
   3.5. Move the data files to their final location
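A hedged sketch of what that S3-triggered handler could look like (step 3.4, the `.vrt` generation, is omitted; the `datasets/` prefix and the handling of `cog_validate`'s return value across rio-cogeo versions are assumptions):

```python
import os
import boto3
from rio_cogeo.cogeo import cog_translate, cog_validate
from rio_cogeo.profiles import cog_profiles

s3 = boto3.client("s3")
FINAL_PREFIX = "datasets/"  # hypothetical final location for validated COGs


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]  # e.g. delivery/<dataset>/<file>.tif
        local_path = f"/tmp/{os.path.basename(key)}"
        s3.download_file(bucket, key, local_path)

        # 3.1: validate. cog_validate returns a bool in older rio-cogeo releases
        # and an (is_valid, errors, warnings) tuple in newer ones, so handle both.
        result = cog_validate(local_path)
        is_valid = result[0] if isinstance(result, tuple) else result

        if not is_valid:
            # 3.2: attempt conversion to a COG; 3.3: any exception here aborts the ingestion
            cog_path = f"/tmp/cog_{os.path.basename(key)}"
            cog_translate(local_path, cog_path, cog_profiles.get("deflate"))
            local_path = cog_path

        # 3.5: move the (now valid) COG to its final location
        final_key = key.replace("delivery/", FINAL_PREFIX, 1)
        s3.upload_file(local_path, bucket, final_key)
```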
Possible extensions to this base flow:
- Accept `.zip` and/or `.tar.gz` archives
  This will likely be impossible to do in a lambda function for many of the datasets, given the size of the data files (especially for global layers updated daily), and would require a non-lambda processing step (AWS Batch, an ECS task, etc.). These can be orchestrated using a Step Function (a sketch of the trigger hand-off is included after this list). The overall flow would look like:
  S3 upload event → `StartExecution` of the Step Function → Batch/ECS task to unpack and convert to COGs → API and `.vrt` create operations (in a lambda function)
- Accept data uploads other than directly uploading to S3:
  eg: using `cURL` to execute a PUT request (see the presigned-URL sketch after this list)
- Notify the scientists if their upload has been rejected
  This will require a way for scientists to indicate a "contact" email, either at the time of upload or by adding a `contact` field to the dataset's metadata file. It will also require a sending address or domain to be verified in SES in whichever AWS account the ingestion pipeline runs in.
- Store/Log items in a STAC API
  This will require standing up a STAC API (eg: the CDK construct and `sat-api-pg` instance from the CSDAP orders API).
- Generate MosaicJSONs from the dataset tiles (see the sketch after this list)
  This will require having ingested the data items as STAC items. The code required to create a mosaic can be copied from the latest versions of titiler. The COVID API was originally a fork of titiler, so we may also decide to rebase the fork off of the latest version (this would likely be a lot of work for not necessarily a huge advantage).
- A staging "sandbox" where scientists can first deliver their datasets, view them in the dashboard, and then release them publicly
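For the archive bullet above, a rough sketch of how the S3 trigger could hand large deliveries off to a Step Function (the state machine ARN and the file-extension check are assumptions; the state machine itself would run the Batch/ECS unpack-and-convert steps followed by the API/`.vrt` lambda step):

```python
import json
import os
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = os.environ.get("INGEST_STATE_MACHINE_ARN", "")  # placeholder


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        if key.endswith((".zip", ".tar.gz")):
            # archives are too large to process in-lambda, so hand off to the Step Function
            sfn.start_execution(
                stateMachineArn=STATE_MACHINE_ARN,
                input=json.dumps({"bucket": bucket, "key": key}),
            )
```

For the cURL PUT bullet, one possible approach (an assumption, not a decided design) is for the API to issue presigned S3 URLs so scientists can upload without AWS credentials:

```python
import boto3

s3 = boto3.client("s3")


def presigned_upload_url(bucket: str, key: str, expires: int = 3600) -> str:
    """Return a URL that allows a single PUT upload without AWS credentials."""
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires,
    )


# bucket/key are placeholders; the scientist would then upload with something like:
#   curl -X PUT --upload-file mydata.tif "<presigned url>"
url = presigned_upload_url("example-delivery-bucket", "delivery/my-dataset/mydata.tif")
print(url)
```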
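And for the MosaicJSON bullet, a small sketch using cogeo-mosaic (the library titiler builds on); the S3 URLs are placeholders and in practice the tile list would come from the ingested STAC items:

```python
from cogeo_mosaic.mosaic import MosaicJSON

# placeholder S3 URLs for one day of a dataset's COG tiles
tile_urls = [
    "s3://example-dataset-bucket/my-dataset/2021_01_01/tile_0_0.tif",
    "s3://example-dataset-bucket/my-dataset/2021_01_01/tile_0_1.tif",
]

# cogeo-mosaic reads each COG's footprint and builds the MosaicJSON document
mosaic = MosaicJSON.from_urls(tile_urls)
print(mosaic.json())  # pydantic model; this is the document to store alongside the dataset
```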
Goal: put together an ADR describing the ingestion pipeline. This includes (in order of complexity to implement) the extensions listed above, as well as handling each dataset's `nodata` value.
Related to #37