NASA-IMPACT / covid-api

MIT License

EPIC: Incorporate github based dataset metadata workflow #135

Open leothomas opened 3 years ago

leothomas commented 3 years ago

See: https://github.com/NASA-IMPACT/dashboard-api-starter

leothomas commented 3 years ago

[IN PROGRESS]

Context:

Metadata files describe how a dataset should be displayed (legend stops, color map, rescale, etc.), where to find its COGs in S3, and which dates are available. Originally, all of this information lived in JSON files stored in the dashboard (frontend) code repository. To avoid manually updating the available dates and re-deploying the dashboard with each new data delivery, generation of each dataset's domain (its available dates) was moved to a backend process. A /datasets endpoint was created which would, for each dataset, scan the S3 bucket to collect all available files, then extract and return the date from each. The metadata files themselves were also moved to the backend.
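As a rough illustration of the date-extraction step described above, here is a minimal sketch. It assumes a hypothetical key naming convention (`..._YYYY_MM_DD.tif`); the actual S3 key layout and the API's real implementation are not shown in this issue, and `extract_dates` is an illustrative name.

```python
import re
from datetime import date
from typing import Iterable, List

# Assumption: keys end in _YYYY_MM_DD.tif. The real bucket layout may differ.
DATE_PATTERN = re.compile(r"_(\d{4})_(\d{2})_(\d{2})\.tif$")


def extract_dates(keys: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated ISO dates found in a list of S3 keys."""
    found = set()
    for key in keys:
        match = DATE_PATTERN.search(key)
        if match is None:
            continue  # not a dated COG (e.g. a README or an index file)
        year, month, day = (int(part) for part in match.groups())
        try:
            found.add(date(year, month, day))
        except ValueError:
            continue  # digits matched, but they do not form a real calendar date
    return [d.isoformat() for d in sorted(found)]
```

In the original design this kind of scan ran on every request, so its cost grew with the number of files in the bucket.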

Because of the growing number of data files, and because the dashboard queried available dates for each dataset individually, the /datasets endpoint's response became too slow. Dataset domain generation was therefore moved to a Lambda function that runs once every 24 hours and stores the available dates in a JSON file in the same S3 bucket as the rest of the data. The /datasets endpoint now simply reads from this JSON file, so its response time is no longer affected by the number of data files in S3.
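The precompute-then-read pattern described above can be sketched roughly as follows. This is only an illustration: `build_domain_cache`, `read_domain`, and the injected `list_dates` callable are hypothetical names, and the S3 read and write calls are deliberately left out so the logic can be run on its own.

```python
import json
from typing import Callable, Dict, Iterable, List


def build_domain_cache(
    dataset_prefixes: Dict[str, str],
    list_dates: Callable[[str], Iterable[str]],
) -> str:
    """Scheduled job: scan every dataset once and serialize its domain to JSON.

    `list_dates` stands in for whatever lists a prefix in S3 and returns the
    ISO dates found there; it is injected so this sketch needs no AWS access.
    """
    domains = {
        dataset_id: sorted(set(list_dates(prefix)))
        for dataset_id, prefix in dataset_prefixes.items()
    }
    return json.dumps(domains, sort_keys=True)


def read_domain(cache_body: str, dataset_id: str) -> List[str]:
    """Endpoint side: look up a dataset's dates in the cached JSON body."""
    return json.loads(cache_body).get(dataset_id, [])
```

The expensive scan happens once per schedule interval, and each request only parses a single small JSON document. The trade-off is that newly delivered files do not show up until the next scheduled run.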

When re-thinking the structure of the API for the EO Lab-in-a-box project, @abarciauskas-bgse had the idea to move the dashboard dataset metadata files to a separate GitHub repository. This is a great idea, as it keeps the ability to version datasets and open feature branches when integrating new datasets, without requiring knowledge of the code base. It also gives us the ability to use GitHub Actions to generate dataset domains when opening or merging PRs.

I'd like to integrate this into the covid dashboard API's workflow, in order to make it easier to visualize and validate datasets in the dashboard without having to deploy the data to production.

Proposed workflow: