After discussion, here are the initial 3 workflows we will do:
1. Store the latest dataset on approval: `catalogs/sources/gtfs/schedule`
2. Store the latest dataset using a daily cronjob: `catalogs/sources/gtfs/schedule`
3. Detect deleted and renamed files on PR (nice-to-have): `catalogs/sources/gtfs/schedule`, raise an error to prevent merging the PR.

For the first workflow (1), the datasets uploaded to the bucket will overwrite the previous ones because we are using the source filename to identify the datasets (see the sketch below). This is okay since we will make sure adding a new source will not overwrite another source in the catalogs.
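To make the overwrite behaviour concrete, here is a minimal sketch of how workflow (1) could upload a dataset keyed by the source filename, so that a newer dataset for the same source replaces the previous object. The bucket name, object naming, and helper function are assumptions for illustration, not the project's actual implementation.

```python
# Hedged sketch of workflow (1): upload the latest dataset to a bucket,
# using the source filename as the object name so a re-upload for the
# same source overwrites the previous object.
# The bucket name and naming scheme below are hypothetical.
from pathlib import Path

from google.cloud import storage  # pip install google-cloud-storage


def store_latest_dataset(
    source_file: str,
    dataset_path: str,
    bucket_name: str = "mobility-latest-datasets",  # hypothetical bucket
) -> str:
    """Upload the dataset and return its public URL.

    Because the object name is derived from the source filename, uploading
    a newer dataset for the same source replaces the previous one, while
    two different sources never collide as long as their filenames differ.
    """
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    # e.g. catalogs/sources/gtfs/schedule/some-source.json -> some-source.zip
    object_name = f"{Path(source_file).stem}.zip"
    blob = bucket.blob(object_name)
    blob.upload_from_filename(dataset_path)
    return f"https://storage.googleapis.com/{bucket_name}/{object_name}"
```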
What problem are we trying to solve? Users need to get the latest dataset of a source, which means modifying the current data pipeline to extract the latest dataset and its bounding box.
How will we know when this is done? There is a latest dataset URL available for each source. A cronjob runs daily to check for the most recent update.
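As a rough illustration of the daily check, the sketch below re-downloads a source's feed and compares a content hash with the one stored from the previous run. The `urls.direct_download` field, the hash-state directory, and the function name are assumptions rather than the pipeline's actual mechanism.

```python
# Hedged sketch of workflow (2): a daily job that checks whether a source's
# feed has changed since the last run before storing a new latest dataset.
# Field names and the local hash-state layout are assumptions.
import hashlib
import json
from pathlib import Path

import requests  # pip install requests


def dataset_has_changed(source_json: str, state_dir: str = "latest_hashes") -> bool:
    """Download the feed and compare its hash with the previously stored one."""
    source = json.loads(Path(source_json).read_text())
    url = source["urls"]["direct_download"]  # assumed field layout
    content = requests.get(url, timeout=60).content
    digest = hashlib.sha256(content).hexdigest()

    state_file = Path(state_dir) / (Path(source_json).stem + ".sha256")
    previous = state_file.read_text().strip() if state_file.exists() else None
    if digest == previous:
        return False  # no new dataset since the last run
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(digest)
    return True
```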
Constraints