MobilityData / mobility-database-catalogs

The Catalogs of Sources of the Mobility Database.
Apache License 2.0

Add GitHub actions: archives, cronjobs #26

Closed: emmambd closed this issue 2 years ago

emmambd commented 2 years ago

What problem are we trying to solve? Users need to get the latest dataset of a source, which means modifying the current data pipeline to extract the latest dataset and its bounding box.

How will we know when this is done? There is a latest dataset URL available for each source. A cronjob runs daily to check for the most recent update.

Constraints

maximearmstrong commented 2 years ago

After discussion, here are the three initial workflows we will implement:

  1. Store the latest dataset on approval:

    • For each added or modified file under catalogs/sources/gtfs/schedule:
      • If the auto-discovery URL points to a readable dataset, download and store the dataset in the bucket using the source filename. Otherwise, raise an error to prevent merging the PR (see the download-and-validate sketch after this list).
      • Use the latest URL to test downloading the latest dataset.
      • If the dataset downloaded from the latest URL is not readable, raise an error to prevent merging the PR.
  2. Store the latest dataset using a daily cronjob:

    • For each file under catalogs/sources/gtfs/schedule:
      • Download the dataset using the auto-discovery URL.
      • If the downloaded dataset is readable, store it in the bucket using the source filename. Otherwise, don't update the latest URL in the bucket and add a problem to the cronjob report.
      • If updated, use the latest URL to test downloading the latest dataset.
      • If the dataset downloaded from the latest URL is not readable, add a problem to the cronjob report.
  3. Detect deleted and renamed files on PR (nice-to-have):

    • If at least one file has been deleted or renamed under catalogs/sources/gtfs/schedule, raise an error to prevent merging the PR (sketched below).
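
Workflows (1) and (2) share the same download-and-validate core. Below is a minimal Python sketch of what that step could look like; the `fetch_readable_dataset` helper, the `requests` dependency, and the ZIP readability check are illustrative assumptions, not the repository's actual code. Workflow (1) would let the error fail the PR check, while workflow (2) catches it and records a problem for the cronjob report.

```python
import io
import zipfile

import requests


def fetch_readable_dataset(url: str) -> bytes:
    """Download a dataset and confirm it is a readable GTFS Schedule feed.

    Raises ValueError when the payload is not a valid ZIP archive, so the
    on-approval workflow (1) can fail the PR check, while the cronjob
    workflow (2) catches the error and records a problem instead.
    """
    response = requests.get(url, timeout=120)
    response.raise_for_status()
    content = response.content
    try:
        # GTFS Schedule feeds are ZIP archives; an unreadable archive
        # means an unreadable dataset.
        with zipfile.ZipFile(io.BytesIO(content)) as archive:
            if archive.testzip() is not None:
                raise ValueError(f"Corrupt archive at {url}")
    except zipfile.BadZipFile as exc:
        raise ValueError(f"Unreadable dataset at {url}") from exc
    return content


# Cronjob-style usage (workflow 2): collect problems instead of failing.
problems = []
for url in ["https://example.com/gtfs.zip"]:  # placeholder source URLs
    try:
        fetch_readable_dataset(url)
    except (requests.RequestException, ValueError) as exc:
        problems.append({"url": url, "error": str(exc)})
```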
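
For the nice-to-have check (3), one possible approach is a git diff filtered to deletions and renames, assuming the action checks out the repository with the base branch available. The refs and script layout below are assumptions for illustration, not the implemented workflow.

```python
import subprocess
import sys

SCHEDULE_DIR = "catalogs/sources/gtfs/schedule"


def deleted_or_renamed(base_ref: str, head_ref: str) -> list[str]:
    """List schedule source files deleted (D) or renamed (R) in the PR."""
    result = subprocess.run(
        ["git", "diff", "--name-only", "--diff-filter=DR",
         f"{base_ref}...{head_ref}", "--", SCHEDULE_DIR],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]


if __name__ == "__main__":
    offending = deleted_or_renamed("origin/main", "HEAD")
    if offending:
        # A non-zero exit fails the action, which prevents merging the PR.
        print("Deleted or renamed source files:", *offending, sep="\n  ")
        sys.exit(1)
```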

For the first workflow (1), datasets uploaded to the bucket will overwrite the previous ones because we use the source filename to identify the datasets. This is okay since we will make sure that adding a new source cannot overwrite another source in the catalogs.
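
As a rough illustration of that overwrite behavior, here is a hedged sketch assuming the bucket is Google Cloud Storage; the bucket name, helper name, and `.zip` object suffix are assumptions. Re-uploading under the same source filename replaces the previous copy, which is what gives each source a stable latest URL.

```python
from google.cloud import storage


def store_latest_dataset(bucket_name: str, source_filename: str, dataset: bytes) -> str:
    """Upload the dataset under the source filename, replacing any prior copy."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(f"{source_filename}.zip")
    blob.upload_from_string(dataset, content_type="application/zip")
    # A stable object name yields a stable "latest" URL for the source.
    return f"https://storage.googleapis.com/{bucket_name}/{blob.name}"
```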