Open lauriemerrell opened 1 year ago
This is still a low priority issue. To identify affected URLs, can use:
SELECT base64_url, COUNT(DISTINCT gtfs_dataset_key) AS ct
FROM `cal-itp-data-infra.mart_gtfs.fct_daily_schedule_feeds`
GROUP BY 1
HAVING ct = 0
Currently there are URLs that appeared in agencies.yml for which we have GTFS data in the warehouse but where that URL never appeared in a
gtfs dataset
record in Airtable (because of data fully missing in Airtable or simply because the URL changed over time and the old URL was only in Airtable during the period of time before we started snapshotting Airtable.)To make sure that all downloaded data can be mapped to something, we would like to create dbt seeds that would manually map these URLs to an Airtable record. For datasets that were fully deleted, we can create a Historic record in Airtable and reference that (there is precedent for this), and for datasets that still exist but with a different URL we can map to that record.
How to implement this may depend on if/when we backfill Airtable; some of this might be ameliorated if we just backfill Airtable and get more history, but I think this issue will persist even after backfill.