cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
47 stars 12 forks source link

Add dbt seeds for URLs that never appeared in Airtable #2178

Open lauriemerrell opened 1 year ago

lauriemerrell commented 1 year ago

Currently there are URLs that appeared in agencies.yml for which we have GTFS data in the warehouse but where that URL never appeared in a gtfs dataset record in Airtable (because of data fully missing in Airtable or simply because the URL changed over time and the old URL was only in Airtable during the period of time before we started snapshotting Airtable.)

To make sure that all downloaded data can be mapped to something, we would like to create dbt seeds that would manually map these URLs to an Airtable record. For datasets that were fully deleted, we can create a Historic record in Airtable and reference that (there is precedent for this), and for datasets that still exist but with a different URL we can map to that record.

How to implement this may depend on if/when we backfill Airtable; some of this might be ameliorated if we just backfill Airtable and get more history, but I think this issue will persist even after backfill.

lauriemerrell commented 1 year ago

This is still a low priority issue. To identify affected URLs, can use:

SELECT base64_url, COUNT(DISTINCT gtfs_dataset_key) AS ct
FROM `cal-itp-data-infra.mart_gtfs.fct_daily_schedule_feeds`
GROUP BY 1 
HAVING ct = 0