cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
47 stars 12 forks source link

Actually handle cases where GTFS service data records were not one to one #2174

Open lauriemerrell opened 1 year ago

lauriemerrell commented 1 year ago

In the past, there were some gtfs service data records that had multiple service records or multiple GTFS dataset records entered as part of one gtfs service data record. This breaks our data model. In #2103 I am just arbitrarily keeping one of the pairs (by adding a filter in stg_transit_database__gtfs_service_data), but long term we should find a way to keep both.

Originally written up in #2019: Describe the bug The GTFS service data table in Airtable is supposed to be a one to one relationship between one service and one dataset. However, historically, there have been cases where multiple services were entered as part of a single record (for example, YARTS Schedule on 10/4/22). This causes the dbt_utils_unique_combination_of_columns_stg_transit_database__gtfs_service_data_key__calitp_extracted_at test to fail.

To Reproduce See the data here: gs://calitp-airtable/california_transit__gtfs_service_data/dt=2022-10-04/ts=2022-10-04T02:00:37.652842+00:00/gtfs_service_data.jsonl.gz. If you run dbt test -s stg_transit_database__gtfs_service_data, it will fail because of the multiple service keys that are part of one record here.

Expected behavior We should maybe try to figure out how to expand these records out more gracefully.

Additional context I believe that we have implemented upstream QC processes to check and ensure that this doesn't happen (ensure that GTFS service data relationships are 1:1), so this is mostly just a historical data cleanup issue.

lauriemerrell commented 1 year ago

Leaving this open, it would be a low priority good-to-address-one-day kind of thing.