cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
47 stars 12 forks source link

Bug: One-off fix for SacRT and Roseville service_id leading whitespace #943

Closed lauriemerrell closed 2 years ago

lauriemerrell commented 2 years ago

Describe the bug As described in #914, the Roseville (271) and SacRT (273) GTFS feeds have inconsistent leading whitespace in their service_id fields in calendar.txt and calendar_dates.txt. This leads to a failed join in our pipeline so these agencies do not have their active service hours populated in cal-itp-data-infra.views.gtfs_schedule_fact_daily_service or views.gtfs_schedule_stg_daily_service.

We believe the affected line of code is: https://github.com/cal-itp/data-infra/blob/main/airflow/dags/gtfs_views/gtfs_schedule_fact_daily_trips.sql#L52

To Reproduce Steps to reproduce the behavior:

  1. SacRT (273) shows no service in these queries:

SELECT * FROMcal-itp-data-infra.views.gtfs_schedule_fact_daily_service WHERE calitp_itp_id = 273

SELECT * FROMcal-itp-data-infra.views.gtfs_schedule_fact_daily_trips WHERE calitp_itp_id = 273

Expected behavior We want to have SacRT's service appear in these tables.

Proposed solution Per discussion, we would like to add a specific fix just for Roseville (271) and SacRT (273) where we trim whitespace from service_id for these agencies before performing this join to address this known issue with their data.

lauriemerrell commented 2 years ago

New decision, going the #946 route instead.