Describe the bug
As described in #914, the Roseville (271) and SacRT (273) GTFS feeds have inconsistent leading whitespace in their service_id fields in calendar.txt and calendar_dates.txt. This leads to a failed join in our pipeline so these agencies do not have their active service hours populated in cal-itp-data-infra.views.gtfs_schedule_fact_daily_service or views.gtfs_schedule_stg_daily_service.
SELECT * FROMcal-itp-data-infra.views.gtfs_schedule_fact_daily_service WHERE calitp_itp_id = 273
SELECT * FROMcal-itp-data-infra.views.gtfs_schedule_fact_daily_trips WHERE calitp_itp_id = 273
Expected behavior
We want to have SacRT's service appear in these tables.
Proposed solution
Per discussion, we would like to add a specific fix just for Roseville (271) and SacRT (273) where we trim whitespace from service_id for these agencies before performing this join to address this known issue with their data.
Describe the bug As described in #914, the Roseville (271) and SacRT (273) GTFS feeds have inconsistent leading whitespace in their
service_id
fields incalendar.txt
andcalendar_dates.txt
. This leads to a failed join in our pipeline so these agencies do not have their active service hours populated incal-itp-data-infra.views.gtfs_schedule_fact_daily_service
orviews.gtfs_schedule_stg_daily_service
.We believe the affected line of code is: https://github.com/cal-itp/data-infra/blob/main/airflow/dags/gtfs_views/gtfs_schedule_fact_daily_trips.sql#L52
To Reproduce Steps to reproduce the behavior:
SELECT * FROM
cal-itp-data-infra.views.gtfs_schedule_fact_daily_serviceWHERE calitp_itp_id = 273
SELECT * FROM
cal-itp-data-infra.views.gtfs_schedule_fact_daily_tripsWHERE calitp_itp_id = 273
Expected behavior We want to have SacRT's service appear in these tables.
Proposed solution Per discussion, we would like to add a specific fix just for Roseville (271) and SacRT (273) where we trim whitespace from
service_id
for these agencies before performing this join to address this known issue with their data.