Closed evansiroky closed 2 years ago
There was another DAG failure that I suspect was related -- gtfs_schedule_fact_daily_trips failed the trip_key, service_date, service_id check_composite_unique check.
Investigation notes:
SELECT feed_key, calitp_itp_id, calitp_url_number, calitp_agency_name, count(calitp_gtfs_schedule_url) as ct FROM cal-itp-data-infra.views.gtfs_schedule_dim_feeds GROUP BY feed_key, calitp_itp_id, calitp_url_number, calitp_agency_name HAVING ct > 1
yields only feed_key -8555051680192762051, which is Foothill Transit, extracted on Friday 2022-02-11. Their feed_info.txt file literally just contains two identical lines -- here is their fully raw feed_info.txt file:
So... It's not a problem with how we updated agencies.yml. But we need to figure out how to handle cases like this... Not totally unlike the whitespace situation. The underlying data is "wrong" but it breaks our pipeline.
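For reference, a minimal sketch of how repeated rows in a raw GTFS text file could be detected before loading. The function name find_duplicate_rows and the sample contents are hypothetical stand-ins, not the actual Foothill Transit file or anything in our pipeline:

```python
import csv
import io
from collections import Counter


def find_duplicate_rows(text):
    """Return {row_tuple: count} for data rows that appear more than once."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header row
    counts = Counter(tuple(row) for row in reader if row)
    return {row: n for row, n in counts.items() if n > 1}


# Hypothetical stand-in for a feed_info.txt whose only data line is repeated
sample = (
    "feed_publisher_name,feed_publisher_url,feed_lang\n"
    "Example Transit,https://example.com,en\n"
    "Example Transit,https://example.com,en\n"
)

duplicates = find_duplicate_rows(sample)
```

This only flags byte-identical rows after CSV parsing; rows that differ by whitespace would need the separate whitespace handling.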
I think that this should be fixed alongside the whitespace fix in #1022 -- I believe that we'd want to change all the clean tables to use select distinct.
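A rough illustration of what select distinct would do here, using an in-memory SQLite table rather than BigQuery. The table name feed_info_raw and its rows are made up for the example; the real clean tables may look different:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw table; columns are a small subset chosen for illustration
conn.execute(
    "CREATE TABLE feed_info_raw ("
    "calitp_itp_id INTEGER, calitp_url_number INTEGER, feed_publisher_name TEXT)"
)
conn.executemany(
    "INSERT INTO feed_info_raw VALUES (?, ?, ?)",
    [
        (1, 0, "Example Transit"),
        (1, 0, "Example Transit"),  # exact duplicate of the row above
        (2, 0, "Other Transit"),
    ],
)

# Without deduplication, the (itp_id, url_number) pair is not unique
dupes = conn.execute(
    "SELECT calitp_itp_id, calitp_url_number, COUNT(*) "
    "FROM feed_info_raw "
    "GROUP BY calitp_itp_id, calitp_url_number "
    "HAVING COUNT(*) > 1"
).fetchall()

# SELECT DISTINCT collapses fully identical rows into one
clean_rows = conn.execute(
    "SELECT DISTINCT * FROM feed_info_raw ORDER BY calitp_itp_id"
).fetchall()
```

Note that select distinct only removes rows that match in every column, so it would not help if two rows shared a key but differed elsewhere.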
Describe the bug
Since 2022-02-11, the DAG gtfs_views.gtfs_schedule_dim_feeds has failed an assertion that requires all created feed_keys to be unique.
To Reproduce
2022-02-11 try 2 logs:
Expected behavior
The assertions should pass.
Additional context
This error began happening after https://github.com/cal-itp/data-infra/pull/1070 was merged in, but maybe that is unrelated.