cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

Bug: gtfs_views.gtfs_schedule_dim_feeds fails to create unique feed_key #1079

Closed evansiroky closed 2 years ago

evansiroky commented 2 years ago

Describe the bug

Since 2022-02-11, the DAG task gtfs_views.gtfs_schedule_dim_feeds has been failing an assertion that requires every created feed_key to be unique.

To Reproduce

2022-02-11 try 2 logs:

[2022-02-13 01:54:52,113] {logging_mixin.py:109} INFO - field test passed 0 feed_key check_null True 0 feed_key check_unique False
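
For readers unfamiliar with the check names in that log line, here is a minimal sketch of what a null check and a uniqueness check on feed_key conceptually verify. This is not the pipeline's actual implementation: the function bodies, the sample rows, and the feed_name column are assumptions made up for illustration.

```python
from collections import Counter


def check_null(rows, field):
    """Return True when no row has a missing (None) value for `field`."""
    return all(row.get(field) is not None for row in rows)


def check_unique(rows, field):
    """Return True when every value of `field` appears exactly once."""
    counts = Counter(row[field] for row in rows)
    return all(n == 1 for n in counts.values())


# Hypothetical rows: two entries ended up with the same feed_key.
rows = [
    {"feed_key": "a1", "feed_name": "Agency A"},
    {"feed_key": "b2", "feed_name": "Agency B"},
    {"feed_key": "b2", "feed_name": "Agency B"},
]

null_ok = check_null(rows, "feed_key")      # no key is missing
unique_ok = check_unique(rows, "feed_key")  # "b2" appears twice
print("check_null", null_ok, "check_unique", unique_ok)
```

With these sample rows the null check passes and the uniqueness check fails, which matches the pattern in the log above (check_null True, check_unique False).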

Expected behavior

The assertions should pass without errors.

Additional context

This error began happening after https://github.com/cal-itp/data-infra/pull/1070 was merged, though that may be unrelated.

lauriemerrell commented 2 years ago

There was another DAG failure that I suspect was related -- gtfs_schedule_fact_daily_trips failed the trip_key, service_date, service_id check_composite_unique check.
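
A composite uniqueness check like the one named above can be sketched as grouping rows by the tuple of key columns and reporting any tuple seen more than once. The helper name duplicate_composite_keys and the sample rows are hypothetical, and this is an assumption about what the check means rather than the code the DAG runs.

```python
from collections import Counter


def duplicate_composite_keys(rows, fields):
    """Return {key_tuple: count} for every combination of `fields` seen more than once."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical rows: the first two share the same (trip_key, service_date, service_id).
rows = [
    {"trip_key": "t1", "service_date": "2022-02-11", "service_id": "wkdy"},
    {"trip_key": "t1", "service_date": "2022-02-11", "service_id": "wkdy"},
    {"trip_key": "t2", "service_date": "2022-02-11", "service_id": "wkdy"},
]

dupes = duplicate_composite_keys(rows, ("trip_key", "service_date", "service_id"))
composite_unique_ok = not dupes  # the check passes only when nothing is duplicated
print(dupes, composite_unique_ok)
```

Returning the offending key tuples, rather than only a pass/fail flag, makes it easier to trace a failure back to the duplicated source rows.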

lauriemerrell commented 2 years ago

Investigation notes:

So... it's not a problem with how we updated agencies.yml. But we need to figure out how to handle cases like this. It's not totally unlike the whitespace situation: the underlying data is "wrong", and that breaks our pipeline.

lauriemerrell commented 2 years ago

I think that this should be fixed alongside the whitespace fix in #1022 -- I believe that we'd want to change all the clean tables to use select distinct.
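
A rough illustration of the select distinct idea, using an in-memory SQLite database in place of the real warehouse. The table names raw_feed_info and clean_feed_info, the columns, and the sample rows are all hypothetical, and the sketch assumes the duplicated keys come from exact duplicate rows in the source data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw table: the source data contains one exact duplicate row.
conn.execute("CREATE TABLE raw_feed_info (feed_key TEXT, feed_publisher_name TEXT)")
conn.executemany(
    "INSERT INTO raw_feed_info VALUES (?, ?)",
    [("k1", "Agency A"), ("k2", "Agency B"), ("k2", "Agency B")],
)

# Build the clean table with SELECT DISTINCT so exact duplicates collapse to one row.
conn.execute("CREATE TABLE clean_feed_info AS SELECT DISTINCT * FROM raw_feed_info")


def count_duplicate_keys(table):
    """Count feed_key values that appear on more than one row of `table`."""
    query = (
        "SELECT COUNT(*) FROM ("
        f"SELECT feed_key FROM {table} GROUP BY feed_key HAVING COUNT(*) > 1"
        ")"
    )
    return conn.execute(query).fetchone()[0]


raw_dupes = count_duplicate_keys("raw_feed_info")
clean_dupes = count_duplicate_keys("clean_feed_info")
print(raw_dupes, clean_dupes)
```

Note that select distinct only removes rows that are identical in every column. Rows that share a key but differ in some other column would still fail a uniqueness check and would need separate handling.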