cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
47 stars 12 forks source link

Bug: duplicate entries showing up in the fct_transit_data_quality_issues table #3407

Closed evansiroky closed 1 month ago

evansiroky commented 2 months ago

Describe the bug

I believe that the fct_transit_data_quality_issues table should only ever have one entry per Airtable record. However, there are times when multiple records are showing up in the fct_transit_data_quality_issues table. This appears to happen whenever there is a change in either the GTFS Dataset or Service associated with the Airtable issue.

To Reproduce

See metabase question: https://dashboards.calitp.org/question/2516-fct-transit-data-quality-issues-with-duplicate-entries

Expected behavior

Each Airtable record should show up once.

Additional context

This is causing incorrect statistics to show up in the Transit Data Quality Issue Dashboard.

vevetron commented 2 months ago

I think this is because an issue an airtable can have multiple gtfs records (or services) attached to it. e.g. transit issue 1

image

has 5 duplicates but also 5 gtfs records attached to it.

2 of sets of duplicates are due to having multiple services attached to it. It could also be because of issue_type but i'm not seeing that happening because airtable restricts you to only one issue type.

This happens during the unnest step in this table.

Solutions:

  1. I don't think need all the different gtfs_source_ids or services, for the final table, and perhaps we can just pick the first one and do the joins on that.
  2. We could just smash the duplicates later and make the unique field a list of the different entries.
    1. This could cause issues for joining, if anyone wants to join on these fields (not likely)
    2. fields like gtfs_dataset_key_at_creation become less meaningful, but maybe still relevant.
  3. (less desired) Stop airtable from allowing multiple services or gtfs feeds at the source. Clean up the existing ones.
  4. Create a whole new table that's the smashed version without duplicates.

I think 1. is easiest, and preserves the point of this table and the issues.

Separately I added code so the name of the gtfs file and service ends up in the final table.

vevetron commented 1 month ago

In a call @evansiroky said choice 1 is fine.