cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

Bug: Need to investigate apparent duplicate data #379

Closed mjumbewu closed 3 years ago

mjumbewu commented 3 years ago

Describe the bug

In both the device_transactions and micropayments tables we are seeing multiple records that have the same unique key (or what we thought were unique keys). We need to know:

As an example, the micropayment with micropayment_id = '000317e7-220f-4c09-817c-b32a7b308812' occurs in the _gs://gtfs-data/mst/processed/micropayments/2021-09-13_202107190532micropayments.psv file twice:

So, in that case, even if we deleted all existing records from the table before inserting the contents of the file, we'd still end up with a duplicate micropayment_id.
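A minimal sketch of how the within-file check could be reproduced, assuming the .psv files are pipe-delimited with a header row. The function name `duplicate_keys`, the `charge_amount` column, and the sample rows are hypothetical and are not taken from the real file.

```python
import csv
import io
from collections import Counter


def duplicate_keys(psv_text, key):
    """Return {key value: count} for key values that appear more than once."""
    # Assumption: the file is pipe-delimited and its first line is a header.
    reader = csv.DictReader(io.StringIO(psv_text), delimiter="|")
    counts = Counter(row[key] for row in reader)
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical sample data standing in for one micropayments file.
sample = (
    "micropayment_id|charge_amount\n"
    "aaa|2.50\n"
    "bbb|1.25\n"
    "aaa|2.50\n"
)
print(duplicate_keys(sample, "micropayment_id"))
```

Any non-empty result for a single file means that a delete-then-insert load alone cannot make micropayment_id unique.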

For the device_transactions data, there aren't any duplicate littlepay_transaction_id values within the same file. However, there are still plenty of duplicates across different files. These duplicates appear to have different location_id values, so the records aren't complete duplicates (have they been changed/corrected between files?).
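To see which fields change between files, one option is to group rows by key across all files and list the columns whose values disagree. This is only a sketch: `differing_columns`, the file names, and the sample rows are hypothetical, and each file is assumed to be already parsed into a list of dicts.

```python
from collections import defaultdict


def differing_columns(rows_by_file, key):
    """Map each key value that appears more than once across the given
    files to the sorted list of columns whose values differ."""
    seen = defaultdict(list)
    for rows in rows_by_file.values():
        for row in rows:
            seen[row[key]].append(row)

    result = {}
    for value, rows in seen.items():
        if len(rows) < 2:
            continue  # key value is unique across the files
        columns = set().union(*(r.keys() for r in rows))
        result[value] = sorted(
            c for c in columns if len({r.get(c) for r in rows}) > 1
        )
    return result


# Hypothetical rows standing in for two device_transactions files.
rows_by_file = {
    "file_a.psv": [
        {"littlepay_transaction_id": "t1", "location_id": "100"},
    ],
    "file_b.psv": [
        {"littlepay_transaction_id": "t1", "location_id": "200"},
        {"littlepay_transaction_id": "t2", "location_id": "300"},
    ],
}
print(differing_columns(rows_by_file, "littlepay_transaction_id"))
```

An empty list for a key would indicate a complete duplicate, while a non-empty list points at the fields that were changed or corrected between files.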

To Reproduce

Expected behavior

We expected the micropayments.micropayment_id and device_transactions.littlepay_transaction_id fields to be unique.

mjumbewu commented 3 years ago

An additional piece of information about the device_transactions duplicates: as mentioned in this Slack thread, as of Sep 15, 10:10 AM there are 813 duplicate littlepay_transaction_id values in device_transactions, and for all but 22 of those, at least one of the duplicate records has a route_id value of 'Route Z', which is apparently an "unidentified" route.
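A count like the one above could be derived roughly as follows. This is a sketch under assumptions, not the query that was actually run: `route_z_summary` and the sample rows are hypothetical, and rows are assumed to be dicts with littlepay_transaction_id and route_id fields.

```python
from collections import defaultdict


def route_z_summary(rows, key="littlepay_transaction_id", placeholder="Route Z"):
    """Return (number of duplicated key values,
    number of those with no record on the placeholder route)."""
    routes_by_id = defaultdict(list)
    for row in rows:
        routes_by_id[row[key]].append(row["route_id"])

    # Keep only key values that occur on more than one record.
    duplicated = {k: v for k, v in routes_by_id.items() if len(v) > 1}
    with_placeholder = sum(
        1 for routes in duplicated.values() if placeholder in routes
    )
    return len(duplicated), len(duplicated) - with_placeholder


# Hypothetical sample rows, not real transaction data.
rows = [
    {"littlepay_transaction_id": "t1", "route_id": "Route Z"},
    {"littlepay_transaction_id": "t1", "route_id": "12"},
    {"littlepay_transaction_id": "t2", "route_id": "7"},
    {"littlepay_transaction_id": "t2", "route_id": "7"},
    {"littlepay_transaction_id": "t3", "route_id": "5"},
]
print(route_z_summary(rows))
```

The second number corresponds to the duplicates that the 'Route Z' placeholder does not explain and that would need a separate look.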

machow commented 3 years ago

See this gist from late September:

https://gist.github.com/mjumbewu/ac1e2e56bac5eb6a6e2ae2568303a8a8

mjumbewu commented 3 years ago

Closing with follow-up work to be done in https://github.com/cal-itp/data-infra/issues/596