cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
48 stars 14 forks source link

Explore missing data and fan out in payments_rides due to customer data issues #1291

Closed mjumbewu closed 2 years ago

mjumbewu commented 2 years ago

Background

The funding_source_vault_id field is used to join a micropayment to a customer's card information. The vault IDs that correspond to a customer can unfortunately change over time. As a result, we are building a historical table to keep track of the date/times that vault ids are valid in stg_cleaned_customer_funding_source_vaults. However, we have hit a case where two vault IDs were valid for a customer ID during a specific time window. This caused fan out in the payments_rides table, and caused us to fail the uniqueness check for both micropayment_id and littlepay_transaction_id. The fan out is currently (as of March 31, 2022) very limited (resulting in one duplicated row across the entire table).

The stg_cleaned_customer_funding_source_vaults query needs to be fixed, likely by loosening the relationship between customers and vaults.

Acceptance Criteria


Details

This is due to (overzealous?) entity resolution. LittlePay has a customer_funding_source_vaults table, from which we are attempting to abstract out funding_sources and customers. However, there is a recent batch of data that's causing issues. Specifically, related to the clean-air-express rider with `funding_source_fault_id = '8238daf3-065f-4e17-8565-0ebdca14d4b8'.

Running the following in BigQuery on 29 March 2022, we get back a single row that shows up twice in the table:

select distinct *, count(*) over (partition by micropayment_id) as micropayment_id_count
from views.payments_rides
qualify micropayment_id_count > 1
micropayment_id_count participant_id micropayment_id funding_source_vault_id customer_id principal_customer_id bin masked_pan card_scheme issuer issuer_country form_factor charge_amount refund_amount nominal_amount charge_type adjustment_id adjustment_type adjustment_time_period_type adjustment_description adjustment_amount product_id product_code product_description product_type route_id route_long_name route_short_name direction vehicle_id littlepay_transaction_id device_id transaction_type transaction_outcome transaction_date_time_utc transaction_date_time_pacific location_id location_name latitude longitude off_littlepay_transaction_id off_device_id off_transaction_type off_transaction_outcome off_transaction_date_time_utc off_transaction_date_time_pacific off_location_id off_location_name off_latitude off_longitude
2 clean-air-express 9050f535-28cb-47ae-8757-c44034c6d941 8238daf3-065f-4e17-8565-0ebdca14d4b8 e5055a14-55c9-4919-a294-9a8c347f86dc e5055a14-55c9-4919-a294-9a8c347f86dc 409790 409790**8621 VISA JPMORGAN CHASE BANK, N.A. UNITED STATES 4 6 flat_fare 4b682837-e409-4ffd-8644-28be2188ef68 DAILY_CAP Daily cap10 dollars 2 72f48cba-86fb-4c5d-980f-a2b0f551e84a SBCAG-DAILY-TENDOLLARS Daily cap10 dollars CAPPING Route Z inbound 100109191057300177 d4f53c19-8890-4bf5-8543-59bc83237869 61002 single allow 2022-03-22T23:29:40.000Z 2022-03-22T16:29:40 1567522 GutrzNop IO 34.41992199999999 -119.68999649999999

With some further investigation, I found that this row is duplicated due to fan out from joining on the stg_cleaned_customer_funding_source_vaults table. Below we can see the records we've pulled from the raw data that correspond to `funding_source_vault_id = '8238daf3-065f-4e17-8565-0ebdca14d4b8':

select *
from `payments.stg_cleaned_customer_funding_source_vaults`
where funding_source_vault_id = '8238daf3-065f-4e17-8565-0ebdca14d4b8'
order by calitp_valid_at
customer_id funding_source_vault_id principal_customer_id bin masked_pan card_scheme issuer issuer_country form_factor calitp_valid_at calitp_invalid_at
e5055a14-55c9-4919-a294-9a8c347f86dc 8238daf3-065f-4e17-8565-0ebdca14d4b8 e5055a14-55c9-4919-a294-9a8c347f86dc 409790 409790**8621 VISA JPMORGAN CHASE BANK, N.A. UNITED STATES 1899-01-01T00:00:00Z 2022-03-23T05:06:00Z
99944908-58be-4fb3-864f-9bbc24305527 8238daf3-065f-4e17-8565-0ebdca14d4b8 fc85b752-43ea-430d-b634-78350e9a9952 409790 409790**8621 VISA JPMORGAN CHASE BANK, N.A. UNITED STATES 2022-03-15T05:10:00Z 2099-01-01T00:00:00Z
e5055a14-55c9-4919-a294-9a8c347f86dc 8238daf3-065f-4e17-8565-0ebdca14d4b8 e5055a14-55c9-4919-a294-9a8c347f86dc 409790 409790**8621 VISA JPMORGAN CHASE BANK, N.A. UNITED STATES 2022-03-23T05:06:00Z 2022-03-15T05:10:00Z

Above there are two funding source vault records that are valid for this id between 2022-03-15 and 2022-03-23 (there should only be one).

mjumbewu commented 2 years ago

For reference, the following are all of the principal_customer_id values that are associated with the funding_source_vault_id = '8238daf3-065f-4e17-8565-0ebdca14d4b8' over time:

with

customer_ids_for_vault as (
    select distinct principal_customer_id
    from `cal-itp-data-infra.payments.stg_enriched_customer_funding_source`
    where funding_source_vault_id = '8238daf3-065f-4e17-8565-0ebdca14d4b8'
)

select distinct
    funding_source_vault_id,
    principal_customer_id,
    masked_pan,
    calitp_export_datetime
from `cal-itp-data-infra.payments.stg_enriched_customer_funding_source`
where principal_customer_id in (select principal_customer_id from customer_ids_for_vault)
order by masked_pan, calitp_export_datetime
funding_source_vault_id principal_customer_id masked_pan calitp_export_datetime
8238daf3-065f-4e17-8565-0ebdca14d4b8 fc85b752-43ea-430d-b634-78350e9a9952 409790**8621 2022-03-15T05:10:00
8238daf3-065f-4e17-8565-0ebdca14d4b8 e5055a14-55c9-4919-a294-9a8c347f86dc 409790**8621 2022-03-23T05:06:00
5f21e7ed-075a-4834-84ff-f2f1d89476f7 fc85b752-43ea-430d-b634-78350e9a9952 464248**2437 2021-12-31T05:44:00
5f21e7ed-075a-4834-84ff-f2f1d89476f7 e5055a14-55c9-4919-a294-9a8c347f86dc 464248**2437 2022-03-23T05:06:00
52d2cf80-8172-49f0-b13b-7c0bc8bcc70c e5055a14-55c9-4919-a294-9a8c347f86dc 483316**9584 2022-03-23T05:06:00
52d2cf80-8172-49f0-b13b-7c0bc8bcc70c e5055a14-55c9-4919-a294-9a8c347f86dc 483316**9584 2022-03-25T05:06:00

Based on this, the story that I'm seeing is that each card is essentially associated with one funding source vault, and multiple cards might be associated with one principal customer. The cards 409790******8621 and 464248******2437 were associated with customer fc85b752-43ea-430d-b634-78350e9a9952 up until 2022-03-23. After 2022-03-23, those two cards as well as 483316******9584 were then associated with e5055a14-55c9-4919-a294-9a8c347f86dc.

Looking further, it seems like the 1:1 relationship between funding source vaults and CC numbers holds for all the data we have so far. The following query provides evidence for this by counting the number of masked_pan values for each funding_source_vault_id:

with

distinct_vaults as (
    select distinct
        funding_source_vault_id,
        -- Some of the PAN values are masked with dashes and some with
        -- asterisks. The following just strips out both.
        replace(replace(masked_pan, '-', ''), '*', '') as masked_pan
    from `cal-itp-data-infra.payments.stg_enriched_customer_funding_source`
)

select *, count(*) over (partition by funding_source_vault_id) as pan_count
from distinct_vaults
qualify pan_count > 1
order by masked_pan

This query comes back empty, which means that each vault id corresponds to only one PAN (note that some masked PANs correspond to multiple vaults, but since we don't have full card numbers we can reasonably guess that the masked portion differs in these cases).

The next thing to investigate would be whether we can assume that any two customers (focussing on principal_customer_ids) that correspond to a funding_source_vault_id can be considered the same customer.

mjumbewu commented 2 years ago

Based on investigation using the payments.stg_enriched_customer_funding_source table, the relationships I see between the relevant ID fields (funding_source_vault_id, customer_id, principal_customer_id) is:

erDiagram
          Vault ||..|| Credit-Card : "represents a"
          Customer }|--|{ Vault : ""
          Principal-Customer }|--|{ Customer : ""
          Principal-Customer }|--|{ Vault : ""

Frustratingly, at first glance, all relationships are many-to-many. However, this may be because customer_id and principal_customer_id values may change over time. So even though, for example, a customer_id may be related to multiple principal_customer_id values over all time, what I have not yet verified is whether a customer_id is only related to one principal_customer_id at a time.

What I expect is:

If the above is the case, then any two principal_customer_id values that are associated with the same customer_id are the same Principal, and any two principal_customer_id values that are associated with the same funding_source_vault_id are the same Principal.

Part of the challenge in figuring out whether the above expected relationships hold is in how to determine when an ID is invalidated/replaced by a new version. Maybe we can just ask Littlepay if the above relationships should hold and then operate based on their answers.


I used queries such as this to explore cardinality:

with

relationships as (
    select
        customer_id,
        principal_customer_id,
        -- The following represent the first/last time we saw this relationship
        -- in an export from Littlepay.
        min(calitp_export_datetime) as earliest_exported_at,
        max(calitp_export_datetime) as latest_exported_at
    from `cal-itp-data-infra.payments.stg_enriched_customer_funding_source`
    where principal_customer_id is not null
    group by 1, 2
),

counts as (
  select *,
      count(*) over (partition by customer_id) as principal_count,
      count(*) over (partition by principal_customer_id) as customer_count
  from relationships
)

/* Show which customer_ids are associated with multiple principal_customer_ids */
select * from counts
order by principal_count desc, customer_id, earliest_exported_at

/* Show which principal_customer_ids are associated with multiple customer_ids */
-- select * from counts
-- order by customer_count desc, principal_customer_id, earliest_exported_at
j-meelah commented 2 years ago

The findings were: