Closed mjumbewu closed 2 years ago
For reference, the following are all of the principal_customer_id
values that are associated with the funding_source_vault_id = '8238daf3-065f-4e17-8565-0ebdca14d4b8'
over time:
with
customer_ids_for_vault as (
select distinct principal_customer_id
from `cal-itp-data-infra.payments.stg_enriched_customer_funding_source`
where funding_source_vault_id = '8238daf3-065f-4e17-8565-0ebdca14d4b8'
)
select distinct
funding_source_vault_id,
principal_customer_id,
masked_pan,
calitp_export_datetime
from `cal-itp-data-infra.payments.stg_enriched_customer_funding_source`
where principal_customer_id in (select principal_customer_id from customer_ids_for_vault)
order by masked_pan, calitp_export_datetime
funding_source_vault_id | principal_customer_id | masked_pan | calitp_export_datetime |
---|---|---|---|
8238daf3-065f-4e17-8565-0ebdca14d4b8 | fc85b752-43ea-430d-b634-78350e9a9952 | 409790**8621 | 2022-03-15T05:10:00 |
8238daf3-065f-4e17-8565-0ebdca14d4b8 | e5055a14-55c9-4919-a294-9a8c347f86dc | 409790**8621 | 2022-03-23T05:06:00 |
5f21e7ed-075a-4834-84ff-f2f1d89476f7 | fc85b752-43ea-430d-b634-78350e9a9952 | 464248**2437 | 2021-12-31T05:44:00 |
5f21e7ed-075a-4834-84ff-f2f1d89476f7 | e5055a14-55c9-4919-a294-9a8c347f86dc | 464248**2437 | 2022-03-23T05:06:00 |
52d2cf80-8172-49f0-b13b-7c0bc8bcc70c | e5055a14-55c9-4919-a294-9a8c347f86dc | 483316**9584 | 2022-03-23T05:06:00 |
52d2cf80-8172-49f0-b13b-7c0bc8bcc70c | e5055a14-55c9-4919-a294-9a8c347f86dc | 483316**9584 | 2022-03-25T05:06:00 |
Based on this, the story that I'm seeing is that each card is essentially associated with one funding source vault, and multiple cards might be associated with one principal customer. The cards 409790******8621
and 464248******2437
were associated with customer fc85b752-43ea-430d-b634-78350e9a9952
up until 2022-03-23. After 2022-03-23, those two cards as well as 483316******9584
were then associated with e5055a14-55c9-4919-a294-9a8c347f86dc
.
Looking further, it seems like the 1:1 relationship between funding source vaults and CC numbers holds for all the data we have so far. The following query provides evidence for this by counting the number of masked_pan
values for each funding_source_vault_id
:
with
distinct_vaults as (
select distinct
funding_source_vault_id,
-- Some of the PAN values are masked with dashes and some with
-- asterisks. The following just strips out both.
replace(replace(masked_pan, '-', ''), '*', '') as masked_pan
from `cal-itp-data-infra.payments.stg_enriched_customer_funding_source`
)
select *, count(*) over (partition by funding_source_vault_id) as pan_count
from distinct_vaults
qualify pan_count > 1
order by masked_pan
This query comes back empty, which means that each vault id corresponds to only one PAN (note that some masked PANs correspond to multiple vaults, but since we don't have full card numbers we can reasonably guess that the masked portion differs in these cases).
The next thing to investigate would be whether we can assume that any two customers (focussing on principal_customer_id
s) that correspond to a funding_source_vault_id
can be considered the same customer.
Based on investigation using the payments.stg_enriched_customer_funding_source
table, the relationships I see between the relevant ID fields (funding_source_vault_id
, customer_id
, principal_customer_id
) is:
erDiagram
Vault ||..|| Credit-Card : "represents a"
Customer }|--|{ Vault : ""
Principal-Customer }|--|{ Customer : ""
Principal-Customer }|--|{ Vault : ""
Frustratingly, at first glance, all relationships are many-to-many. However, this may be because customer_id
and principal_customer_id
values may change over time. So even though, for example, a customer_id
may be related to multiple principal_customer_id
values over all time, what I have not yet verified is whether a customer_id
is only related to one principal_customer_id
at a time.
What I expect is:
If the above is the case, then any two principal_customer_id
values that are associated with the same customer_id
are the same Principal, and any two principal_customer_id
values that are associated with the same funding_source_vault_id
are the same Principal.
Part of the challenge in figuring out whether the above expected relationships hold is in how to determine when an ID is invalidated/replaced by a new version. Maybe we can just ask Littlepay if the above relationships should hold and then operate based on their answers.
I used queries such as this to explore cardinality:
with
relationships as (
select
customer_id,
principal_customer_id,
-- The following represent the first/last time we saw this relationship
-- in an export from Littlepay.
min(calitp_export_datetime) as earliest_exported_at,
max(calitp_export_datetime) as latest_exported_at
from `cal-itp-data-infra.payments.stg_enriched_customer_funding_source`
where principal_customer_id is not null
group by 1, 2
),
counts as (
select *,
count(*) over (partition by customer_id) as principal_count,
count(*) over (partition by principal_customer_id) as customer_count
from relationships
)
/* Show which customer_ids are associated with multiple principal_customer_ids */
select * from counts
order by principal_count desc, customer_id, earliest_exported_at
/* Show which principal_customer_ids are associated with multiple customer_ids */
-- select * from counts
-- order by customer_count desc, principal_customer_id, earliest_exported_at
The findings were:
funding_source_vault_id
value always identifies a unique payment medium, andprincipal_customer_id
values associated with a funding_source_vault_id identify the same principal person/account.
There's one question that I never fully closed: when a customer_id
value is related to multiple principal_customer_id
values, are all of those principals always associated with the same vault/card. If so, we should probably just entirely ignore customer_id
and rely on principal & vault. (To be addressed in https://github.com/cal-itp/data-infra/issues/1467)
Background
The
funding_source_vault_id
field is used to join a micropayment to a customer's card information. The vault IDs that correspond to a customer can unfortunately change over time. As a result, we are building a historical table to keep track of the date/times that vault ids are valid in stg_cleaned_customer_funding_source_vaults. However, we have hit a case where two vault IDs were valid for a customer ID during a specific time window. This caused fan out in thepayments_rides
table, and caused us to fail the uniqueness check for bothmicropayment_id
andlittlepay_transaction_id
. The fan out is currently (as of March 31, 2022) very limited (resulting in one duplicated row across the entire table).The stg_cleaned_customer_funding_source_vaults query needs to be fixed, likely by loosening the relationship between customers and vaults.
Acceptance Criteria
funding_source_vault_id
at any given time.LEFT JOIN
ed inpayments_rides
because we've seen the customer funding source data change on a different cadence than micropayments and transactions. With the left join, credit card information will benull
for a small number of rows, but at least rides won't be missing.Details
This is due to (overzealous?) entity resolution. LittlePay has a
customer_funding_source_vaults
table, from which we are attempting to abstract outfunding_sources
andcustomers
. However, there is a recent batch of data that's causing issues. Specifically, related to the clean-air-express rider with `funding_source_fault_id = '8238daf3-065f-4e17-8565-0ebdca14d4b8'.Running the following in BigQuery on 29 March 2022, we get back a single row that shows up twice in the table:
With some further investigation, I found that this row is duplicated due to fan out from joining on the
stg_cleaned_customer_funding_source_vaults
table. Below we can see the records we've pulled from the raw data that correspond to `funding_source_vault_id = '8238daf3-065f-4e17-8565-0ebdca14d4b8':Above there are two funding source vault records that are valid for this id between
2022-03-15
and2022-03-23
(there should only be one).