CDCgov / RecordLinker

The RecordLinker is a service that links records from two datasets based on a set of common attributes. The service is designed to be used in a variety of public health contexts, such as linking patient records from different sources or linking records from different public health surveillance systems.
https://cdcgov.github.io/RecordLinker/
Apache License 2.0

Duplicate patients with multiple last names not matching #17

Closed ericbuckley closed 1 month ago

ericbuckley commented 1 month ago

Summary

When the attached patient payload is sent to the link-record endpoint twice, both calls return a "match not found" result. The algorithm should find that the second patient record, which has the exact same PII as the first, matches the person generated in the first call.

Impact

This could be a bug in the algorithm, but it could also be a design flaw of DIBBS_BASIC. We need to investigate further to see what exactly is preventing the match.

Steps to reproduce

  1. Start up the API server using scripts/local_server.sh
  2. Download the attached Patient bundle json file
  3. Send the payload to the API server with curl:
     a. curl -X POST --header "Content-Type: application/json" -d @bundle.json "http://localhost:8000/link-record"
  4. The response should indicate no match was found (this is correct since the database was empty)
  5. Edit the bundle.json file to have a new UUID on line 9
  6. Repeat step 3
  7. The response again indicates no match was found (this is incorrect, since the record should match the existing patient in the database)

Expected behavior

No match on the first call and a match on the second call

Additional context

bundle.json

alhayward commented 1 month ago

Report

Summary

Approach

Findings

When reproducing the error above, the DIBBs Basic algorithm resulted in not a match (incorrect), while the DIBBs Enhanced algorithm resulted in a match (correct).

DIBBs Basic Pass 1

Pass 2

DIBBs Enhanced Pass 1

Pass 2

Why Was it Not a Match? (DIBBs Basic)

Let's call the original bundle.json Patient record i.

The RL service expects every payload to have a unique id (otherwise it throws an error). After seeding the MPI with the original Patient record i, we change the id of the record in order to send it back to test. Let's call this Patient record j.

The results of the RL algorithm when running it on Patient record j (the exact same as Patient record i, just with a different id) are detailed below.

Note: Interestingly, when first seeding the MPI with Patient record i, the RL algorithm stores 1 Patient record (with 1 unique Person ID). However, during matching, the algorithm evaluates this single Patient record i twice: once with the 'official' name and once with the 'maiden' name, both stored in bundle.json. Both candidates have the same Patient ID and Person ID. This tells us that if a single incoming Patient record has 2 names stored (e.g., Patient.name.where(use='official') and Patient.name.where(use='maiden')), the RL algorithm considers both names during matching.

Pass 1

Blocking

When Blocking, the RL algorithm successfully identified Patient record i as a candidate for Patient record j (they have the exact same DOB, last 4 of MRN, and Sex).

Matching

1st Candidate ('official' name)

First Name

Patient Record i

Patient Record j

When comparing "Verónica383 Eva64" and "Verónica383 Eva64 Verónica383 Eva64", the similarity score is 0.8971428571428571. This is less than the threshold 0.9, so Patient record i and Patient record j are deemed not a match in First Name.
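The reported score is consistent with a Jaro-Winkler similarity using the common defaults (prefix scale 0.1, prefix length capped at 4). That the RL service uses exactly this metric and these parameters is an assumption, not something stated in this thread. The sketch below is a plain-Python reimplementation for illustration only, not the service's code; the function names are hypothetical.

```python
def jaro(s1: str, s2: str) -> float:
    """Plain Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the shared prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


# "\u00f3" is the precomposed "ó", so the first string has 17 characters.
single = "Ver\u00f3nica383 Eva64"
doubled = "Ver\u00f3nica383 Eva64 Ver\u00f3nica383 Eva64"
score = jaro_winkler(single, doubled)
print(score, score >= 0.9)
```

All 17 characters of the shorter string match in order, so the Jaro score is (1 + 17/35 + 1) / 3, and the 4-character prefix bonus is not enough to lift the result to the 0.9 threshold.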

Last Name

Patient Record i

Patient Record j

When comparing "Reynoso837" and "Reynoso837", the similarity score is 1.0 (exact match). Since the evaluation match function used is feature_match_exact(), the similarity score is not evaluated against a threshold; the result is either a match (1.0) or not (< 1.0). Thus, Patient record i and Patient record j are deemed a match in Last Name.

Pass 1, 1st Candidate: Match or Non-Match?

In the DIBBs Basic algorithm, in Pass 1, Patient record i and Patient record j must be deemed a match in both the matching fields evaluated: First Name and Last Name. Because Patient record i and Patient record j were deemed a match in Last Name but not a match in First Name, the RL algorithm decides in Pass 1 that Patient record i and Patient record j are not a match.
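The "every field must match" rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name pass_is_match, the field keys, and the use of >= for the threshold comparison are hypothetical, not taken from the RL codebase.

```python
FIRST_NAME_THRESHOLD = 0.9  # threshold quoted in the report


def pass_is_match(field_results: dict) -> bool:
    """A pass is a match only if every evaluated field is a match."""
    return all(field_results.values())


# Values from Pass 1, 1st candidate ('official' name).
first_name_score = 0.8971428571428571
results = {
    # Fuzzy field: compared against a threshold (>= is an assumption).
    "first_name": first_name_score >= FIRST_NAME_THRESHOLD,
    # Exact field: either equal or not, no threshold involved.
    "last_name": "Reynoso837" == "Reynoso837",
}
print(results, pass_is_match(results))
```

Under this rule a single failing field, here First Name, is enough to reject the candidate for the whole pass.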

2nd Candidate ('maiden' name)

Interestingly, the RL algorithm runs Pass 1 again on another set of data: the same Patient record i, but with the 'maiden' Last Name (rather than the 'official' Last Name used above; the 'official' and 'maiden' Last Names are stored separately in bundle.json).

First Name

Patient Record i

Patient Record j

Again, when comparing "Verónica383 Eva64" and "Verónica383 Eva64 Verónica383 Eva64", the similarity score is 0.8971428571428571. This is less than the threshold 0.9, so Patient record i and Patient record j are deemed not a match in First Name.

Last Name

Patient Record i

Patient Record j

When comparing "Reynoso837" and "Arenas932", the similarity score is 0.0 (non-match). Thus, Patient record i and Patient record j are deemed not a match in Last Name.

Pass 1, 2nd Candidate: Match or Non-Match?

Because Patient record i and Patient record j were deemed not a match in both First Name and Last Name, the RL algorithm decides in Pass 1 that Patient record i and Patient record j are not a match.

Pass 2

Blocking

When Blocking, the RL algorithm successfully identified Patient record i as a candidate match for Patient record j (they have the exact same Zip Code, first 4 of First Name, first 4 of Last Name, and Sex).

Matching

1st Candidate ('official' name)

Address

Patient Record i

Patient Record j

When comparing "240 Rippin Ranch Apt 66" and "240 Rippin Ranch Apt 66", the similarity score is 1.0 (exact match). This is greater than the threshold 0.9, so Patient record i and Patient record j are deemed a match in Address.

DOB

Patient Record i

Patient Record j

When comparing 1980-09-05 and 1980-09-05, for some reason the similarity score is 0.0 (non-match). 👈🏼 bug

🐛 This is a typing issue: because the values are not of the same data type (the incoming Patient record's DOB is of type str, whereas the DOB of the existing Patient record in the MPI is of type datetime.date), their equality is evaluated as false. The code inconsistently transforms DOB values from string to datetime.date; this can be improved. IMO, typing (and other data transformations/normalization) should occur outside of matching functionality.
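The effect can be reproduced in isolation. The sketch below assumes, as confirmed later in this thread, that the incoming birthdate is a str and the stored one is a datetime.date. The helper normalize_dob is hypothetical and only illustrates normalizing types before the exact comparison; it is not the fix that was applied to the service.

```python
import datetime


def normalize_dob(value):
    """Coerce a birthdate to datetime.date so that comparisons are type-safe."""
    # datetime.datetime is a subclass of datetime.date, so check it first.
    if isinstance(value, datetime.datetime):
        return value.date()
    if isinstance(value, datetime.date):
        return value
    if isinstance(value, str):
        return datetime.date.fromisoformat(value)  # expects YYYY-MM-DD
    raise TypeError(f"unsupported birthdate type: {type(value).__name__}")


incoming_dob = "1980-09-05"               # parsed from the payload
stored_dob = datetime.date(1980, 9, 5)    # read back from the MPI

# A str never equals a date, even when both represent the same day.
naive_equal = incoming_dob == stored_dob
# After normalization, both sides are dates and compare equal.
normalized_equal = normalize_dob(incoming_dob) == normalize_dob(stored_dob)
print(naive_equal, normalized_equal)
```

Normalizing once, before any matching function runs, keeps the comparison functions free of type-handling logic.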

Pass 2, 1st Candidate: Match or Non-Match?

In the DIBBs Basic algorithm, in Pass 2, Patient record i and Patient record j must be deemed a match in both the matching fields evaluated: Address and DOB. Because Patient record i and Patient record j were deemed a match in Address but not a match in DOB, the RL algorithm decides in Pass 2 that Patient record i and Patient record j are not a match.

2nd Candidate ('maiden' name)

The results are the same as above when using the 'maiden' name, since Last Name is not evaluated for matching in Pass 2.

🐛 Note, however, that this may identify a bug in the Blocking code: a record with a Last Name of "Arenas932" ('maiden') should not have been identified as a candidate in Pass 2, because Blocking should be on the first 4 of Last Name (i.e., candidates should only include records that match exactly on the first 4 of Last Name, and "Aren" != "Reyn"). This may still be valid, though, if the logic treats a Patient record as a candidate whenever it matches any value within a Blocking field that has multiple values (i.e., the values for first 4 of Last Name are ["Aren", "Reyn"], and while "Aren" != "Reyn", "Reyn" == "Reyn"), after which all values are evaluated.
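The "match on any value" interpretation can be sketched as a set intersection over blocking keys. This is only an illustration of that hypothesis; blocking_keys and is_candidate are hypothetical names, and whether the RL blocking code actually behaves this way is exactly the open question above.

```python
def blocking_keys(last_names, length=4):
    """Return the set of first-N-character keys for all of a record's last names."""
    return {name[:length] for name in last_names}


def is_candidate(incoming_last_names, stored_last_names):
    """Treat a stored record as a candidate if ANY blocking key is shared."""
    shared = blocking_keys(incoming_last_names) & blocking_keys(stored_last_names)
    return bool(shared)


stored = ["Reynoso837", "Arenas932"]  # 'official' and 'maiden' names of record i

# One shared key ("Reyn") is enough, even though "Aren" != "Reyn".
print(is_candidate(["Reynoso837"], stored))
# With no shared key, the record would not be blocked in as a candidate.
print(is_candidate(["Arenas932"], ["Reynoso837"]))
```

If blocking works like this, the 'maiden' candidate in Pass 2 is expected behavior rather than a bug, because the whole record is admitted once any one of its names shares a key.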

Pass 2, 2nd Candidate: Match or Non-Match?

In the DIBBs Basic algorithm, in Pass 2, Patient record i and Patient record j must be deemed a match in both the matching fields evaluated: Address and DOB. Because Patient record i and Patient record j were deemed a match in Address but not a match in DOB, the RL algorithm decides in Pass 2 that Patient record i and Patient record j are not a match.

Why Was it a Match? (DIBBs Enhanced)

Using the DIBBs Enhanced algorithm, the results are the same as above except that Patient records match on DOB (correct). This is because the DIBBs Enhanced algorithm does not use feature_match_exact(), where the date typing bug was, but instead uses feature_match_log_odds_fuzzy_compare(), which does account for date type normalization.

Because the matching function used by the DIBBs Enhanced algorithm does not suffer from the same bug as the one used by DIBBs Basic, it correctly matched Patient record i and Patient record j on DOB. DOB has a high log odds score (10.126641103800338), which lifts the overall match score together with the match on Address, which similarly has a high log odds score (8.438284928858774). In fact, DOB and Address have the highest log odds scores of all match fields. Because the Address and DOB values were exact matches in Pass 2, the record pair received the full log odds scores for both fields:

(1.0 exact match similarity score for DOB x 10.126641103800338 log odds score for DOB) + (1.0 exact match similarity score for Address x 8.438284928858774 log odds score for Address) = 18.564926032659113, which is greater than 17.0, the true match threshold for Pass 2 of the DIBBs Enhanced algorithm.
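The arithmetic above can be checked with a short sketch. The log odds values and the 17.0 threshold are the ones quoted in this report; the function name pass_score and the dictionary keys are hypothetical.

```python
# Log odds scores and threshold as quoted in the report.
LOG_ODDS = {"birthdate": 10.126641103800338, "address": 8.438284928858774}
TRUE_MATCH_THRESHOLD = 17.0


def pass_score(similarity: dict, log_odds: dict) -> float:
    """Sum each field's similarity score weighted by its log odds score."""
    return sum(similarity[field] * log_odds[field] for field in log_odds)


# Both fields are exact matches, so each contributes its full log odds score.
total = pass_score({"birthdate": 1.0, "address": 1.0}, LOG_ODDS)
print(total, total > TRUE_MATCH_THRESHOLD)

# For comparison: if DOB had scored 0.0, Address alone could not reach 17.0.
address_only = pass_score({"birthdate": 0.0, "address": 1.0}, LOG_ODDS)
print(address_only, address_only > TRUE_MATCH_THRESHOLD)
```

The second calculation shows why the DOB comparison is decisive in this pass: neither field on its own clears the threshold.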

Thus, Patient record i and Patient record j are deemed a match by the DIBBs Enhanced algorithm. The DIBBs Enhanced algorithm still suffers from the same parsing issue with First and Last Name as the DIBBs Basic algorithm. However, because First and Last Name aren't evaluated in Pass 2 matching, and the DIBBs Enhanced algorithm doesn't suffer from the typing bug, it deemed the records a match.

Solution

ericbuckley commented 1 month ago

@alhayward this is awesome, you uncovered two bugs when I was thinking there might just be one! Regarding the issue with the feature_match_exact function: is the data parsed from the payload (so incoming Patient data) a string and the data queried from the DB (existing Patient data) a datetime.date object?

alhayward commented 1 month ago

@ericbuckley The above comment is a draft - will continue to update with my findings and address this question!

alhayward commented 1 month ago

@alhayward this is awesome, you uncovered two bugs when I was thinking there might just be one! Regarding the issue with the feature_match_exact function: is the data parsed from the payload (so incoming Patient data) a string and the data queried from the DB (existing Patient data) a datetime.date object?

@ericbuckley Correct, the birthdate parsed from the incoming Patient record is of type str, whereas the birthdate parsed from the existing Patient record in the MPI is of type datetime.date.