ericbuckley commented 1 month ago

Summary

When the attached patient payload is sent to the link-record endpoint twice, both calls return a "match not found" result. The algorithm should recognize that the second patient record, which has exactly the same PII as the first, matches the person created by the first call.

Impact

This could be a bug in the algorithm, but it could also be a design flaw of DIBBS_BASIC. We need to investigate further to see exactly why this is not a match.

Steps to reproduce

1. Start up the API server using scripts/local_server.sh
2. Download the attached Patient bundle JSON file
3. Send the payload to the API server with curl:
   a. curl -X POST --header "Content-Type: application/json" -d @bundle.json "http://localhost:8000/link-record"
4. The response should indicate no match was found (this is correct since the database was empty)
5. Edit the bundle.json file to have a new UUID on line 9
6. Repeat step 3
7. The response should indicate no match was found (this is incorrect since we should match the existing patient in the database)
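
The manual steps above could also be scripted. The following is only a sketch: it assumes the UUID edited by hand is the `id` of the Patient resource somewhere inside the JSON payload (the exact layout of bundle.json is not shown in this issue), and `with_new_patient_id` and `post_bundle` are hypothetical helper names, not part of RecordLinker.

```python
import copy
import json
import urllib.request
import uuid

LINK_RECORD_URL = "http://localhost:8000/link-record"  # taken from the curl command


def with_new_patient_id(payload):
    """Return a deep copy of the payload in which every Patient resource has a fresh UUID.

    Assumption: the UUID that is edited by hand is the `id` of a dict whose
    `resourceType` is "Patient", wherever it is nested in the payload.
    """
    clone = copy.deepcopy(payload)

    def walk(node):
        if isinstance(node, dict):
            if node.get("resourceType") == "Patient":
                node["id"] = str(uuid.uuid4())
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(clone)
    return clone


def post_bundle(payload, url=LINK_RECORD_URL):
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, loading bundle.json with `json.load`, calling `post_bundle(original)` and then `post_bundle(with_new_patient_id(original))` would correspond to steps 3 through 7.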

Expected behavior

No match on the first call and a match on the second call

Additional context

bundle.json

alhayward commented 1 month ago

Report

Summary

Multiple Names Stored as Separate Patient Records. In the DIBBs Basic algorithm, when an incoming Patient record has multiple names stored (e.g., Patient.name.where(use='official') and Patient.name.where(use='maiden')), the MPI writes multiple Patient records, one with each name. Each of these records has different name data, but all have the same Person ID. Additionally, if one Patient record in a Person cluster is successfully blocked on, all Patient records in that Person cluster are collected and evaluated (to determine belongingness_ratio).
Unspecific Parsing. To parse First Name from an incoming Patient record, we use the Python package fhirpathpy. Specifically, we use fhirpathpy.compile(path), where path is the FHIR path at which the value can be found in the FHIR resource. To extract First Name, we pass fhirpathpy.compile(Patient.name.given) and select all values returned. Because we don't specify which name to use (.where(use='official') or .where(use='maiden')) and our selection criterion is all, every given name is fetched (including the given names stored under both 'official' and 'maiden'). In the case above, where a Patient record has both an 'official' name and a 'maiden' name whose last names differ but whose given names are the same, we see unexpected behavior: the parsed First Name value is 'Verónica383 Eva64 Verónica383 Eva64' (after str concatenation). This unexpected behavior applies to both the DIBBs Basic and the DIBBs Enhanced algorithms. (See utils.extract_value_with_resource_path() for more info.)
Typing Bug in feature_match_exact(). The DIBBs Basic algorithm uses feature_match_exact() as a matching function. Unlike the other matching functions, this function did not normalize date values parsed from the incoming Patient record (e.g., DOB) to datetime.date. As a result, DOB is of type datetime.date when queried from an existing Patient record in the MPI, but of type str when parsed from an incoming Patient record. In the DIBBs Basic algorithm, this caused the DOB equality check in feature_match_exact() to evaluate to false even for identical dates, because the data types differ. This bug does not apply to the DIBBs Enhanced algorithm, because that algorithm does not use feature_match_exact() (it uses feature_match_log_odds_fuzzy_compare(), which does include this date type normalization).
Explainability. This analysis points to the broader need for more explainability in the Record Linker algorithm down the line (why was this a match/non-match?). Users should be able to understand why the algorithm deemed two Patient records a match or not, and more broadly, why an incoming Patient record was matched or not matched to an existing Person cluster.

Approach

Used the Python logging module to log the similarity scores and resulting match/non-match decisions for each of the fields evaluated for matching. This provides transparency & explainability into why the RL algorithm determined two Patient records were a match or not.
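
For illustration, per-field logging of this kind might look like the sketch below. The logger name, the helper `log_field_comparison`, and the rule that a score equal to the threshold counts as a match are assumptions made for this example; this is not the code that was actually added to RecordLinker.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("record_linker.explain")  # hypothetical logger name


def log_field_comparison(field, incoming, existing, score, threshold=None):
    """Log one field comparison and return whether the field counts as a match.

    With a threshold (fuzzy comparison), the field matches when score >= threshold;
    treating equality as a match is an assumption. Without a threshold (exact
    comparison), only a score of 1.0 matches.
    """
    is_match = score >= threshold if threshold is not None else score == 1.0
    logger.info(
        "field=%s incoming=%r existing=%r score=%.4f threshold=%s match=%s",
        field, incoming, existing, score, threshold, is_match,
    )
    return is_match


# Values taken from the Pass 1 findings reported in this comment.
first_name_ok = log_field_comparison(
    "First Name", "Verónica383 Eva64 Verónica383 Eva64", "Verónica383 Eva64",
    score=0.8971428571428571, threshold=0.9,
)
last_name_ok = log_field_comparison("Last Name", "Reynoso837", "Reynoso837", score=1.0)
```

Each log line then records the compared values, the score, the threshold (if any), and the resulting decision, which is the information needed to answer "why was this a match/non-match?" for a single field.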

Findings

When reproducing the error above, the DIBBs Basic algorithm returned a non-match (incorrect), while the DIBBs Enhanced algorithm returned a match (correct).

DIBBs Basic

Pass 1

Block On: DOB, last 4 of MRN, Sex
Match On: First Name (fuzzy), Last Name (exact)

Pass 2

Block On: Zip Code, first 4 of First Name, first 4 of Last Name, Sex
Match On: Address (fuzzy), DOB (exact)

DIBBs Enhanced

Pass 1

Block On: DOB, last 4 of MRN, Sex
Match On: First Name (fuzzy), Last Name (fuzzy)

Pass 2

Block On: Zip Code, first 4 of First Name, first 4 of Last Name, Sex
Match On: Address (fuzzy), DOB (fuzzy)
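
To make the difference between the two configurations easier to see, the passes above can be written down as data. This is only a sketch for comparison purposes: `LinkagePass`, `ALGORITHMS`, and `differing_fields` are hypothetical names and do not reflect how RecordLinker stores its algorithm configuration.

```python
from dataclasses import dataclass


@dataclass
class LinkagePass:
    block_on: tuple  # fields that must agree for a record to become a candidate
    match_on: dict   # field name -> comparison mode ("exact" or "fuzzy")


PASS_2_BLOCKS = ("Zip Code", "First Name (first 4)", "Last Name (first 4)", "Sex")

ALGORITHMS = {
    "DIBBs Basic": (
        LinkagePass(("DOB", "MRN (last 4)", "Sex"),
                    {"First Name": "fuzzy", "Last Name": "exact"}),
        LinkagePass(PASS_2_BLOCKS, {"Address": "fuzzy", "DOB": "exact"}),
    ),
    "DIBBs Enhanced": (
        LinkagePass(("DOB", "MRN (last 4)", "Sex"),
                    {"First Name": "fuzzy", "Last Name": "fuzzy"}),
        LinkagePass(PASS_2_BLOCKS, {"Address": "fuzzy", "DOB": "fuzzy"}),
    ),
}


def differing_fields(first, second):
    """List (pass number, field) pairs whose comparison mode differs between two algorithms."""
    diffs = []
    passes = zip(ALGORITHMS[first], ALGORITHMS[second])
    for number, (pass_a, pass_b) in enumerate(passes, start=1):
        for field, mode in pass_a.match_on.items():
            if pass_b.match_on.get(field) != mode:
                diffs.append((number, field))
    return diffs


print(differing_fields("DIBBs Basic", "DIBBs Enhanced"))
# [(1, 'Last Name'), (2, 'DOB')]
```

The blocking fields are identical; the two algorithms differ only in how Last Name (Pass 1) and DOB (Pass 2) are compared.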

Why Was it Not a Match? (DIBBs Basic)

Let's call the original bundle.json Patient record i.

The RL service expects every payload to have a unique id (otherwise it throws an error). After seeding the MPI with the original Patient record i, we change the id of the record in order to send it back to test. Let's call this Patient record j.

The results of the RL algorithm when running it on Patient record j (the exact same as Patient record i, just with a different id) are detailed below.

Note: Interestingly, when first seeding the MPI with Patient record i, the RL algorithm stores 1 Patient record (with 1 unique Person ID). However, during matching, the algorithm evaluates this single Patient record i 2 times: once with the 'official' name and once with the 'maiden' name, both stored in bundle.json. Both of these candidates have the same Patient ID and Person ID. That tells us that, if a single incoming Patient record has 2 names stored (e.g., Patient.name.where(use='official') and Patient.name.where(use='maiden')), the RL algorithm considers both names during matching.

Pass 1

Blocking

When Blocking, the RL algorithm successfully identified Patient record i as a candidate for Patient record j (they have the exact same DOB, last 4 of MRN, and Sex).

Matching

1st Candidate (`'official'` name)

First Name

Patient Record i

First Name: "Verónica383 Eva64"

Patient Record j

First Name: "Verónica383 Eva64 Verónica383 Eva64" 👈🏼 bug

🐛 Before concatenating to a string, we should ensure the list of First Name values contains only those associated with the Last Name they are stored with (in this case, either 'official' or 'maiden', but not both/all).
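
The effect can be reproduced without fhirpathpy. The sketch below emulates in plain Python what selecting every value of Patient.name.given returns, and contrasts it with grouping given names by name entry. The `patient` dict is reconstructed from the values reported in this comment (it is not the real bundle.json), and both helper functions are hypothetical.

```python
# Reconstructed from the names reported in this comment; not the real bundle.json.
patient = {
    "resourceType": "Patient",
    "name": [
        {"use": "official", "family": "Reynoso837", "given": ["Verónica383", "Eva64"]},
        {"use": "maiden", "family": "Arenas932", "given": ["Verónica383", "Eva64"]},
    ],
}


def all_given_names(resource):
    """Emulate selecting all values of Patient.name.given (every name entry)."""
    return [given for name in resource.get("name", []) for given in name.get("given", [])]


def given_names_by_use(resource):
    """Keep given names grouped with the name entry (use) they belong to."""
    return {name.get("use"): list(name.get("given", [])) for name in resource.get("name", [])}


print(" ".join(all_given_names(patient)))
# Verónica383 Eva64 Verónica383 Eva64

print(" ".join(given_names_by_use(patient)["official"]))
# Verónica383 Eva64
```

Grouping by name entry keeps each First Name paired with the Last Name it was stored with, which is the behavior suggested above.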

When comparing "Verónica383 Eva64" and "Verónica383 Eva64 Verónica383 Eva64", the similarity score is 0.8971428571428571. This is less than the threshold 0.9, so Patient record i and Patient record j are deemed not a match in First Name.
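
This comment does not name the fuzzy comparison function, so the following is an assumption: a standard Jaro-Winkler similarity (prefix scale 0.1, prefix capped at 4 characters) happens to reproduce the reported score for these two strings. The implementation below is a self-contained sketch written for this check, not the function RecordLinker calls.

```python
def jaro(s1, s2):
    """Standard Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and no further apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


existing = "Verónica383 Eva64"
incoming = "Verónica383 Eva64 Verónica383 Eva64"
print(round(jaro_winkler(existing, incoming), 4))  # 0.8971, just under the 0.9 threshold
```

Under this assumption the arithmetic is: all 17 characters of the shorter string match in order, so Jaro = (17/17 + 17/35 + 1) / 3 ≈ 0.8286, and the 4-character common prefix adds 0.4 × (1 − 0.8286) ≈ 0.0686.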

Last Name

Patient Record i

Last Name: "Reynoso837"

Patient Record j

Last Name: "Reynoso837"

When comparing "Reynoso837" and "Reynoso837", the similarity score is 1.0 (exact match). Since the evaluation match function used is feature_match_exact(), the similarity score is not evaluated against a threshold; the result is either a match (1.0) or not (< 1.0). Thus, Patient record i and Patient record j are deemed a match in Last Name.

Pass 1, 1st Candidate: Match or Non-Match?

In the DIBBs Basic algorithm, in Pass 1, Patient record i and Patient record j must be deemed a match in both the matching fields evaluated: First Name and Last Name. Because Patient record i and Patient record j were deemed a match in Last Name but not a match in First Name, the RL algorithm decides in Pass 1 that Patient record i and Patient record j are not a match.
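
The decision rule described above can be sketched as follows. `PASS_1_RULES`, `field_matches`, and `pass_is_match` are hypothetical names, and treating a score equal to the threshold as a match is an assumption.

```python
# Field -> fuzzy threshold; None means the field uses an exact comparison.
PASS_1_RULES = {"First Name": 0.9, "Last Name": None}


def field_matches(score, threshold):
    """Exact fields need a perfect score; fuzzy fields need score >= threshold (assumed)."""
    if threshold is None:
        return score == 1.0
    return score >= threshold


def pass_is_match(scores, rules):
    """DIBBs Basic, as described above: every evaluated field must match."""
    return all(field_matches(scores[field], threshold) for field, threshold in rules.items())


# Scores reported for the 1st candidate in Pass 1.
scores = {"First Name": 0.8971428571428571, "Last Name": 1.0}
print(pass_is_match(scores, PASS_1_RULES))  # False: First Name falls below 0.9
```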

2nd Candidate (`'maiden'` name)

Interestingly, the RL algorithm runs Pass 1 again on another set of data: the same Patient record i, but with the 'maiden' Last Name (rather than the 'official' Last Name, which was used above; 'official' and 'maiden' Last Name are stored separately in the bundle.json.)

First Name

Patient Record i

First Name: "Verónica383 Eva64"

Patient Record j

First Name: "Verónica383 Eva64 Verónica383 Eva64" 👈🏼 bug (same as above)

Again, when comparing "Verónica383 Eva64" and "Verónica383 Eva64 Verónica383 Eva64", the similarity score is 0.8971428571428571. This is less than the threshold 0.9, so Patient record i and Patient record j are deemed not a match in First Name.

Last Name

Patient Record i

Last Name: "Reynoso837"

Patient Record j

Last Name: "Arenas932"

When comparing "Reynoso837" and "Arenas932", the similarity score is 0.0 (non-match). Thus, Patient record i and Patient record j are deemed not a match in Last Name.

Pass 1, 2nd Candidate: Match or Non-Match?

Because Patient record i and Patient record j were deemed not a match in both First Name and Last Name, the RL algorithm decides in Pass 1 that Patient record i and Patient record j are not a match.

Pass 2

Blocking

When Blocking, the RL algorithm successfully identified Patient record i as a candidate match for Patient record j (they have the exact same Zip Code, first 4 of First Name, first 4 of Last Name, and Sex).

Matching

1st Candidate (`'official'` name)

Address

Patient Record i

Address: "240 Rippin Ranch Apt 66"

Patient Record j

Address: "240 Rippin Ranch Apt 66"

When comparing "240 Rippin Ranch Apt 66" and "240 Rippin Ranch Apt 66", the similarity score is 1.0 (exact match). This is greater than the threshold 0.9, so Patient record i and Patient record j are deemed a match in Address.

DOB

Patient Record i

DOB: 1980-09-05

Patient Record j

DOB: 1980-09-05

When comparing 1980-09-05 and 1980-09-05, for some reason the similarity score is 0.0 (non-match). 👈🏼 bug

🐛 This is a typing issue: because the values are not of the same data type (Patient record j's DOB, parsed from the incoming payload, is of type str, whereas Patient record i's DOB, queried from the MPI, is of type datetime.date), their equality evaluates to false. The code inconsistently transforms DOB values from str to datetime.date; this can be optimized. IMO, typing (and other data transformations/normalization) should occur outside of matching functionality.
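
A minimal reproduction of the type mismatch, assuming (as confirmed later in this thread) that the incoming value is a str and the stored value is a datetime.date. The helpers `to_date`, `exact_match_naive`, and `exact_match_dates` are hypothetical and only illustrate the idea; they are not the actual feature_match_exact() code.

```python
import datetime


def to_date(value):
    """Normalize a str, datetime, or date value to datetime.date."""
    if isinstance(value, datetime.datetime):  # check first: datetime is a subclass of date
        return value.date()
    if isinstance(value, datetime.date):
        return value
    return datetime.date.fromisoformat(value)  # expects "YYYY-MM-DD"


def exact_match_naive(incoming, existing):
    """Compare the raw values, as in the buggy behavior described above."""
    return 1.0 if incoming == existing else 0.0


def exact_match_dates(incoming, existing):
    """Compare after normalizing both sides to datetime.date."""
    return 1.0 if to_date(incoming) == to_date(existing) else 0.0


incoming_dob = "1980-09-05"                # parsed from the payload
existing_dob = datetime.date(1980, 9, 5)   # queried from the MPI

print(exact_match_naive(incoming_dob, existing_dob))  # 0.0
print(exact_match_dates(incoming_dob, existing_dob))  # 1.0
```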

Pass 2, 1st Candidate: Match or Non-Match?

In the DIBBs Basic algorithm, in Pass 2, Patient record i and Patient record j must be deemed a match in both the matching fields evaluated: Address and DOB. Because Patient record i and Patient record j were deemed a match in Address but not a match in DOB, the RL algorithm decides in Pass 2 that Patient record i and Patient record j are not a match.

2nd Candidate (`'maiden'` name)

The results are the same as above when using the 'maiden' name, since Last Name is not evaluated for matching in Pass 2.

🐛 Note, however, that this may identify a bug in the Blocking code: a record with Last Name of "Arenas932" ("maiden") should not have been identified as a candidate in Pass 2, because Blocking should be on first 4 of Last Name (i.e., candidates should only include records that match exactly on first 4 of Last Name, and "Aren" != "Reyn"). This may still be valid though, if the logic is set up such that a Patient record is considered a candidate when it matches any value within a Blocking field that has multiple values (i.e., values for first 4 of Last Name are ["Aren", "Reyn"], and while "Aren" != "Reyn", "Reyn" == "Reyn"), and all values are then evaluated.
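
The "any value matches" interpretation could look like the sketch below. Whether the Blocking code actually works this way has not been verified here; `blocking_keys` and `is_candidate` are hypothetical helpers that only illustrate the hypothesis.

```python
def blocking_keys(last_names, length=4):
    """First `length` characters of every Last Name stored on a record."""
    return {name[:length] for name in last_names}


def is_candidate(incoming_last_names, stored_last_names):
    """Candidate if ANY blocking key is shared between the two records."""
    return bool(blocking_keys(incoming_last_names) & blocking_keys(stored_last_names))


# A record with two Last Names shares "Aren" with a record that only has "Arenas932".
print(is_candidate(["Reynoso837", "Arenas932"], ["Arenas932"]))  # True
# With a single Last Name on each side, "Reyn" vs "Aren" does not block together.
print(is_candidate(["Reynoso837"], ["Arenas932"]))               # False
```

Under this logic, a record carrying both names would be pulled in as a candidate through either name, which would explain why the 'maiden' candidate appears in Pass 2.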

Pass 2, 2nd Candidate: Match or Non-Match?

In the DIBBs Basic algorithm, in Pass 2, Patient record i and Patient record j must be deemed a match in both the matching fields evaluated: Address and DOB. Because Patient record i and Patient record j were deemed a match in Address but not a match in DOB, the RL algorithm decides in Pass 2 that Patient record i and Patient record j are not a match.

Why Was it Not a Match? (DIBBs Enhanced)

Using the DIBBs Enhanced algorithm, the results are the same as above except that Patient records match on DOB (correct). This is because the DIBBs Enhanced algorithm does not use feature_match_exact(), where the date typing bug was, but instead uses feature_match_log_odds_fuzzy_compare(), which does account for date type normalization.

Because the matching function used by the DIBBs Enhanced algorithm did not suffer from the same bug as the one used by DIBBs Basic, it ultimately correctly matched Patient record i and Patient record j due to their match on DOB. DOB has a high log odds score (10.126641103800338), which boosts the DOB match score above the DOB match threshold and, together with the match on Address, which similarly has a high log odds score (8.438284928858774), boosts the overall match score. In fact, DOB and Address have the highest log odds scores out of all match fields. Therefore, because the Address and DOB values were exact matches in Pass 2, they benefited from the full log odds scores for DOB and Address:

1.0 exact match similarity score for DOB x 10.126641103800338 log odds score for DOB + 1.0 exact match similarity score for Address x 8.438284928858774 log odds score for Address = 18.564926032659113, which is greater than 17.0, the true match threshold for Pass 2 of DIBBs Enhanced algorithm.
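
The same arithmetic as a short sketch. The log odds values and the 17.0 threshold are the ones quoted above; `weighted_score` is a hypothetical helper, and treating a total equal to the threshold as a match is an assumption.

```python
LOG_ODDS = {"DOB": 10.126641103800338, "Address": 8.438284928858774}
PASS_2_THRESHOLD = 17.0  # true match threshold for Pass 2 of DIBBs Enhanced


def weighted_score(similarities, log_odds=LOG_ODDS):
    """Sum of similarity score x log odds score over the evaluated fields."""
    return sum(similarities[field] * log_odds[field] for field in similarities)


both_exact = weighted_score({"DOB": 1.0, "Address": 1.0})
print(both_exact >= PASS_2_THRESHOLD)  # True: about 18.56 against a threshold of 17.0

# If DOB had scored 0.0, Address alone (about 8.44) could not reach the threshold.
dob_missed = weighted_score({"DOB": 0.0, "Address": 1.0})
print(dob_missed >= PASS_2_THRESHOLD)  # False
```

This also shows how much the Pass 2 decision depends on DOB: without its contribution, a perfect Address match is not enough.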

Thus, Patient record i and Patient record j are deemed a match by the DIBBs Enhanced algorithm. The DIBBs Enhanced algorithm still suffers from the same First and Last Name parsing issue as the DIBBs Basic algorithm; however, because First and Last Name aren't evaluated in Pass 2 matching, and the DIBBs Enhanced algorithm doesn't suffer from the typing bug, it deemed the records a match.

Solution

Normalize DOB type. For now, normalize the DOB of the incoming Patient record to datetime.date upon match evaluation, before comparing it with the DOB of the existing Patient record. In the future, however, we should standardize DOB type elsewhere, and only once. (Currently this datetime.date type conversion happens in several places. It is best done more DRYly and separately from matching, along with any other data transformations/normalizations of match fields.)
There is current work to refactor how incoming Patient records are flattened (#23), which should handle cases where a Patient record contains multiple names (e.g., 'official' and 'maiden'). First Name and Last Name values should be grouped accordingly, and the algorithm should be consistent and transparent in which of these name values is used for Blocking.

ericbuckley commented 1 month ago

@alhayward this is awesome, you uncovered two bugs, when I was thinking there might just be one! Regarding the issue with the feature_match_exact function. Is the data parsed from the payload (so incoming Patient data) a string and the data queried from the DB (existing Patient data) a datetime.date object?

alhayward commented 1 month ago

@ericbuckley The above comment is a draft - will continue to update with my findings and address this question!

alhayward commented 1 month ago

> @alhayward this is awesome, you uncovered two bugs, when I was thinking there might just be one! Regarding the issue with the feature_match_exact function. Is the data parsed from the payload (so incoming Patient data) a string and the data queried from the DB (existing Patient data) a datetime.date object?

@ericbuckley Correct, the birthdate parsed from the incoming Patient record is of type str, whereas the birthdate parsed from the existing Patient record in the MPI is of type datetime.date.

CDCgov / RecordLinker

Duplicate patients with multiple last names not matching #17
