Closed ericbuckley closed 1 month ago
Patient.name.where(use='official') and Patient.name.where(use='maiden')), the MPI writes multiple Patient records, one with each name. Each of these records has different name data, but all have the same Person ID. Additionally, if one Patient record in a Person cluster is successfully blocked on, all Patient records in that Person cluster are collected and evaluated (to determine belongingness_ratio).
fhirpathpy.compile(path), where path is the FHIR path at which the value can be found in the FHIR resource. To extract First Name, we pass fhirpathpy.compile(Patient.name.given) and select all values returned. Because we don't specify which name to use (.where(use='official') or .where(use='maiden')) and our selection criterion is all, all given names are fetched (including the given names stored under both 'official' and 'maiden'). In the case above, where a Patient record has both an 'official' name and a 'maiden' name stored, with different last names but the same given names, we see unexpected behavior: the parsed First Name value is 'Verónica383 Eva64 Verónica383 Eva64' (after str concatenation), since the selection criterion is all. This unexpected behavior applies to both the DIBBs Basic and the DIBBs Enhanced algorithms. (See utils.extract_value_with_resource_path() for more info.)
feature_match_exact(). The DIBBs Basic algorithm uses feature_match_exact() as a matching function. Unlike the other matching functions, when parsing date values from the incoming Patient record (e.g., DOB), this function did not normalize the data type to datetime.date. As a result, DOB is of type datetime.date when queried from an existing Patient record in the MPI, but of type str when parsed from an incoming Patient record. When using the DIBBs Basic algorithm, the equality check on DOB in feature_match_exact() therefore evaluated to false, because the data types are different. This bug does not apply to the DIBBs Enhanced algorithm, because that algorithm does not use feature_match_exact() (it uses feature_match_log_odds_fuzzy_compare(), which does include this date type normalization).
logging module to log the similarity scores and resulting match/non-match decisions for each of the fields evaluated for matching. This provides transparency and explainability into why the RL algorithm determined two Patient records were a match or not.
When reproducing the error above, the DIBBs Basic algorithm resulted in not a match (incorrect), while the DIBBs Enhanced algorithm resulted in a match (correct).
DIBBs Basic Pass 1
Pass 2
DIBBs Enhanced Pass 1
Pass 2
Let's call the original bundle.json Patient record i.
The RL service expects every payload to have a unique id (otherwise it throws an error). After seeding the MPI with the original Patient record i, we change the id of the record in order to send it back to test. Let's call this Patient record j.
The results of the RL algorithm when running it on Patient record j (the exact same as Patient record i, just with a different id) are detailed below.
Note: Interestingly, when first seeding the MPI with Patient record i, the RL algorithm stores 1 Patient record (with 1 unique Person ID). However, during matching, the algorithm evaluates this single Patient record i 2 times: once with the 'official' name and once with the 'maiden' name, both stored in bundle.json. Both of these candidates have the same Patient ID and Person ID. That tells us that, if a single incoming Patient record has 2 names stored (e.g., Patient.name.where(use='official') and Patient.name.where(use='maiden')), the RL algorithm considers both sets of name data during matching.
When Blocking, the RL algorithm successfully identified Patient record i as a candidate for Patient record j (they have the exact same DOB, last 4 of MRN, and Sex).
'official' name)
Patient Record i: "Verónica383 Eva64"
Patient Record j: "Verónica383 Eva64 Verónica383 Eva64" 👈🏼 bug
🐛 Before concatenating to a string, we should ensure the list of First Name values contains only those associated with the respective Last Name they are stored with (in this case, either 'official' or 'maiden', but not both/all).
When comparing "Verónica383 Eva64" and "Verónica383 Eva64 Verónica383 Eva64", the similarity score is 0.8971428571428571. This is less than the threshold 0.9, so Patient record i and Patient record j are deemed not a match in First Name.
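To make the First Name concatenation issue concrete, here is a minimal sketch in plain Python. It does not call fhirpathpy or the RL service; the trimmed-down `patient` dict and both helper functions are hypothetical and only mimic the difference between selecting every given name and selecting the given names of one name entry.

```python
# Hypothetical, trimmed-down Patient resource (assumption: only the name
# fields discussed in this issue are shown; the real bundle.json has more).
patient = {
    "resourceType": "Patient",
    "name": [
        {"use": "official", "family": "Reynoso837", "given": ["Verónica383", "Eva64"]},
        {"use": "maiden", "family": "Arenas932", "given": ["Verónica383", "Eva64"]},
    ],
}


def given_names_all(resource):
    """Mimic selecting all values at Patient.name.given (no filter on use)."""
    return [g for name in resource.get("name", []) for g in name.get("given", [])]


def given_names_for_use(resource, use):
    """Illustrative helper: keep only given names stored under one name entry."""
    return [
        g
        for name in resource.get("name", [])
        if name.get("use") == use
        for g in name.get("given", [])
    ]


first_name_all = " ".join(given_names_all(patient))
first_name_official = " ".join(given_names_for_use(patient, "official"))

print(first_name_all)       # given names from both name entries, concatenated
print(first_name_official)  # given names from the 'official' entry only
```

Grouping given names by the name entry they belong to is one possible way to keep First Name and Last Name values paired; which name entry should be preferred is a design decision this sketch does not settle.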
Patient Record i: "Reynoso837"
Patient Record j: "Reynoso837"
When comparing "Reynoso837" and "Reynoso837", the similarity score is 1.0 (exact match). Since the evaluation match function used is feature_match_exact(), the similarity score is not evaluated against a threshold; the result is either a match (1.0) or not (< 1.0). Thus, Patient record i and Patient record j are deemed a match in Last Name.
In the DIBBs Basic algorithm, in Pass 1, Patient record i and Patient record j must be deemed a match in both the matching fields evaluated: First Name and Last Name. Because Patient record i and Patient record j were deemed a match in Last Name but not a match in First Name, the RL algorithm decides in Pass 1 that Patient record i and Patient record j are not a match.
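The Pass 1 decision described above can be sketched as follows. This is an illustration only, not the service's code: `exact_match_sketch` and `pass_is_match` are hypothetical names, and the First Name similarity score and 0.9 threshold are simply the values reported in this issue.

```python
def exact_match_sketch(a, b):
    """Illustrative stand-in for an exact comparison: 1.0 if equal, else 0.0."""
    return 1.0 if a == b else 0.0


def pass_is_match(field_results):
    """A pass is a match only if every evaluated field is a match."""
    return all(field_results.values())


FIRST_NAME_THRESHOLD = 0.9
first_name_score = 0.8971428571428571  # score reported above for First Name

pass_1_fields = {
    "first_name": first_name_score >= FIRST_NAME_THRESHOLD,
    "last_name": exact_match_sketch("Reynoso837", "Reynoso837") == 1.0,
}

pass_1_result = pass_is_match(pass_1_fields)
print(pass_1_fields, pass_1_result)
```

Because every field must match, the single First Name miss is enough to fail the whole pass, even though Last Name matched exactly.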
'maiden' name)
Interestingly, the RL algorithm runs Pass 1 again on another set of data: the same Patient record i, but with the 'maiden' Last Name (rather than the 'official' Last Name, which was used above; 'official' and 'maiden' Last Name are stored separately in the bundle.json).
Patient Record i: "Verónica383 Eva64"
Patient Record j: "Verónica383 Eva64 Verónica383 Eva64" 👈🏼 bug (same as above)
Again, when comparing "Verónica383 Eva64" and "Verónica383 Eva64 Verónica383 Eva64", the similarity score is 0.8971428571428571. This is less than the threshold 0.9, so Patient record i and Patient record j are deemed not a match in First Name.
Patient Record i: "Reynoso837"
Patient Record j: "Arenas932"
When comparing "Reynoso837" and "Arenas932", the similarity score is 0.0 (non-match). Thus, Patient record i and Patient record j are deemed not a match in Last Name.
Because Patient record i and Patient record j were deemed not a match in both First Name and Last Name, the RL algorithm decides in Pass 1 that Patient record i and Patient record j are not a match.
When Blocking, the RL algorithm successfully identified Patient record i as a candidate match for Patient record j (they have the exact same Zip Code, first 4 of First Name, first 4 of Last Name, and Sex).
'official' name)
Patient Record i: "240 Rippin Ranch Apt 66"
Patient Record j: "240 Rippin Ranch Apt 66"
When comparing "240 Rippin Ranch Apt 66" and "240 Rippin Ranch Apt 66", the similarity score is 1.0 (exact match). This is greater than the threshold 0.9, so Patient record i and Patient record j are deemed a match in Address.
Patient Record i: 1980-09-05
Patient Record j: 1980-09-05
When comparing 1980-09-05 and 1980-09-05, for some reason the similarity score is 0.0 (non-match). 👈🏼 bug
🐛 This is a typing issue; because the values are not of the same data type (the incoming Patient record j's DOB is of type str vs. the existing Patient record i's DOB is of type datetime.date), their equality is evaluated as false. The code inconsistently transforms DOB values from str to datetime.date; this can be optimized. IMO, typing (and other data transformations/normalization) should occur outside of matching functionality.
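The type mismatch described above is easy to reproduce with the standard library alone. The sketch below is not the service's code: `normalize_date` is a hypothetical helper, and it assumes the incoming DOB arrives as an ISO `YYYY-MM-DD` string, as in the values shown above.

```python
import datetime

existing_dob = datetime.date(1980, 9, 5)  # assumed type when queried from the MPI
incoming_dob = "1980-09-05"               # assumed type when parsed from the payload


def normalize_date(value):
    """Hypothetical helper: coerce a str/date/datetime DOB to datetime.date."""
    if isinstance(value, datetime.datetime):
        return value.date()
    if isinstance(value, datetime.date):
        return value
    return datetime.date.fromisoformat(value)


# Comparing a date object with a string is always False, even for the same day.
raw_equal = existing_dob == incoming_dob

# After normalizing both sides once, the comparison behaves as expected.
normalized_equal = normalize_date(existing_dob) == normalize_date(incoming_dob)

print(raw_equal, normalized_equal)
```

Running the normalization once, before any matching function is called, would keep every matching function working with the same type.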
In the DIBBs Basic algorithm, in Pass 2, Patient record i and Patient record j must be deemed a match in both the matching fields evaluated: Address and DOB. Because Patient record i and Patient record j were deemed a match in Address but not a match in DOB, the RL algorithm decides in Pass 2 that Patient record i and Patient record j are not a match.
'maiden' name)
The results are the same as above when using the 'maiden' name, since Last Name is not evaluated for matching in Pass 2.
🐛 Note, however, that this may identify a bug in the Blocking code: a record with Last Name of "Arenas932" ("maiden") should not have been identified as a candidate in Pass 2, because Blocking should be on first 4 of Last Name (i.e., candidates should only include records that match exactly on first 4 of Last Name, and "Aren" != "Reyn"). This may still be valid, though, if the logic is set up such that a Patient record is considered a candidate when it matches any value within a Blocking field that has multiple values (i.e., the values for first 4 of Last Name are ["Aren", "Reyn"], and while "Aren" != "Reyn", "Reyn" == "Reyn"), in which case all values are evaluated.
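The "match on any value" interpretation of Blocking mentioned above can be sketched like this. It is an assumption about how the Blocking logic might behave, not a description of the actual implementation; `block_key` and `is_candidate` are hypothetical names.

```python
def block_key(last_name):
    """First 4 characters of Last Name, the Blocking key discussed above."""
    return last_name[:4]


def is_candidate(incoming_last_names, existing_last_names):
    """Candidate if ANY incoming key equals ANY existing key (assumed logic)."""
    incoming_keys = {block_key(n) for n in incoming_last_names}
    existing_keys = {block_key(n) for n in existing_last_names}
    return not incoming_keys.isdisjoint(existing_keys)


# A record carrying both last names shares the "Reyn" key, so it is a
# candidate even though "Aren" != "Reyn".
both_names = is_candidate(["Reynoso837", "Arenas932"], ["Reynoso837"])

# With only the maiden name there is no shared key.
maiden_only = is_candidate(["Arenas932"], ["Reynoso837"])

print(both_names, maiden_only)
```

Under this reading, the "Arenas932" row appearing as a candidate is a side effect of the whole record being blocked on "Reyn", rather than a Blocking bug.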
In the DIBBs Basic algorithm, in Pass 2, Patient record i and Patient record j must be deemed a match in both the matching fields evaluated: Address and DOB. Because Patient record i and Patient record j were deemed a match in Address but not a match in DOB, the RL algorithm decides in Pass 2 that Patient record i and Patient record j are not a match.
Using the DIBBs Enhanced algorithm, the results are the same as above except that Patient records match on DOB (correct). This is because the DIBBs Enhanced algorithm does not use feature_match_exact(), where the date typing bug was, but instead uses feature_match_log_odds_fuzzy_compare(), which does account for date type normalization.
Because the matching function used by the DIBBs Enhanced algorithm did not suffer from the same bug as that used by DIBBs Basic, it ultimately correctly matched Patient record i and Patient record j due to their match on DOB. DOB has a high log odds score (10.126641103800338), which boosts the DOB match score above the DOB match threshold and, together with the match on Address, which similarly has a high log odds score (8.438284928858774), boosts the overall match score. In fact, DOB and Address have the highest log odds scores out of all match fields. Therefore, because the Address and DOB values were exact matches in Pass 2, they benefited from the full log odds scores for DOB and Address:
1.0 (exact match similarity score for DOB) x 10.126641103800338 (log odds score for DOB) + 1.0 (exact match similarity score for Address) x 8.438284928858774 (log odds score for Address) = 18.564926032659113, which is greater than 17.0, the true match threshold for Pass 2 of the DIBBs Enhanced algorithm.
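The Pass 2 arithmetic above can be reproduced with a few lines of Python. The log odds scores, similarity scores, and threshold are the values quoted in this issue; the dictionary keys and variable names are illustrative, not the service's own.

```python
# Log odds scores quoted above for the two fields evaluated in Pass 2.
LOG_ODDS = {
    "birthdate": 10.126641103800338,
    "address": 8.438284928858774,
}

# Both fields were exact matches, so each similarity score is 1.0.
similarity = {"birthdate": 1.0, "address": 1.0}

TRUE_MATCH_THRESHOLD = 17.0  # Pass 2 threshold quoted above

# Weighted sum: similarity score times log odds score, summed over fields.
total_score = sum(similarity[field] * LOG_ODDS[field] for field in LOG_ODDS)
is_match = total_score > TRUE_MATCH_THRESHOLD

print(total_score, is_match)
```

If the DOB similarity had been 0.0, as it effectively was in DIBBs Basic because of the typing bug, the same sum would fall to about 8.44, well below the threshold.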
Thus, Patient record i and Patient record j are deemed a match by the DIBBs Enhanced algorithm. The DIBBs Enhanced algorithm still suffers from the same parsing issue with First and Last Name as the DIBBs Basic algorithm; however, because First and Last Name aren't evaluated in Pass 2 matching, and the DIBBs Enhanced algorithm didn't suffer from the typing bug, the algorithm deemed the records a match.
datetime.date upon match evaluation, before comparing with DOB of existing Patient record. In the future, however, we should standardize DOB type elsewhere, and only once. (Currently this datetime.date type conversion happens in several places. This is best done more DRYly and separately from when matching occurs, along with any other data transformations/normalizations of match fields.)
'official' and 'maiden'). First Name and Last Name values should be grouped accordingly, and the algorithm should be consistent and transparent in which of these name values is used for Blocking.
@alhayward this is awesome, you uncovered two bugs, when I was thinking there might just be one! Regarding the issue with the feature_match_exact function: is the data parsed from the payload (so incoming Patient data) a string and the data queried from the DB (existing Patient data) a datetime.date object?
@ericbuckley The above comment is a draft - will continue to update with my findings and address this question!
@ericbuckley Correct, the birthdate parsed from the incoming Patient record is of type str, whereas the birthdate parsed from the existing Patient record in the MPI is of type datetime.date.
Summary
When sending the attached patient payload twice to the link-record endpoint, both calls return a "match not found" result. The algorithm should find that the second patient record, which has the exact same PII as the first, matches the person generated in the first call.
Impact
This could be a bug in the algorithm, but it could also be a design flaw of DIBBS_BASIC. We need to investigate further to see what exactly is causing this to be not a match.
Steps to reproduce
Expected behavior
No match on the first call and a match on the second call
Additional context
bundle.json