Insufficient Blocking Rules on Some Data

Describe the bug
Duplicates are not found when running on certain test data supplied by Octavian Chiorcea [coctavius@mdinteractive.com](mailto:coctavius@mdinteractive.com).

To Reproduce
Steps to reproduce the behavior:

Run the script as usual on the following test data:

[cage@yeti cli]$ cat /tmp/test/4/small.csv
id,truth_value,family_name,given_name,gender,birth_date,phone,street_address,city,state,postal_code,SSN
8,9b0b0b7c-e05e-4c89-991d-268eab2483f7,Obrien,Curtis,M,07/02/1996,,300 Amy Corners Suite 735,Rileytown,Alaska,60281,480-21-0833
342,9b0b0b7c-e05e-4c89-991d-268eab2483f7,Orbien,Cutris,M,07/02/1996,,300 Amy oCrenrs Suite 735,Rileytown,Alaska,60281,480-210-833
502,9b0b0b7c-e05e-4c89-991d-268eab2483f7,bOrien,Curtsi,M,07/02/1996,,300 AmyCo rners Suite 735,Rileytown,Alaska,60281,480-21-8033
618,9b0b0b7c-e05e-4c89-991d-268eab2483f7,Obrine,Curtsi,M,07/02/1996,,300 AmyC orners Suite7 35,Rileytown,Alaska,60281,48-021-0833
744,9b0b0b7c-e05e-4c89-991d-268eab2483f7,bOrien,Curtsi,M,07/02/1996,,3 00Amy Corners Suite 735,Rileytown,Alaska,60281,480-210-833
223,04584982-ae7a-44a1-b4f0-e927a8bab0e1,Russell,Lindsay,F,02/05/1977,,2110 Kimberly Villages Apt. 639,New David,Wyoming,52082,211-52-6998
225,04584982-ae7a-44a1-b4f0-e927a8bab0e1,R,Lindsay,F,02/05/1977,,2110 Kimberly Villages Apt. 639,New David,Wyoming,52082,211-52-6998
226,04584982-ae7a-44a1-b4f0-e927a8bab0e1,Russel Smith,Lindsay,F,02/05/1977,,2110 Kimberly Villages Apt. 639,New David,Wyoming,52082,211-52-6998
273,04584982-ae7a-44a1-b4f0-e927a8bab0e1,Russlel,Lnidsay,F,02/05/1977,,2110 Kimbelry Vilalges Apt. 639,New David,Wyoming,52082,211-52-6989
311,04584982-ae7a-44a1-b4f0-e927a8bab0e1,Russlel,Lindasy,F,02/05/1977,,2110 Kimbelry Villgaes Apt. 639,New David,Wyoming,52082,211-52-9698
652,04584982-ae7a-44a1-b4f0-e927a8bab0e1,uRssell,Lidnsay,F,02/05/1977,,2110 Kimberly Vlilagse Apt. 639,New David,Wyoming,52082,121-52-6998
726,04584982-ae7a-44a1-b4f0-e927a8bab0e1,uRssell,Lindasy,F,02/05/1977,,2110 Kmiberly Vilalges Apt. 639,New David,Wyoming,52082,2115-2-6998

Call the script like this: poetry run python ecqm_dedupe.py dedupe-data --fmt CSV /tmp/test/4/small.csv /tmp/test/

Expected behavior
The script should output an Excel file with all of the duplicates identified.

Actual behavior
In the results xlsx file, I see that it only detects duplicates among the records that have misspelled names; the records with the correct names seem to get a different cluster id. Perhaps this happens because it doesn't use birth_date to detect dupes. I tried changing deduplifhirLib/settings.py, but changing the config there did not seem to have any effect.
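To illustrate the birth_date hypothesis, here is a minimal sketch that is independent of the tool. It is not dedupliFHIR's actual blocking code: it assumes plain exact-match blocking, the `candidate_pairs` helper is hypothetical, and the rows are the name and birth_date columns from small.csv with truth_value truncated to its first 8 characters. It counts how many true duplicate pairs would even be compared under a name-based key versus a birth_date key.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations

# Columns taken from small.csv above; truth_value is truncated to 8 characters.
SAMPLE = """\
id,truth_value,family_name,given_name,birth_date
8,9b0b0b7c,Obrien,Curtis,07/02/1996
342,9b0b0b7c,Orbien,Cutris,07/02/1996
502,9b0b0b7c,bOrien,Curtsi,07/02/1996
618,9b0b0b7c,Obrine,Curtsi,07/02/1996
744,9b0b0b7c,bOrien,Curtsi,07/02/1996
223,04584982,Russell,Lindsay,02/05/1977
225,04584982,R,Lindsay,02/05/1977
226,04584982,Russel Smith,Lindsay,02/05/1977
273,04584982,Russlel,Lnidsay,02/05/1977
311,04584982,Russlel,Lindasy,02/05/1977
652,04584982,uRssell,Lidnsay,02/05/1977
726,04584982,uRssell,Lindasy,02/05/1977
"""


def candidate_pairs(records, key_fn):
    """Hypothetical helper: id pairs whose records share the same blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        # Sort so each pair has one canonical ordering across calls.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = list(csv.DictReader(io.StringIO(SAMPLE)))

# Ground truth: every pair of records sharing a truth_value is a duplicate.
true_pairs = candidate_pairs(records, lambda r: r["truth_value"])

# Assumed exact-match blocking keys, for illustration only.
name_pairs = candidate_pairs(records, lambda r: (r["family_name"], r["given_name"]))
dob_pairs = candidate_pairs(records, lambda r: r["birth_date"])

for label, pairs in [("name", name_pairs), ("birth_date", dob_pairs)]:
    found = len(pairs & true_pairs)
    print(f"{label}: {found} of {len(true_pairs)} true duplicate pairs survive blocking")
```

On this sample, the exact name key keeps only 1 of the 31 true pairs (ids 502 and 744, which share the same misspelling), while the birth_date key keeps all 31. That would be consistent with what I see in the output, but I have not confirmed which blocking rules the tool actually applies.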

DSACMS / dedupliFHIR

Insufficient Blocking Rules on Some Data #43