Open SORMAS-JanBoehme opened 3 years ago
@kwa20 @Jan-Boehme While comparing the data between our IfSG application (Octoware) and SORMAS, I have also noticed several duplicates for persons.
Quite often people have two first names, one of which is the call name. Depending on who reports the contact/case/ep, only one or both names are given.
The duplicate recognition should, in the case of more than one first name, check each name separated by a space individually.
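A minimal sketch of that idea in Python, assuming a plain whitespace split and an exact, case-insensitive comparison of the individual names. The function name `first_names_overlap` is hypothetical and not part of SORMAS:

```python
def first_names_overlap(entered: str, stored: str) -> bool:
    """Return True if at least one space-separated first name occurs in both values."""
    # Split on whitespace and normalise case so "MARIA" matches "Maria".
    entered_tokens = {token.casefold() for token in entered.split()}
    stored_tokens = {token.casefold() for token in stored.split()}
    # Any shared token counts as a match on the first name.
    return bool(entered_tokens & stored_tokens)
```

With this, "Hans Peter" and "Peter" would count as a first-name match, while "Hans" and "Peter" would not.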
@Jan-Boehme @MateStrysewske @kwa20
As mentioned in the Prerequisites section, I also think it would be good to do some form of testing/simulation before implementation.
Two points I can think of are:
Users should have the option to choose the weight for each variable, because not all variables or weights apply to all instances.
Summary: Health departments often face the problem that the duplicate detection for persons does not trigger when creating a new case or contact because a typo was made when the data was entered (e.g. entering 1991 as the birth year instead of 1990).
See https://github.com/hzi-braunschweig/SORMAS-Glossary/issues/23#issuecomment-850396503 for a description of how the duplicate detection for persons works at the moment.
This issue is meant to describe a proposed change to the current duplicate detection with the goal of providing more relevant results to the user.
Basic concept:
The detection should use weighted results when checking for duplicates instead of just looking for perfect matches. If the sum of all checks exceeds a configured threshold, the result is considered a possible duplicate and presented to the user. The values used in the duplicate detection should be made available to GSA admins via the UI so that they can adjust the granularity of the detection themselves without having to raise a ticket.
Prerequisites:
Changes to config values
Remove values namesimilaritythreshhold
Introduce values
Process:
The following assumes that all of the weights have a value > 0.0f. If the admin sets one of the values to 0.0f, this indicates that the admin does not want that check to affect the duplicate detection; the corresponding check is skipped and must not influence the final result.
1. Take the first and last name entered by the user and use them for a first evaluation of possible duplicates by using "similarity" when selecting the data. Check the similarity values of first name and last name individually, and also compare them crosswise in case the names have been switched by accident. It would basically look something like this (the actual query is probably way more complicated):
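As a rough illustration of this step (not the actual query), the following Python sketch uses `difflib.SequenceMatcher` as a stand-in for a database similarity function. The function names, the dictionary keys, the way the first-name and last-name similarities are combined (the minimum of the two), and the threshold of 0.7 are all assumptions made for the example:

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; a real query would use the database's function."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def name_score(entered_first: str, entered_last: str,
               stored_first: str, stored_last: str) -> float:
    """Score a name pair, also trying the crosswise comparison for switched names."""
    # Straight comparison: first vs. first and last vs. last.
    straight = min(similarity(entered_first, stored_first),
                   similarity(entered_last, stored_last))
    # Crosswise comparison: covers first and last name entered in the wrong fields.
    swapped = min(similarity(entered_first, stored_last),
                  similarity(entered_last, stored_first))
    return max(straight, swapped)


def name_candidates(entered_first: str, entered_last: str,
                    persons: list, threshold: float = 0.7) -> list:
    """Return the stored persons whose names are similar enough to be checked further."""
    return [
        person for person in persons
        if name_score(entered_first, entered_last,
                      person["first_name"], person["last_name"]) >= threshold
    ]
```

A person stored as "Schmidt" / "Anna" would still be selected for an entered "Anna" / "Schmidt", because the crosswise comparison yields a perfect score.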
2. Do the following checks for all results:
2.1 Sex
2.2 Person.birthdateDD
2.3 Person.birthdateMM
2.4 Person.birthdateYYYY
2.5 Person.passportNumber
2.6 Person.nationalHealthId
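The checks above could be combined into a weighted sum along the following lines. This is only a sketch under stated assumptions: the weight names and values, the threshold, the `Person` class, and the rule that an exact match contributes the full weight while a missing value contributes nothing are all hypothetical, since the issue does not specify them. Name similarity from step 1 is left out of the sum here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    sex: Optional[str] = None
    birthdate_dd: Optional[int] = None
    birthdate_mm: Optional[int] = None
    birthdate_yyyy: Optional[int] = None
    passport_number: Optional[str] = None
    national_health_id: Optional[str] = None


# Hypothetical weights, one per check 2.1 to 2.6; real values would come from the config.
EXAMPLE_WEIGHTS = {
    "sex": 0.25,
    "birthdate_dd": 0.5,
    "birthdate_mm": 0.5,
    "birthdate_yyyy": 0.5,
    "passport_number": 1.0,
    "national_health_id": 1.0,
}


def duplicate_score(entered: Person, stored: Person, weights: dict) -> float:
    """Sum the weights of all enabled checks whose values match exactly."""
    score = 0.0
    for field_name, weight in weights.items():
        if weight == 0.0:
            continue  # check disabled by the admin: skipped, no impact on the result
        entered_value = getattr(entered, field_name)
        stored_value = getattr(stored, field_name)
        if entered_value is None or stored_value is None:
            continue  # assumption: a missing value neither adds to nor reduces the score
        if entered_value == stored_value:
            score += weight
    return score


def is_possible_duplicate(entered: Person, stored: Person,
                          weights: dict, threshold: float) -> bool:
    """A result is a possible duplicate if the summed score exceeds the threshold."""
    return duplicate_score(entered, stored, weights) > threshold
```

With these example weights and a threshold of 1.0, a person whose birth year was mistyped as 1991 instead of 1990 would still be flagged as long as sex, day, and month match, which is the case the summary describes.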
Issues that may be connected to this: https://github.com/hzi-braunschweig/SORMAS-Glossary/issues/23 https://github.com/hzi-braunschweig/SORMAS-Project/issues/3560 https://github.com/hzi-braunschweig/SORMAS-Project/issues/5576