Change duplicate person detection

SORMAS-JanBoehme commented 3 years ago

Summary: The departments of health often face the problem that the duplicate detection for persons does not trigger when creating a new case or contact because a typo was made when the data was entered (e.g. entering 1991 as the birth year instead of 1990).

See https://github.com/hzi-braunschweig/SORMAS-Glossary/issues/23#issuecomment-850396503 for a description on how the duplicate detection for persons is working at the moment.

This issue is meant to describe a proposed change to the current duplicate detection with the goal of providing more relevant results to the user.

Basic concept:

The detection should use weighted results when checking for duplicates instead of just looking for perfect matches. If the weighted sum of all checks exceeds a configured threshold, the result is considered a possible duplicate and presented to the user. The values used in the duplicate detection should be made available to GSA admins via the UI to allow them to adjust the granularity of the detection themselves without having to raise a ticket.

Prerequisites:

The PostgreSQL database needs to have the extension pg_trgm enabled to allow trigram checks on the database during query execution
Run tests on a SORMAS database to determine the load put on the database when using multiple "similarity" calls during query execution if the database has 1,000,000+ persons. Otherwise this feature may be too much to handle for the international version, where everything is stored in one instance
Identify and create one or more appropriate indexes on the database to speed up the trigram calculation

Changes to config values

Remove the value namesimilaritythreshhold

Introduce values

Name	Value Range	Default Value
DuplicateDetectionPersonMaxResults	0 - 100	tbd
DuplicateDetectionPersonNameWeight	0.0f - 5.0f	tbd
DuplicateDetectionPersonNameThreshhold	0.0f - 1.0f	tbd
DuplicateDetectionPersonSexWeight	0.0f - 5.0f	tbd
DuplicateDetectionPersonSexThreshhold	0.0f - 1.0f	tbd
DuplicateDetectionPersonBirthdateDayWeight	0.0f - 5.0f	tbd
DuplicateDetectionPersonBirthdateDayThreshhold	0.0f - 1.0f	tbd
DuplicateDetectionPersonBirthdateMonthWeight	0.0f - 5.0f	tbd
DuplicateDetectionPersonBirthdateMonthThreshhold	0.0f - 1.0f	tbd
DuplicateDetectionPersonBirthdateYearWeight	0.0f - 5.0f	tbd
DuplicateDetectionPersonBirthdateYearThreshhold	0.0f - 1.0f	tbd
DuplicateDetectionPersonPassportNumberWeight	0.0f - 5.0f	tbd
DuplicateDetectionPersonPassportNumberThreshhold	0.0f - 1.0f	tbd
DuplicateDetectionPersonHealthIdWeight	0.0f - 5.0f	tbd
DuplicateDetectionPersonHealthIdThreshhold	0.0f - 1.0f	tbd
DuplicateDetectionPersonResultThreshhold	0.0f - 1.0f	tbd
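
The ranges in the table could be enforced whenever an admin saves a value. The following is only a minimal sketch, assuming that the suffix of each name determines its range; `allowed_range` and `validate` are hypothetical helpers for illustration, not existing SORMAS code:

```python
def allowed_range(name: str) -> tuple[float, float]:
    """Return the (min, max) range from the table, derived from the name suffix."""
    if name.endswith("MaxResults"):
        return (0, 100)
    if name.endswith("Weight"):
        return (0.0, 5.0)
    if name.endswith("Threshhold"):  # spelling as used in this proposal
        return (0.0, 1.0)
    raise KeyError(f"unknown config value: {name}")


def validate(name: str, value: float) -> float:
    """Reject values outside the documented range; return the value otherwise."""
    low, high = allowed_range(name)
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    return value
```

For example, `validate("DuplicateDetectionPersonNameWeight", 5.0)` passes, while a result threshold of 1.5 would be rejected.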

Process:

The following assumes that all of the weights have a value > 0.0f. If the admin sets one of the weights to 0.0f, it indicates that the admin does not want that check to have an impact on the duplicate detection. The corresponding check will be skipped and must not have an impact when calculating the final result.

1. Take the first and last name entered by the user and use them for a first evaluation of possible duplicates by using "similarity" when selecting the data. Compare first name with first name and last name with last name, and also cross-compare them in case the names have been switched by accident. It would basically look something like this (the actual query is probably far more complicated):

SELECT data,
       GREATEST(
           SIMILARITY(firstName, firstNameEntered),
           SIMILARITY(lastName, lastNameEntered),
           SIMILARITY(firstName, lastNameEntered),
           SIMILARITY(lastName, firstNameEntered)
       ) AS highestValue
FROM tables
WHERE
    SIMILARITY(firstName, firstNameEntered) > DuplicateDetectionPersonNameThreshhold OR
    SIMILARITY(lastName, lastNameEntered)  > DuplicateDetectionPersonNameThreshhold OR
    SIMILARITY(firstName, lastNameEntered) > DuplicateDetectionPersonNameThreshhold OR
    SIMILARITY(lastName, firstNameEntered) > DuplicateDetectionPersonNameThreshhold
ORDER BY highestValue DESC
LIMIT DuplicateDetectionPersonMaxResults
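
The same preselection logic can be tried out outside the database. This is a rough sketch only: `preselect` is a hypothetical helper, persons are assumed to be plain `(first_name, last_name)` tuples, and `difflib.SequenceMatcher` is used as a stand-in for the `similarity` function of pg_trgm, so the actual scores will differ from what PostgreSQL returns:

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in for pg_trgm's similarity(); returns a value in 0.0 - 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def preselect(persons, first_entered, last_entered, threshold, max_results):
    """Return up to max_results (score, person) pairs, best match first.

    persons is an iterable of (first_name, last_name) tuples.
    """
    scored = []
    for first, last in persons:
        highest = max(
            similarity(first, first_entered),
            similarity(last, last_entered),
            similarity(first, last_entered),  # names switched by accident
            similarity(last, first_entered),
        )
        if highest > threshold:
            scored.append((highest, (first, last)))
    # Order by the highest similarity before limiting the result size
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:max_results]
```

Checking `highest > threshold` is equivalent to the four OR conditions in the query, because at least one comparison exceeds the threshold exactly when the largest one does.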

2. Do the following checks for all results:

2.1 Sex

If equal --> score of 1.0f
If any of the compared sides has a value of UNKNOWN or null --> score of 1.0f
If person.sex is set but not equal --> score of 0.5f
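
The three rules for sex could be expressed roughly as follows. This is a sketch, assuming the sex is available as a string such as "MALE", "FEMALE" or "UNKNOWN", or as `None`; `sex_score` is a hypothetical name:

```python
from typing import Optional


def sex_score(a: Optional[str], b: Optional[str]) -> float:
    """Score two sex values according to the rules in 2.1."""
    if a is None or b is None or "UNKNOWN" in (a, b):
        return 1.0  # an unknown or missing value is scored like a match
    return 1.0 if a == b else 0.5
```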

2.2 Person.birthdateDD

If it is an exact match --> score of 1.0f
The score is reduced by 0.1f for each value they are apart until it reaches 0.0f. Make sure to include "wrap around" between 31 and 1. This check does not take the actual month into account and uses 31 as the base for the calculation. Checking it against the month would make this far more complex while providing little in terms of accuracy (e.g. 13 and 14 would have a score of 0.9f, and 03 and 27 would have a score of 0.3f because they are 7 apart when wrapping around)

2.3 Person.birthdateMM

If it is an exact match --> score of 1.0f
The score is reduced by 0.2f for each value they are apart until it reaches 0.0f. Make sure to include "wrap around" between 12 and 1.

2.4 Person.birthdateYYYY

If it is an exact match --> score of 1.0f
The score is reduced by 0.1f for each value they are apart until it reaches 0.0f.
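
The three birthdate checks share the same pattern: measure the distance, then subtract a fixed step per unit. A minimal sketch of that pattern, with all function names being hypothetical:

```python
def circular_distance(a: int, b: int, base: int) -> int:
    """Shortest distance between a and b on a ring with `base` positions."""
    diff = abs(a - b) % base
    return min(diff, base - diff)


def step_score(distance: int, step: float) -> float:
    """Start at 1.0 and subtract `step` per unit of distance, never below 0.0."""
    return max(0.0, 1.0 - step * distance)


def day_score(a: int, b: int) -> float:
    # Always uses 31 as base, regardless of the actual month
    return step_score(circular_distance(a, b, 31), 0.1)


def month_score(a: int, b: int) -> float:
    return step_score(circular_distance(a, b, 12), 0.2)


def year_score(a: int, b: int) -> float:
    # Years do not wrap around
    return step_score(abs(a - b), 0.1)
```

With this sketch, the days 3 and 27 are 7 apart (wrapping over 31), which gives a score of about 0.3, and the months 12 and 1 are 1 apart. Results are subject to the usual floating-point rounding, so comparisons should allow a small tolerance.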

2.5 Person.passportNumber

Perform a trigram calculation and use the resulting score

2.6 Person.nationalHealthId

Perform a trigram calculation and use the resulting score
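
For readers unfamiliar with trigram scores, the following sketch approximates how the `similarity` function of pg_trgm arrives at its value: each word is lowercased and padded with two leading spaces and one trailing space, split into three-character sequences, and the score is the number of shared trigrams divided by the number of distinct trigrams in both strings. It is meant for illustration and experiments only; in the proposal the calculation would happen in the database, and edge cases may differ from PostgreSQL:

```python
import re


def trigrams(text: str) -> set[str]:
    """Collect the trigrams of every alphanumeric word in text."""
    result = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "  # two spaces in front, one behind
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams, in 0.0 - 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    if not union:
        return 0.0  # nothing to compare, e.g. two empty strings
    return len(ta & tb) / len(union)
```

A single mistyped character in a passport number or health ID changes up to three trigrams, so short identifiers lose score faster than long ones. This may be worth considering when choosing the thresholds.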

3. Apply weights: Multiply the score of each check by the corresponding weight and normalize the sum to a result of 0.0f - 1.0f. Could look something like this:

float maxScore;
float achievedScore;

for (Result result : allResults) {
    maxScore = 0.0f;
    achievedScore = 0.0f;

    // A weight of 0.0f means the check is skipped and does not count towards maxScore
    if (DuplicateDetectionPersonNameWeight > 0.0f) {
        maxScore += DuplicateDetectionPersonNameWeight;
        achievedScore += result.DuplicateDetectionPersonNameScore * DuplicateDetectionPersonNameWeight;
    }

    if (DuplicateDetectionPersonSexWeight > 0.0f) {
        maxScore += DuplicateDetectionPersonSexWeight;
        achievedScore += result.DuplicateDetectionPersonSexScore * DuplicateDetectionPersonSexWeight;
    }

    if (DuplicateDetectionPersonBirthdateDayWeight > 0.0f) {
        maxScore += DuplicateDetectionPersonBirthdateDayWeight;
        achievedScore += result.DuplicateDetectionPersonBirthdateDayScore * DuplicateDetectionPersonBirthdateDayWeight;
    }

    if (DuplicateDetectionPersonBirthdateMonthWeight > 0.0f) {
        maxScore += DuplicateDetectionPersonBirthdateMonthWeight;
        achievedScore += result.DuplicateDetectionPersonBirthdateMonthScore * DuplicateDetectionPersonBirthdateMonthWeight;
    }

    if (DuplicateDetectionPersonBirthdateYearWeight > 0.0f) {
        maxScore += DuplicateDetectionPersonBirthdateYearWeight;
        achievedScore += result.DuplicateDetectionPersonBirthdateYearScore * DuplicateDetectionPersonBirthdateYearWeight;
    }

    if (DuplicateDetectionPersonPassportNumberWeight > 0.0f) {
        maxScore += DuplicateDetectionPersonPassportNumberWeight;
        achievedScore += result.DuplicateDetectionPersonPassportNumberScore * DuplicateDetectionPersonPassportNumberWeight;
    }

    if (DuplicateDetectionPersonHealthIdWeight > 0.0f) {
        maxScore += DuplicateDetectionPersonHealthIdWeight;
        achievedScore += result.DuplicateDetectionPersonHealthIdScore * DuplicateDetectionPersonHealthIdWeight;
    }

    // maxScore is 0.0f if every weight is 0.0f, so guard against division by zero
    if (maxScore > 0.0f && achievedScore / maxScore >= DuplicateDetectionPersonResultThreshhold) {
        // Is possible duplicate
    }
}
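
To make the normalization easier to follow, here is a compact sketch with a worked example. It assumes scores and weights are held in dictionaries keyed by check name; `normalized_score` and the keys are hypothetical and only serve the illustration:

```python
def normalized_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of the check scores, ignoring checks with weight 0.0."""
    max_score = 0.0
    achieved_score = 0.0
    for check, weight in weights.items():
        if weight <= 0.0:
            continue  # disabled check: no impact on the result
        max_score += weight
        achieved_score += scores[check] * weight
    if max_score == 0.0:
        return 0.0  # every check is disabled, nothing can match
    return achieved_score / max_score


weights = {"name": 2.0, "sex": 1.0, "birthdateYear": 1.0, "birthdateDay": 0.0}
scores = {"name": 0.8, "sex": 1.0, "birthdateYear": 0.9, "birthdateDay": 0.0}
result = normalized_score(scores, weights)  # (1.6 + 1.0 + 0.9) / 4.0
```

In this example the day check is disabled, so its score of 0.0 does not lower the result, which comes out at about 0.875. With a result threshold of 0.8 the person would be presented as a possible duplicate.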

Issues that may be connected to this:
- https://github.com/hzi-braunschweig/SORMAS-Glossary/issues/23
- https://github.com/hzi-braunschweig/SORMAS-Project/issues/3560
- https://github.com/hzi-braunschweig/SORMAS-Project/issues/5576

marko-arn commented 3 years ago

@kwa20 @Jan-Boehme As I am currently comparing the data between our IfSG application (Octoware) and SORMAS, I have also noticed several duplicates for persons.

Quite often people have two first names, one of which is the call name (the name they go by). Depending on who reports the contact/case/ep, only one or both names are given.

The duplicate recognition should, in the case of more than one first name, check each name separated by a space individually.
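
A rough sketch of what this could look like: compare the full first-name strings and, in addition, every individual name on one side with every individual name on the other, then keep the best score. `first_name_score` is a hypothetical helper, and `difflib.SequenceMatcher` is only a stand-in for the trigram similarity that the database would calculate:

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in for a trigram similarity; returns a value in 0.0 - 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def first_name_score(stored: str, entered: str) -> float:
    """Best similarity between the full strings or any pair of single names."""
    candidates = [(stored, entered)]
    candidates += [(s, e) for s in stored.split() for e in entered.split()]
    return max(similarity(s, e) for s, e in candidates)
```

With this approach "Anna Maria" and "Maria" reach the full score, although the complete strings differ considerably.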

bernardsilenou commented 3 years ago

@Jan-Boehme @MateStrysewske @kwa20

As mentioned in the Prerequisites section, I also think it would be good to do some form of testing/simulation before implementation.
Two points I can think of are:
- Performance, especially on mobile devices
- Sensitivity, i.e. whether the weighted method detects significantly more duplicates than the non-weighted one.
Users should have the option to choose the weights for each variable, because not all variables or weights apply to all instances.
