SORMAS-Foundation / SORMAS-Project

SORMAS (Surveillance, Outbreak Response Management and Analysis System) is an early warning and management system to fight the spread of infectious diseases.
https://sormas.org
GNU General Public License v3.0

Change duplicate person detection #5758

Open SORMAS-JanBoehme opened 3 years ago

SORMAS-JanBoehme commented 3 years ago

Summary: Health departments often find that the duplicate detection for persons does not trigger when a new case or contact is created, because a typo slipped in during data entry (e.g. 1991 entered as the birth year instead of 1990).

See https://github.com/hzi-braunschweig/SORMAS-Glossary/issues/23#issuecomment-850396503 for a description of how the duplicate detection for persons currently works.

This issue is meant to describe a proposed change to the current duplicate detection with the goal of providing more relevant results to the user.

Basic concept:

The detection should use weighted results when checking for duplicates instead of looking only for perfect matches. If the weighted sum of all checks reaches or exceeds a configured threshold, the result is considered a possible duplicate and presented to the user. The values used in the duplicate detection should be made available to GSA admins via the UI so that they can adjust the granularity of the detection themselves without having to raise a ticket.

Prerequisites:

Changes to config values

Remove values namesimilaritythreshhold

Introduce values

| Name | Value Range | Default Value |
| --- | --- | --- |
| DuplicateDetectionPersonMaxResults | 0 - 100 | tbd |
| DuplicateDetectionPersonNameWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonNameThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonSexWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonSexThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonBirthdateDayWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonBirthdateDayThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonBirthdateMonthWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonBirthdateMonthThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonBirthdateYearWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonBirthdateYearThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonPassportNumberWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonPassportNumberThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonHealthIdWeight | 0.0f - 5.0f | tbd |
| DuplicateDetectionPersonHealthIdThreshhold | 0.0f - 1.0f | tbd |
| DuplicateDetectionPersonResultThreshhold | 0.0f - 1.0f | tbd |
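
Since admins are meant to edit these values via the UI, the ranges would need to be enforced somewhere. A minimal sketch of such a check, in Python purely for illustration: the function names `check_range` and `validate_config` are hypothetical and not part of SORMAS, and the ranges are simply copied from the table above (defaults are still tbd, so none are assumed).

```python
# Inclusive ranges taken from the proposal's table; everything else here
# (function names, the dict-based config) is a hypothetical illustration.
WEIGHT_RANGE = (0.0, 5.0)
THRESHOLD_RANGE = (0.0, 1.0)
MAX_RESULTS_RANGE = (0, 100)


def check_range(name, value, bounds):
    """Raise ValueError if value lies outside the inclusive bounds."""
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside {low} - {high}")
    return value


def validate_config(config):
    """Validate a dict of DuplicateDetectionPerson* values by name suffix."""
    for name, value in config.items():
        if name.endswith("MaxResults"):
            check_range(name, value, MAX_RESULTS_RANGE)
        elif name.endswith("Weight"):
            check_range(name, value, WEIGHT_RANGE)
        elif name.endswith("Threshhold"):  # spelling as used in the proposal
            check_range(name, value, THRESHOLD_RANGE)
        else:
            raise ValueError(f"unknown config value: {name}")
    return config
```

Rejecting out-of-range values at save time would keep the later score calculation free of special cases such as negative weights.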

Process:

The following assumes that all of the weights have a value > 0.0f. If an admin sets one of the weights to 0.0f, this indicates that the admin does not want that check to influence the duplicate detection; the corresponding check is skipped and must not affect the final result.

  1. Take the first and last name entered by the user and use them for a first evaluation of possible duplicates by using "similarity" when selecting the data. Check the similarity of first name against first name and last name against last name, and also compare them crosswise in case the names were switched by accident. This would look roughly like this (the actual query is probably far more complicated):

    SELECT data FROM tables
    WHERE
    SIMILARITY(firstName, firstNameEntered) > DuplicateDetectionPersonNameThreshhold OR
    SIMILARITY(lastName, lastNameEntered)  > DuplicateDetectionPersonNameThreshhold OR
    SIMILARITY(firstName, lastNameEntered) > DuplicateDetectionPersonNameThreshhold OR
    SIMILARITY(lastName, firstNameEntered) > DuplicateDetectionPersonNameThreshhold
    ORDER BY highestValue DESC
    LIMIT DuplicateDetectionPersonMaxResults
  2. Do the following checks for all results:

2.1 Sex

2.2 Person.birthdateDD

2.3 Person.birthdateMM

2.4 Person.birthdateYYYY

2.5 Person.passportNumber

2.6 Person.nationalHealthId

  3. Apply weights: multiply the score of each check by the corresponding weight and normalize the result to a value of 0.0f - 1.0f. Could look something like this:
float maxScore;
float achievedScore;

for each(Result result in allResults){
    maxScore = 0.0f;
    achievedScore = 0.0f;

    if(DuplicateDetectionPersonNameWeight > 0.0f) {
        maxScore += DuplicateDetectionPersonNameWeight
        achievedScore += result.DuplicateDetectionPersonNameScore * DuplicateDetectionPersonNameWeight
    }

    if(DuplicateDetectionPersonSexWeight > 0.0f) {
        maxScore += DuplicateDetectionPersonSexWeight
        achievedScore += result.DuplicateDetectionPersonSexScore * DuplicateDetectionPersonSexWeight
    }

    if(DuplicateDetectionPersonBirthdateDayWeight > 0.0f) {
        maxScore += DuplicateDetectionPersonBirthdateDayWeight
        achievedScore += result.DuplicateDetectionPersonBirthdateDayScore * DuplicateDetectionPersonBirthdateDayWeight
    }

    if(DuplicateDetectionPersonBirthdateMonthWeight > 0.0f) {
        maxScore += DuplicateDetectionPersonBirthdateMonthWeight
        achievedScore += result.DuplicateDetectionPersonBirthdateMonthScore * DuplicateDetectionPersonBirthdateMonthWeight
    }

    if(DuplicateDetectionPersonBirthdateYearWeight > 0.0f) {
        maxScore += DuplicateDetectionPersonBirthdateYearWeight
        achievedScore += result.DuplicateDetectionPersonBirthdateYearScore * DuplicateDetectionPersonBirthdateYearWeight
    }

    if(DuplicateDetectionPersonPassportNumberWeight > 0.0f) {
        maxScore += DuplicateDetectionPersonPassportNumberWeight
        achievedScore += result.DuplicateDetectionPersonPassportNumberScore * DuplicateDetectionPersonPassportNumberWeight
    }

    if(DuplicateDetectionPersonHealthIdWeight > 0.0f) {
        maxScore += DuplicateDetectionPersonHealthIdWeight
        achievedScore += result.DuplicateDetectionPersonHealthIdScore * DuplicateDetectionPersonHealthIdWeight
    }

    if(maxScore > 0.0f && achievedScore / maxScore >= DuplicateDetectionPersonResultThreshhold){
        //Is possible duplicate
    }
}
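
To make the normalization concrete, here is a small runnable sketch of the same calculation in Python, applied to the birth year typo from the summary. All weights and the result threshold are made-up values (the proposal leaves them tbd), the check names are shortened for readability, and the per-check scores are assumed to lie between 0.0 and 1.0; how each individual score is computed is not defined in this issue.

```python
# Hypothetical weights; the proposal leaves all defaults as "tbd".
WEIGHTS = {
    "name": 3.0,
    "sex": 1.0,
    "birthdateDay": 1.0,
    "birthdateMonth": 1.0,
    "birthdateYear": 1.0,
    "passportNumber": 0.0,    # 0.0 means: skip this check entirely
    "nationalHealthId": 0.0,  # skipped as well
}
RESULT_THRESHOLD = 0.8  # stands in for DuplicateDetectionPersonResultThreshhold


def normalized_score(scores, weights):
    """Weighted mean of per-check scores, ignoring checks with weight 0.0."""
    max_score = 0.0
    achieved = 0.0
    for check, weight in weights.items():
        if weight > 0.0:
            max_score += weight
            achieved += scores.get(check, 0.0) * weight
    # If every weight is 0.0 there is nothing to compare, so report 0.0
    # instead of dividing by zero.
    return achieved / max_score if max_score > 0.0 else 0.0


def is_possible_duplicate(scores, weights=WEIGHTS, threshold=RESULT_THRESHOLD):
    return normalized_score(scores, weights) >= threshold


# Birth year typo (1991 instead of 1990): every check matches except the year.
typo_scores = {
    "name": 1.0,
    "sex": 1.0,
    "birthdateDay": 1.0,
    "birthdateMonth": 1.0,
    "birthdateYear": 0.0,
}
print(normalized_score(typo_scores, WEIGHTS))  # 6/7, roughly 0.857
print(is_possible_duplicate(typo_scores))      # True with the values above
```

With these example values the mistyped year costs 1 of 7 weight points, so the person would still be offered as a possible duplicate, which is the behaviour this proposal is after.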

Issues that may be connected to this: https://github.com/hzi-braunschweig/SORMAS-Glossary/issues/23 https://github.com/hzi-braunschweig/SORMAS-Project/issues/3560 https://github.com/hzi-braunschweig/SORMAS-Project/issues/5576

marko-arn commented 3 years ago

@kwa20 @Jan-Boehme While comparing the data between our IfSG application (Octoware) and SORMAS, I have also noticed several duplicate persons.

Quite often, people have two first names, one of which is the call name. Depending on who reports the contact/case/ep, only one or both names are given.

If a person has more than one first name, the duplicate detection should check each name (separated by a space) individually.
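
A rough sketch of what that could mean, in Python for illustration only. The `similarity` helper uses `difflib.SequenceMatcher` as a stand-in for the database similarity function, so its values will differ from what the real query returns; `first_name_similarity` is a hypothetical name, and the example names are invented.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in for the database SIMILARITY function; returns 0.0 - 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def first_name_similarity(stored, entered):
    """Best similarity over all pairs of space-separated first names."""
    stored_parts = stored.split()
    entered_parts = entered.split()
    if not stored_parts or not entered_parts:
        return 0.0
    return max(similarity(s, e) for s in stored_parts for e in entered_parts)


# Comparing the full strings penalizes the missing second name,
# comparing name by name does not.
print(similarity("Anna Maria", "Maria"))
print(first_name_similarity("Anna Maria", "Maria"))
```

Taking the best pair means a person stored as "Anna Maria" and reported as just "Maria" gets a full first name match. Whether the best pair or some average is the right choice would need to be decided.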

bernardsilenou commented 3 years ago

@Jan-Boehme @MateStrysewske @kwa20