SORMAS-Foundation / SORMAS-Project

SORMAS (Surveillance, Outbreak Response Management and Analysis System) is an early warning and management system to fight the spread of infectious diseases.
https://sormas.org
GNU General Public License v3.0

Adding the social security number (National health id) to the list of variables for duplicate person detection [2] #2178

Closed bernardsilenou closed 4 years ago

bernardsilenou commented 4 years ago

Problem Description

To improve the sensitivity of our duplicate detection for case, person and contact entities, we should add the social security number (SSN) to the person identifiers. For countries that use such a system, the SSN is a unique identifier of the person, even stronger than the person's name. Our old duplicate case detection is defined in #757.

Feature Description

Proposed Change

a.) National health ID
b.) With a positive check for the passport number, we show the duplicate; if not, we ignore it

Possible Alternatives

Additional Information

bernardsilenou commented 4 years ago

@MateStrysewskeSym Still, I do not know how a and b can be combined and compared with the threshold set on the server. Do we compare only the value of a with the threshold, since the others are perfect matches?

MateStrysewske commented 4 years ago

@bernardsilenou I suggest we talk about this in detail once we are able to prioritize it. It's definitely a good idea.

bernardsilenou commented 4 years ago

OK, I forgot to add date of report.

max-hzi commented 4 years ago

Can we please also add date of birth? This would be relevant for the German system, as they do not work with the social security number. Or should I create a new issue for this?

bernardsilenou commented 4 years ago

@max-hzi DOB is already included now but only considered if it is not missing in both cases

max-hzi commented 4 years ago

Thanks a lot for the quick response dear Bernard!

bernardsilenou commented 4 years ago

optimal

bernardsilenou commented 4 years ago

Consider the refinement in #1644 when implementing this issue. Here is the refinement on duplicate person detection:

@Iheanacho2027 similarity of contacts has 2 levels:

1. Similar contact person: This uses only the first name and last name for now. We can later add variables linked to the person, like sex and DOB/age. We do not consider region and district at this stage of identifying similar persons. I also do not think we should limit the search to persons that the user has access to, because this would prevent us from identifying contacts outside the jurisdiction of the user. I suggest we go with this: only take into account persons whose combined first name and last name are a high-similarity match (not necessarily exact; say >90%, changeable by the user), in combination with age/DOB, gender and National health ID. This will mean creating another parameter to determine the cutoff limit for similar person detection, in addition to the parameter that we have for similar contact detection. This will prevent the system from suggesting many persons to the user, but it will not improve search time much, since we still have to search the whole person table. This issue is related to #2178.

2. Similar contacts: For this, the system will check for similar names, regions, districts, age, sex and National health ID of all the contacts related to the source case.

Iheanacho2027 commented 4 years ago

@bernardsilenou Having only level one running for Nigeria in the contact detection module would be very helpful if it combines both the first name and the last name when checking for similar contacts. I see a lot of situations where the system checks only one name variable and brings up about 40 - 100 names that have no relation to the contact about to be imported. That is, it doesn't make sense for the system to use age and sex to check for similarity when the combined first name and last name are not the same; it slows down sessions like import.

bernardsilenou commented 4 years ago

@Iheanacho2027 Level 1 checking should include both first and last name. For example, "john man" and "john michael" should not be suggested as duplicates by the system. If only one name is used, please create an issue; it might be a bug. If both names are used but many false duplicate names are suggested by the system, then we need to increase the threshold level to, say, 0.8 or 0.9. This threshold value is not fixed; we have to adjust it as you use the system in order to find a comfortable cutoff.
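
The level 1 check described above can be sketched roughly as follows. This is only an illustration: the similarity measure here is Python's `difflib.SequenceMatcher`, used as a stand-in and not necessarily what the server computes, and the function names `name_similarity` and `is_possible_duplicate` are hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(first_a: str, last_a: str, first_b: str, last_b: str) -> float:
    """Similarity in [0, 1] of the combined first + last names (stand-in measure)."""
    a = "".join((first_a + last_a).lower().split())
    b = "".join((first_b + last_b).lower().split())
    return SequenceMatcher(None, a, b).ratio()


def is_possible_duplicate(first_a: str, last_a: str, first_b: str, last_b: str,
                          threshold: float = 0.8) -> bool:
    """Suggest a duplicate only when the combined names reach the threshold."""
    return name_similarity(first_a, last_a, first_b, last_b) >= threshold


# Same first name, different last name: should not be suggested.
print(is_possible_duplicate("john", "man", "john", "michael"))
# Minor spelling difference in the first name: still suggested at 0.9.
print(is_possible_duplicate("john", "michael", "jon", "michael", threshold=0.9))
```

Raising `threshold` reduces the number of suggestions at the cost of missing names with larger spelling differences, which is the trade-off to tune while using the system.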

bernardsilenou commented 4 years ago

For duplicate contact detection, we need both 1 and 2; point 1 alone is not enough.

bernardsilenou commented 4 years ago

I did a manual simulation test today on the sormas.symda server and also in R, calculating the distance between 2 strings using the qgram method with q = 1. A few points I found:

  1. The distance measure is insensitive to the order of first name and last name, as long as the strings are converted to lower case, stripped of blank spaces and concatenated. This works as expected.
  2. The default tolerance was too low: cases with similar first names but different last names were recommended as duplicates, for example "nellynelly" and "nellyemmanuel" with distance d = 7 (d = 0 for a perfect match).
  3. I created many names with minor spelling errors, like omission/addition of a letter, and a good cutoff value was d = 3. Higher values of d occurred for non-similar names.
  4. Since the SSN is usually not available for all entities and is not a required field, adding it to the similarity measure did not improve it. However, for systems where the DOB is a required field for all entities, we can include it. To improve sensitivity by including other variables, we must make them required variables.
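
For reference, the q-gram distance from the points above can be reproduced with a small Python sketch. It assumes the usual definition (sum of absolute differences between the q-gram counts of the two strings); the helper names `normalize` and `qgram_distance` are hypothetical and not taken from the SORMAS code base.

```python
from collections import Counter


def normalize(first_name: str, last_name: str) -> str:
    """Lower-case, remove all whitespace and concatenate the two names."""
    return "".join((first_name + last_name).lower().split())


def qgram_distance(a: str, b: str, q: int = 1) -> int:
    """Sum of absolute differences between the q-gram counts of a and b."""
    grams_a = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    grams_b = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    return sum(abs(grams_a[g] - grams_b[g]) for g in grams_a.keys() | grams_b.keys())


# Similar first name, different last name: far above a cutoff of d = 3.
print(qgram_distance(normalize("Nelly", "Nelly"), normalize("Nelly", "Emmanuel")))  # 7
# Swapped first and last name: q = 1 only counts characters, so d = 0.
print(qgram_distance(normalize("John", "Michael"), normalize("Michael", "John")))  # 0
# One omitted letter: d = 1.
print(qgram_distance(normalize("Emmanuel", "Nelly"), normalize("Emanuel", "Nelly")))  # 1
```

With q = 1 the measure compares character counts only, which is why it ignores name order; it also means two different names made of the same letters get d = 0.
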

bernardsilenou commented 4 years ago

@MateStrysewskeSym I suggest this improvement for duplicate case detection: we can use a two-stage method:

stage 1: filtering

stage 2: applying similarity measure
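
A minimal sketch of how the two stages could fit together, under stated assumptions: the stage 1 filter is not specified in this thread, so the rule used here (reject a pair only when both dates of birth are present and differ) is an assumption based on the DOB remark earlier in the thread. Stage 2 uses the q-gram distance with q = 1 and the cutoff d = 3 from the simulation. The `Person` class and all function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    first_name: str
    last_name: str
    birth_date: Optional[str] = None  # ISO date string; None when missing


def name_key(p: Person) -> str:
    """Lower-cased, whitespace-free concatenation of first and last name."""
    return "".join((p.first_name + p.last_name).lower().split())


def qgram1_distance(a: str, b: str) -> int:
    """q-gram distance with q = 1: difference of character counts."""
    ca, cb = Counter(a), Counter(b)
    return sum(abs(ca[ch] - cb[ch]) for ch in ca.keys() | cb.keys())


def passes_filter(a: Person, b: Person) -> bool:
    """Stage 1 (assumed rule): compare DOB only when it is present in both."""
    if a.birth_date is not None and b.birth_date is not None:
        return a.birth_date == b.birth_date
    return True


def find_duplicates(candidate: Person, existing: List[Person],
                    max_distance: int = 3) -> List[Person]:
    """Stage 1 filters cheaply; stage 2 applies the similarity measure."""
    filtered = [p for p in existing if passes_filter(candidate, p)]
    key = name_key(candidate)
    return [p for p in filtered if qgram1_distance(key, name_key(p)) <= max_distance]
```

Filtering first keeps the more expensive string comparison to the records that survive the cheap checks.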

bernardsilenou commented 4 years ago

For duplicate contact detection, we do as we already do: identify duplicate persons using stage 2 and identify duplicate contacts using stage 1.