DSACMS / dedupliFHIR

Prototype for basic deduplication and aggregation of eCQM data
Creative Commons Zero v1.0 Universal
Incorporate data normalization for relevant demographics #55

Open cooperthompson opened 2 months ago

cooperthompson commented 2 months ago

Is your feature request related to a problem? Please describe.
It doesn't appear that Splink handles data normalization. Several demographic fields either benefit from or essentially require data normalization. For example:

Describe the intended behavior
The demographics for these two patients should probably be normalized before they are submitted to the record linking process:

Jon Smith at 123 Main Street
John Smith at 123 Main St.

Describe the solution you'd like
Data normalization is a pretty complicated problem, and different problems call for different solutions. You may be able to use a USPS process to normalize address data, though you should consider the monetary cost of calling a USPS service for a large number of patients, which will be necessary for the populations we are talking about. There may be other options for normalizing names by their phonetic spelling.
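As one illustration of the phonetic option mentioned above, here is a minimal sketch of American Soundex in plain Python. Soundex is swapped in purely as an example of a phonetic key; nothing in this thread says the project uses it, and the function name `soundex` is hypothetical rather than part of Splink or dedupliFHIR.

```python
# Map each consonant to its Soundex digit; vowels, H, W and Y have no code.
_CODES = {}
for _letters, _digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character American Soundex key for a name (illustrative helper)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = [first]
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W do not separate consonants that share a code.
            continue
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        # Vowels reset prev to "", so a repeated code after a vowel is kept.
        prev = code
    # Pad with zeros and truncate to a letter plus three digits.
    return ("".join(encoded) + "000")[:4]


print(soundex("Jon"), soundex("John"))
```

With this key, "Jon" and "John" map to the same value, so a derived column like this could be compared or blocked on instead of the raw first name. Soundex is tuned for English surnames and will over-merge some distinct names, so it is probably better treated as one comparison signal than as a replacement for the original field.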

IsaacMilarky commented 1 month ago

While Splink does not support data normalization out of the box, probabilistic record linkage still goes a long way toward de-duplicating the records.

That said, normalizing the fields that our blocking rules apply to would help reduce the error rate. In the short term, we can add some textual normalization to the names and addresses as we parse them. We can use tools such as NLTK to help with this preprocessing.

I will work on adding some textual normalization in the near future. Please let us know if you would like to talk more about potential solutions.
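A rough sketch of what parse-time textual normalization might look like, using only the Python standard library rather than NLTK. The function names and the `SUFFIXES` table are hypothetical and deliberately tiny; a real list of street-suffix abbreviations would be much longer, and none of this is taken from the dedupliFHIR codebase.

```python
import re
import unicodedata

# Hypothetical, incomplete table of street-suffix abbreviations.
SUFFIXES = {
    "st": "street",
    "ave": "avenue",
    "rd": "road",
    "blvd": "boulevard",
    "dr": "drive",
    "ln": "lane",
}


def normalize_text(value: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^a-z0-9\s]", " ", ascii_only.lower())
    return " ".join(cleaned.split())


def normalize_address(address: str) -> str:
    """Normalize an address line and expand a trailing suffix abbreviation."""
    tokens = normalize_text(address).split()
    # Only the last token is expanded, so a leading "St." (Saint) is left alone.
    if tokens and tokens[-1] in SUFFIXES:
        tokens[-1] = SUFFIXES[tokens[-1]]
    return " ".join(tokens)


print(normalize_address("123 Main St."))
print(normalize_address("123 Main Street"))
```

Both example addresses from the issue reduce to the same string here, which is what an exact-match blocking rule on the address would need. Expanding only the final token is a conservative choice: it misses lines such as "123 Main St Apt 4", which a fuller parser or an address-validation service would handle.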