datasciencecampus / pprl_toolkit

The privacy-preserving record linkage toolkit: a proof-of-concept public demo of next-gen data linkage techniques.
https://datasciencecampus.github.io/pprl_toolkit/
MIT License

Enforce consistency on missing-value handling in `features.py` #28

Closed matweldon closed 4 months ago

matweldon commented 5 months ago

`features.py` implements several feature-processing functions. For missing values, there are two plausible ways of embedding them into the Bloom filter:

The second option is preferable, because we often need to compare a missing value with a non-missing one. For example, one record may be coded as 'Female' and the other as NA. There is no reason for records with missing values to be more similar to each other than to any other record; they should have equal similarity to all other records.
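To make the similarity argument concrete, here is a minimal sketch. It assumes one reading of the two options: either a missing value gets its own dedicated token, or it contributes no tokens at all. The helper names, the token format, the filter size and the use of Dice similarity are all illustrative assumptions, not the toolkit's actual code.

```python
import hashlib


def bloom_indices(tokens, size=64, num_hashes=2):
    """Map tokens to the set of Bloom filter bit positions they switch on.

    Hypothetical helper for illustration only.
    """
    indices = set()
    for token in tokens:
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{token}".encode()).hexdigest()
            indices.add(int(digest, 16) % size)
    return indices


def dice(a, b):
    """Dice similarity of two bit-position sets.

    Two empty sets are scored 0.0 here by convention (an assumption).
    """
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


female = bloom_indices(["sex<female>"])

# Reading A (assumed): missing values get a dedicated token.
missing_as_token = bloom_indices(["sex<NA>"])

# Reading B (assumed): missing values contribute no tokens.
missing_as_empty = bloom_indices([])

# Under reading A, two records with missing values look identical.
print(dice(missing_as_token, missing_as_token))

# Under reading B, a missing value is equally (dis)similar to everything.
print(dice(missing_as_empty, female))
print(dice(missing_as_empty, missing_as_empty))
```

Under the dedicated-token reading, two missing values share every bit and so score as a perfect match, which is the artificial similarity described above. With no tokens, the missing value adds nothing to any comparison.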

I think the preferred option is currently implemented in some of the feature-processing functions, but not consistently. We need to make it consistent across all of them, and also add tests that check for this handling.
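A rough sketch of what such a test could look like follows. It assumes the desired behaviour is "a missing value yields an empty token list". The feature functions here (`sex_features`, `name_features`) are hypothetical stand-ins operating on plain lists, not the real functions in `features.py`.

```python
import math

# Values treated as missing in this sketch (an assumption).
MISSING_VALUES = [None, float("nan"), ""]


def is_missing(value):
    """Return True for None, NaN, and blank strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def sex_features(values):
    """Hypothetical feature function: one token per non-missing value."""
    return [
        [] if is_missing(v) else [f"sex<{str(v).strip().lower()[:1]}>"]
        for v in values
    ]


def name_features(values):
    """Hypothetical feature function: padded character bigrams."""
    out = []
    for v in values:
        if is_missing(v):
            out.append([])
            continue
        padded = f"_{str(v).strip().lower()}_"
        out.append([padded[i:i + 2] for i in range(len(padded) - 1)])
    return out


def assert_missing_gives_no_tokens(feature_func, present_value):
    """Shared check: missing inputs give [], present inputs give tokens."""
    result = feature_func(MISSING_VALUES + [present_value])
    assert len(result) == len(MISSING_VALUES) + 1
    assert all(tokens == [] for tokens in result[:-1])
    assert result[-1] != []


# The same check is applied to every feature function.
for func, example in [(sex_features, "Female"), (name_features, "Ann")]:
    assert_missing_gives_no_tokens(func, example)
```

Running one shared check over every feature function is what would catch the inconsistency: any function that handles missing values differently fails the same assertion.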

matweldon commented 5 months ago

Can we add an abstraction that enforces a common signature and behaviours?
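One possible shape for this, as a sketch only: a decorator that turns a per-value tokeniser into a feature function with a fixed signature (iterable of values in, list of token lists out) and handles missing values in one place. The names `feature_function`, `is_missing`, `sex_tokens` and `dob_tokens` are hypothetical, and the empty-list convention for missing values is an assumption.

```python
from __future__ import annotations

import math
from functools import wraps
from typing import Any, Callable, Iterable


def is_missing(value: Any) -> bool:
    """Return True for None, NaN, and blank strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def feature_function(
    tokenise: Callable[[Any], list[str]],
) -> Callable[[Iterable[Any]], list[list[str]]]:
    """Wrap a per-value tokeniser so missing values always give []."""

    @wraps(tokenise)
    def wrapper(values: Iterable[Any]) -> list[list[str]]:
        # Missing-value handling lives here, not in each tokeniser.
        return [[] if is_missing(v) else list(tokenise(v)) for v in values]

    return wrapper


@feature_function
def sex_tokens(value: Any) -> list[str]:
    """Hypothetical tokeniser: first letter of the recorded sex."""
    return [f"sex<{str(value).strip().lower()[:1]}>"]


@feature_function
def dob_tokens(value: Any) -> list[str]:
    """Hypothetical tokeniser: expects an ISO date string YYYY-MM-DD."""
    year, month, day = str(value).split("-")
    return [f"year<{year}>", f"month<{month}>", f"day<{day}>"]


print(sex_tokens(["Female", None]))
print(dob_tokens(["1990-01-31", float("nan")]))
```

With this shape, each tokeniser only deals with values that are actually present, so it cannot diverge on how missing values are embedded.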