data61 / clkhash

CLK hash: hash pii for entity matching
Apache License 2.0

Handle hashing missing values #1

Open hardbyte opened 7 years ago

hardbyte commented 7 years ago

Say a row doesn't have data for one field:

INDEX,NAME freetext,DOB YYYY/MM/DD,GENDER M or F
0,Libby Slemmer,1933/09/13,F
1,Garold Staten,,M
2,Yaritza Edman,1972/11/30,

What should we do?

1) The current approach still creates a CLK for the record. It either hashes an empty string or skips that feature, so fewer bits get set and the record might not be considered a match.
2) We could drop the row and locally output a list of the entities that were dropped.
3) We could throw an error and leave it up to the user.

In any case, I think we should decide which approach is best and document that decision.
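
To make the three options concrete, here is a rough sketch of how they might look as a pre-processing step on the sample data above. The names `missing_fields` and `apply_policy` and the policy strings are made up for illustration; none of this is existing clkhash API, and the actual hashing step is left out.

```python
import csv
import io

SAMPLE = """\
INDEX,NAME freetext,DOB YYYY/MM/DD,GENDER M or F
0,Libby Slemmer,1933/09/13,F
1,Garold Staten,,M
2,Yaritza Edman,1972/11/30,
"""

POLICIES = ("keep", "drop", "error")  # hypothetical names for options 1, 2, 3


def missing_fields(header, row):
    """Return the names of the columns whose value is empty or whitespace."""
    return [name for name, value in zip(header, row) if not value.strip()]


def apply_policy(text, policy):
    """Split the CSV rows into (kept, dropped) for a missing-value policy.

    "keep"  - option 1: keep every row and leave empty fields to the hasher
    "drop"  - option 2: drop incomplete rows and report them locally
    "error" - option 3: raise and leave it up to the user
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    kept, dropped = [], []
    for row in reader:
        missing = missing_fields(header, row)
        if not missing or policy == "keep":
            kept.append(row)
        elif policy == "drop":
            # Record the row index and which fields were empty.
            dropped.append((row[0], missing))
        else:  # policy == "error"
            raise ValueError(f"row {row[0]} is missing: {', '.join(missing)}")
    return kept, dropped


if __name__ == "__main__":
    kept, dropped = apply_policy(SAMPLE, "drop")
    print("kept:", [row[0] for row in kept])
    print("dropped:", dropped)
```

With the "drop" policy, only row 0 is kept and rows 1 and 2 are reported together with the name of the empty column, which is the kind of local list option 2 describes.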

Aha! Link: https://csiro.aha.io/features/ANONLINK-55

sjhardy commented 7 years ago

We could have a bit mask that shows which of the fields were present (relative to the input schema), so that subsequent processing would know that the match probabilities need to be interpreted differently. Alternatively, we could return, along with the probability, how many parts of the schema were not matched.
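
A minimal sketch of that bit mask idea, assuming one bit per schema field in schema order. The helper names `presence_mask` and `missing_count` are hypothetical, and how the mask would travel alongside the CLK is not addressed here.

```python
# The three hashed fields from the example schema, in schema order.
FIELDS = ["NAME", "DOB", "GENDER"]

ROWS = [
    {"NAME": "Libby Slemmer", "DOB": "1933/09/13", "GENDER": "F"},
    {"NAME": "Garold Staten", "DOB": "", "GENDER": "M"},
    {"NAME": "Yaritza Edman", "DOB": "1972/11/30", "GENDER": ""},
]


def presence_mask(row, fields=FIELDS):
    """Return an int where bit i is set if fields[i] is non-empty in row."""
    mask = 0
    for i, name in enumerate(fields):
        if row.get(name, "").strip():
            mask |= 1 << i
    return mask


def missing_count(mask, n_fields=len(FIELDS)):
    """Number of schema fields that were absent, given a presence mask."""
    return n_fields - bin(mask).count("1")


if __name__ == "__main__":
    masks = [presence_mask(row) for row in ROWS]
    for row, mask in zip(ROWS, masks):
        print(row["NAME"], format(mask, "03b"), "missing:", missing_count(mask))
    # Fields present in both records 1 and 2 (bitwise AND of the masks).
    print("comparable:", format(masks[1] & masks[2], "03b"))
```

The bitwise AND of two masks gives the fields that both records actually have, so a matcher could use it to judge how much evidence a similarity score is based on. For records 1 and 2 only the name bit survives.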