dedupeio / dedupe-examples

Examples for using the dedupe library
MIT License

Should canopy predicate always remove stop words? #39

Closed (potash closed this issue 8 years ago)

potash commented 8 years ago

Dedupe tells me that it is removing stop words when constructing canopy predicates. However, my string fields don't have stop words in the traditional sense: they are names and dates of birth (treated as text to catch common data entry typos). Dedupe still finds and removes stop words from these fields:

INFO:dedupe.index:Removing stop word 20
INFO:dedupe.index:Removing stop word 03
INFO:dedupe.index:Removing stop word -0
INFO:dedupe.index:Removing stop word 6-
INFO:dedupe.index:Removing stop word 07
INFO:dedupe.index:Removing stop word -1
INFO:dedupe.index:Removing stop word 10
INFO:dedupe.index:Removing stop word 19
INFO:dedupe.index:Removing stop word 9-
INFO:dedupe.index:Removing stop word 2-
INFO:dedupe.index:Removing stop word 4-
INFO:dedupe.index:Removing stop word 01
INFO:dedupe.index:Removing stop word 04
INFO:dedupe.index:Removing stop word 08
INFO:dedupe.index:Removing stop word 11
INFO:dedupe.index:Removing stop word 98
INFO:dedupe.index:Removing stop word 02
INFO:dedupe.index:Removing stop word 05
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.8, date_of_birth)
INFO:dedupe.index:Removing stop word IN
INFO:dedupe.index:Removing stop word ES
INFO:dedupe.index:Removing stop word AS
INFO:dedupe.index:Removing stop word RO
INFO:dedupe.index:Removing stop word LI
INFO:dedupe.index:Removing stop word RE
INFO:dedupe.index:Removing stop word HA
INFO:dedupe.index:Removing stop word AL
INFO:dedupe.index:Removing stop word NE
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.6, last_name)

Is this desirable? If not, is there a way to turn it off? I don't think I want to use ShortString for these fields because I have millions of records to dedupe.

fgregg commented 8 years ago

The reason we use stop words is to trade a little recall for far fewer comparisons. In other words, we are willing to miss a few possible matches in order to dramatically reduce the number of comparisons that we have to make. Excluding stop words generally doesn't hurt matching very much because they are common, so they do little to distinguish one record from another.

This reasoning seems as valid for NGrams as for whole words.
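To make the trade-off concrete, here is a rough sketch of n-gram blocking on a toy set of dates. It is not dedupe's implementation: the `bigrams` and `candidate_pairs` helpers, the sample records, and the hand-picked stop bigrams are all assumptions made up for illustration. It only shows how dropping very common bigrams shrinks the set of record pairs that need to be compared.

```python
from collections import defaultdict
from itertools import combinations


def bigrams(text):
    """Return the set of character bigrams in a string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def candidate_pairs(records, stop_grams=frozenset()):
    """Return pairs of record ids that share at least one non-stop bigram.

    Hypothetical helper for illustration; dedupe's canopy predicates
    use TF-IDF similarity rather than this simple shared-token rule.
    """
    index = defaultdict(set)
    for rec_id, text in records.items():
        for gram in bigrams(text) - stop_grams:
            index[gram].add(rec_id)

    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up dates of birth, treated as text.
records = {
    1: "1985-03-07",
    2: "1985-03-01",
    3: "1992-11-07",
    4: "2001-06-20",
}

all_pairs = candidate_pairs(records)
# "19" and "-0" are chosen by hand here as the stop bigrams.
fewer_pairs = candidate_pairs(records, stop_grams=frozenset({"19", "-0"}))

print(len(all_pairs), len(fewer_pairs))
```

In this toy data the likely typo pair (records 1 and 2) is still a candidate after the stop bigrams are removed, while pairs that only shared a very common bigram drop out. Whether that holds for real data is exactly what has to be measured.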

But what ultimately matters to you is performance on your own data.

You can get rid of stop words by editing this code: https://github.com/datamade/dedupe/blob/793e2cd33a68777e1d6a101f4116ff44a6b76f3f/dedupe/index.py#L30-L34

and you can tune the threshold here,

https://github.com/datamade/dedupe/blob/793e2cd33a68777e1d6a101f4116ff44a6b76f3f/dedupe/index.py#L24
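For intuition about what tuning such a threshold does, here is a minimal sketch of frequency-based stop word selection. It does not reproduce the linked code: the function name `find_stop_grams`, the `max_fraction` and `min_count` parameters, and the sample names are hypothetical, and the actual rule in `index.py` may differ.

```python
from collections import Counter


def bigrams(text):
    """Return the set of character bigrams in a string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def find_stop_grams(values, max_fraction=0.5, min_count=2):
    """Return bigrams that occur in more than a threshold number of values.

    The threshold formula is an assumption for illustration only,
    not the one used by dedupe.
    """
    doc_freq = Counter()
    for value in values:
        # Count each bigram once per value (document frequency).
        doc_freq.update(bigrams(value))

    threshold = max(min_count, max_fraction * len(values))
    return {gram for gram, count in doc_freq.items() if count > threshold}


# Made-up last names.
names = ["MARTINEZ", "MARTIN", "GARCIA", "MARINO", "SMITH", "AMARAL"]

default_stops = find_stop_grams(names)
# A fraction of 1.0 means no bigram can exceed the threshold,
# which effectively switches stop word removal off.
no_stops = find_stop_grams(names, max_fraction=1.0)

print(sorted(default_stops), sorted(no_stops))
```

Raising the threshold keeps more tokens in the index, which should improve recall at the cost of more comparisons. Lowering it does the opposite.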

If you get dramatically better performance, let me know.

DylanCulfogienis commented 4 years ago

Just leaving a comment because I ran into a situation where it would be desirable to toggle stop words for Strings when making a deduper or gazetteer. I'm working with medical abbreviations for products, where 'cl' could mean chlorine, 'si' could be silicone, and numbers are almost always important to the meaning of the data. I think I'll fork and add this option, and I may submit a PR.