potash closed this issue 8 years ago
The reason to use stop words is to trade off recall for fewer comparisons. In other words, we are willing to miss a few possible matches in order to dramatically reduce the number of comparisons we have to make. Excluding stop words generally doesn't hurt matching very much because they are not very distinguishing (i.e., they are common).
This reasoning seems as valid for NGrams as for whole words.
But what ultimately matters to you is performance on your own data.
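To make the trade-off concrete, here is a toy sketch (not dedupe's actual code) of blocking on shared tokens. A token that appears in nearly every record puts almost every pair of records into the same block, so dropping it as a stop word slashes the number of comparisons while usually losing few real matches:

```python
from collections import defaultdict
from itertools import combinations

# Toy records: every one of them contains "chicago".
records = {
    1: "acme corp chicago",
    2: "acme corporation chicago",
    3: "widget co chicago",
    4: "widget company chicago",
}

def candidate_pairs(records, stop_words=frozenset()):
    """Block records on shared tokens and return the candidate pairs."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        for tok in text.split():
            if tok not in stop_words:
                blocks[tok].add(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# "chicago" is in every record, so every pair becomes a candidate: 6 pairs.
all_pairs = candidate_pairs(records)
# Treat "chicago" as a stop word and only the genuinely similar
# pairs (1,2) and (3,4) remain: 2 pairs.
fewer = candidate_pairs(records, stop_words={"chicago"})
print(len(all_pairs), len(fewer))  # → 6 2
```

The same logic applies whether the tokens are whole words or ngrams: a very common ngram creates one giant block, and pruning it mostly removes uninformative comparisons.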
You can get rid of stop words by editing this code: https://github.com/datamade/dedupe/blob/793e2cd33a68777e1d6a101f4116ff44a6b76f3f/dedupe/index.py#L30-L34
and you can tune the threshold here:
https://github.com/datamade/dedupe/blob/793e2cd33a68777e1d6a101f4116ff44a6b76f3f/dedupe/index.py#L24
If you get dramatically better performance, let me know.
Just leaving a comment because I ran into a situation where it would be desirable to toggle stop words for Strings when making a deduper or gazetteer. I'm working with medical abbreviations for products, where 'cl' could mean chlorine, 'si' could mean silicone, and numbers are almost always important to the meaning of the data. I think I'll fork and add this option, and may submit a PR.
Dedupe tells me that it is removing stop words when constructing canopy predicates. However, my String fields don't have stop words in the traditional sense: they are names and dates of birth (treated as text to catch common data-entry typos). Dedupe nevertheless finds and removes stop words from these fields. Is this desirable? If not, is there a way to turn it off? I don't think I want to use ShortString for these fields because I have millions of records to dedupe.