dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication, and entity resolution.
https://docs.dedupe.io
MIT License

Make class that can return up to 10 matches #258

Closed evz closed 10 years ago

evz commented 10 years ago

Currently, calling match against the RecordLink and Gazetteer classes with one candidate returns only the top match. Let's make it return up to 10.

fgregg commented 10 years ago

For this, I'm thinking about modifying the Gazetteer class to return n matches.

I think that should be adequate for now.

@evz, Let's spend some time this week thinking about what interface we would want for a Streaming Class.

fgregg commented 10 years ago

a few ways to do this.

here's one

in gazetteMatching function https://github.com/datamade/dedupe/blob/master/dedupe/clustering.py#L154

sort the dupes by the first item and then by score within the first item http://stackoverflow.com/questions/5212870/sorting-a-python-list-by-two-criteria

loop through the sorted dupes

for each starting id in a dupe pair, take the first n dupes (equivalent to the top n dupes, provided the scores are sorted highest first).

you just need to create a dictionary keyed by the starting id
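
A minimal sketch of the steps above, not the actual dedupe code. It assumes the dupes arrive as ((id_a, id_b), score) tuples and that the ids are mutually comparable; the function name top_n_matches and its parameters are hypothetical.

```python
from collections import defaultdict


def top_n_matches(scored_pairs, n=10):
    """Keep the n highest-scoring matches for each starting id.

    scored_pairs: iterable of ((id_a, id_b), score) tuples (assumed shape,
    not necessarily what gazetteMatching receives).
    """
    # Sort by the first id, then by score within that id, highest first.
    ordered = sorted(scored_pairs, key=lambda pair: (pair[0][0], -pair[1]))

    # Dictionary keyed by the starting id, holding at most n matches each.
    matches = defaultdict(list)
    for (id_a, id_b), score in ordered:
        if len(matches[id_a]) < n:
            matches[id_a].append((id_b, score))
    return dict(matches)


pairs = [((1, 'a'), 0.9), ((1, 'b'), 0.95), ((2, 'c'), 0.4), ((1, 'd'), 0.2)]
result = top_n_matches(pairs, n=2)
# {1: [('b', 0.95), ('a', 0.9)], 2: [('c', 0.4)]}
```

Because the list is already sorted, taking the first n entries per id is the same as taking the top n, so no per-id heap or second sort is needed.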

webmaven commented 10 years ago

Nice, I particularly like the adjustable threshold. From a practical standpoint, how would you go about determining the appropriate value (and how likely is the default of 0.5 to be a good one for real-world data)?

As an aside, would this parameter be a good candidate to be explored by a library like Hyperopt?: https://github.com/hyperopt/hyperopt

fgregg commented 10 years ago

See #13.

webmaven commented 10 years ago

Cool. None of the related issues reference using Hyperopt. Is it not suitable for this purpose?

fgregg commented 10 years ago

I don't know anything about hyperopt. I doubt that it would be useful because of the sample bias problem discussed in #13. However, if you would like to research it and can concretely propose how it might be used for setting thresholds, I would love to see that. Please put any further conversation on setting thresholds in either #13 or a new issue.

webmaven commented 10 years ago

OK.