evz closed this issue 10 years ago
For this, I'm thinking about modifying the Gazetteer class to return n matches.
I think that should be adequate for now.
@evz, Let's spend some time this week thinking about what interface we would want for a Streaming Class.
There are a few ways to do this. Here's one, in the gazetteMatching function (https://github.com/datamade/dedupe/blob/master/dedupe/clustering.py#L154):

1. Sort the dupes by the first item, and then by score within the first item (http://stackoverflow.com/questions/5212870/sorting-a-python-list-by-two-criteria).
2. Loop through the sorted dupes.
3. For each starting id in a dupe pair, take the first n dupes (equivalent to the top n dupes).

You just need to create a dictionary.
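A rough sketch of that approach, for illustration only. It assumes `dupes` is a plain list of `((first_id, second_id), score)` tuples, which may not match the actual data structure used in `clustering.py`, and the function name `top_n_matches` is made up here:

```python
from collections import defaultdict


def top_n_matches(dupes, n=10):
    """Group scored pairs by their first id and keep the n best per id.

    `dupes` is assumed to be an iterable of ((first_id, second_id), score)
    tuples; this shape is an assumption for the sketch, not dedupe's API.
    """
    # Sort by first id, then by descending score within each first id.
    ordered = sorted(dupes, key=lambda d: (d[0][0], -d[1]))

    matches = defaultdict(list)
    for (first_id, second_id), score in ordered:
        # Because of the sort order, the first n pairs seen for an id
        # are its n highest-scoring pairs.
        if len(matches[first_id]) < n:
            matches[first_id].append((second_id, score))
    return dict(matches)


dupes = [((1, "a"), 0.9), ((1, "b"), 0.95), ((2, "c"), 0.4), ((1, "c"), 0.1)]
print(top_n_matches(dupes, n=2))
```

The dictionary is keyed by the starting id, so looking up the candidates for a single record is a plain key access.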
Nice, I particularly like the adjustable threshold. From a practical standpoint, how would you go about determining the appropriate value (and how likely is the default of 0.5 to be a good one for real-world data)?
As an aside, would this parameter be a good candidate to be explored by a library like Hyperopt?: https://github.com/hyperopt/hyperopt
See #13.
Cool. None of the related issues reference using Hyperopt. Is it not suitable for this purpose?
I don't know anything about hyperopt. I doubt that it would be useful because of the sample bias problem discussed in #13. However, if you would like to research it and can concretely propose how it might be used for setting thresholds, I would love to see that. Please put any further conversation on setting thresholds in either #13 or a new issue.
OK.
Currently, calling `match` against the `RecordLink` and `Gazetteer` classes with one candidate only returns the top match. Let's make it return up to 10.