datatonic / duke

Automatically exported from code.google.com/p/duke
0 stars 0 forks source link

Allow multiple matches for record linkage #55

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Currently only the best match per record is propagated to the given 
MatchListener (see Processor.matchRL). In our use case we'd like to collect the 
top-k matches per item, leaving the ultimate choice to the user.

The easy way for implementing this behaviour would be to add a parameter to the 
link(...) method. But maybe there is an efficient way of implementing a 'meta 
matcher' which just implements the 'best match' semantics and passes the 
'final' match to a registered MatchListener?

Original issue reported on code.google.com by FMitzl...@googlemail.com on 7 Nov 2011 at 8:22

GoogleCodeExporter commented 8 years ago
I think defining a callback interface which receives (record, probability) 
values would be a good first step, and, yes, passing it as a parameter to 
link(). The current behaviour could be default.

Haven't time right now to see how it could be further integrated.

Original comment by lar...@gmail.com on 7 Nov 2011 at 8:47

GoogleCodeExporter commented 8 years ago
But isn't this callback interface just the MatchListener interface? The even 
more simplistic approach could be to add a boolean parameter which determines, 
whether 'match(...)' or 'matchRL(...)' should be called.

What do you think about adding a 'finished()' method to the MatchListener 
interface? This could be called after all records were processed. Then we could 
implement a 'BestFitMetaMatchListener' which just stores best matches per 
record and finally (when 'finished' is called) calls the 'matches' methods of a 
given MatchListener.

Original comment by FMitzl...@googlemail.com on 8 Nov 2011 at 4:39

GoogleCodeExporter commented 8 years ago
> But isn't this callback interface just the MatchListener interface?

Actually, now that I think about it you're probably right.

> The even more simplistic approach could be to add a boolean parameter which 
determines, 
> whether 'match(...)' or 'matchRL(...)' should be called.

Hmmm. How would this help? You lost me here.

> What do you think about adding a 'finished()' method to the MatchListener 
interface?

That's already there, except it's called 'endProcessing()'. There's also an 
'endRecord()' which tells you when we've finished looking for matches for a 
specific record. This means you can do top-k matching and only keep the best k 
matches for each record (and discard the rest before processing is over).

 >Then we could implement a 'BestFitMetaMatchListener' which just stores best matches per record and finally 
> (when 'finished' is called) calls the 'matches' methods of a given 
MatchListener.

Looking at it more closely I think you have what you need already. Just set the 
threshold low. That may cause Duke to start querying all your properties, in 
which case you may want to raise the threshold a bit and set a lower 
maybe-threshold. Duke will then pass all (or at least most) matches to your 
listener, and together with the 'endRecord()' call you then have what you need 
for top-k matching.

Should work, no?

Original comment by lar...@gmail.com on 8 Nov 2011 at 6:56

GoogleCodeExporter commented 8 years ago
>> The even more simplistic approach could be to add a boolean parameter which 
determines, 
>> whether 'match(...)' or 'matchRL(...)' should be called.

> Hmmm. How would this help? You lost me here.

If I understand the Processor's 'linkRecords' method correctly, records are 
matched using the 'matchRL' method (Processor.java:185) which only calls the 
registered MatchListeners for the best matching record (Processor.java:275). If 
this is true, no MatchListener has access to other matching entities.

>> What do you think about adding a 'finished()' method to the MatchListener 
interface?

> That's already there, except it's called 'endProcessing()'. There's also an 
>'endRecord()' which tells you when we've finished looking for matches for a 
> specific record. This means you can do top-k matching and only keep the best k
> matches for each record (and discard the rest before processing is over).

Thats great! I just missed that...

>>Then we could implement a 'BestFitMetaMatchListener' which just stores best 
>> matches per record and finally 
>> (when 'finished' is called) calls the 'matches' methods of a given 
MatchListener.

> Looking at it more closely I think you have what you need already. Just set 
the 
> threshold low. That may cause Duke to start querying all your properties, in 
which 
> case you may want to raise the threshold a bit and set a lower 
maybe-threshold.
> Duke will then pass all (or at least most) matches to your listener, and 
together 
> with the 'endRecord()' call you then have what you need for top-k matching.

> Should work, no?

I'm with you, modulo the 'matchRL'-issue mentioned above.

(Code references are relative to 
http://code.google.com/p/duke/source/browse/src/main/java/no/priv/garshol/duke/P
rocessor.java?r=533fe8427a63ea575227fc407be1b609120fbed0)

Original comment by FMitzl...@googlemail.com on 8 Nov 2011 at 8:58

GoogleCodeExporter commented 8 years ago
> If I understand the Processor's 'linkRecords' method correctly, records are 
matched using the 'matchRL' 
> method (Processor.java:185) which only calls the registered MatchListeners 
for the best matching record 
> (Processor.java:275). If this is true, no MatchListener has access to other 
matching entities.

Ah, true. Sorry. I was in a hurry. (Working on something else today.)

There is actually a bigger issue hiding here. The algorithm I have implemented 
for choosing record pairs in record linkage mode is fairly primitive, and many 
others exist. I read a nice overview in 
http://research.microsoft.com/pubs/153478/msr-report-1to1.pdf and have been 
thinking about how to make algorithms pluggable ever since.

(You don't need to read the next bit. It's really me talking to myself to think 
this through.)

The trouble is that plugging in other algorithms isn't trivial, since they may 
need to store considerable temporary state, and so may want to approch matching 
in a different way. Looking through the paper again, Unconstrained ManyMany and 
One-to-Many FirstChoice are easy (indeed, the second is what Duke does now). 

The various approaches to One-to-One Entity Resolution are considerably harder. 
It seems that they all need to either store the entire set of (r1, r2, conf) 
tuples and then analyze it, except some that need to control the order in which 
records are matched. Still, I guess this is no reason why we can't make this 
pluggable.

Hang on a bit, and I'll work in what you need for this.

Original comment by lar...@gmail.com on 8 Nov 2011 at 12:21

GoogleCodeExporter commented 8 years ago
Okay, I reworked things so that there are internal match filters which handle 
the two strategies we currently have. I got rid of the justOne parameter and 
the matchRL method, so that now the code is a lot cleaner.

The linkRecords method now defaults to passing everything through, so you get 
the behaviour you want. Still not sure how to make matching strategies 
choosable from outside, but I'll get to that later.

Original comment by lar...@gmail.com on 9 Nov 2011 at 5:17