datatonic / duke

Automatically exported from code.google.com/p/duke

Match entities against a fixed index #49

GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
For our application, I'd like to build an index of our entity database once a 
day and match new entities (online) against this index.

Is this setup supported by Duke?

Original issue reported on code.google.com by FMitzl...@googlemail.com on 4 Nov 2011 at 9:17

GoogleCodeExporter commented 8 years ago
I think this is supported, but I'm not sure I fully understand your use case.

I assume you don't want to match entities internally within the index? That is, 
you assume there are no duplicates in the index?

Can there be duplicates internally within the set of new entities?

Also, is it OK to invoke Duke on these two datasets, or do you need to "push" 
the new entities into Duke?

Original comment by lar...@gmail.com on 4 Nov 2011 at 9:53

GoogleCodeExporter commented 8 years ago
I wasn't clear about my use case, so let me elaborate:

We have a database table containing all of our users. Currently I'm 
implementing a feature for importing 'friends' from foreign systems. When a 
user imports a set of 'foreign' users, I want to match them to users in our 
system. 

My idea is to fetch all user names once a day and build an index for Duke. 
Whenever a user imports 'friends', I want to link them to our users by matching 
the user names (and possibly other features like e-mail, ...).

So I can assume that the 'offline' index is duplicate free and I don't need to 
'push' new entities.

Original comment by FMitzl...@googlemail.com on 4 Nov 2011 at 11:30

GoogleCodeExporter commented 8 years ago
Ah, I see. It looks to me like you have two sets of entities, and you want to 
link entities across the two sets, rather than actually deduplicate. Am I 
right?

If so, that's supported via the record linkage mode, which for each entity in 
one set finds the best match in the other set.
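
To make the "best match in the other set" idea concrete, here is a minimal sketch. It is not Duke's implementation: the class BestMatchLinker, its methods, and the edit-distance similarity are all hypothetical stand-ins chosen for illustration, and a real configuration would combine several properties and comparators.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; none of these names come from Duke.
final class BestMatchLinker {

    // Classic dynamic-programming Levenshtein edit distance (two rolling rows).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]; 1.0 means identical after lowercasing.
    static double similarity(String a, String b) {
        String x = a.toLowerCase();
        String y = b.toLowerCase();
        int longest = Math.max(x.length(), y.length());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(x, y) / longest;
    }

    // Returns the highest-scoring entry of `existing`, or empty if none reaches `threshold`.
    static Optional<String> bestMatch(String incoming, List<String> existing, double threshold) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : existing) {
            double score = similarity(incoming, candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best != null && bestScore >= threshold ? Optional.of(best) : Optional.empty();
    }
}
```

Calling bestMatch once per incoming entity gives at most one link per entity, which is the difference from deduplication, where every record is compared against every other.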

Unfortunately, I now see that that's not documented. Created issue 51 to cover 
that.

Original comment by lar...@gmail.com on 4 Nov 2011 at 11:44

GoogleCodeExporter commented 8 years ago
What I'm currently working on looks like this:

public void resolveFriends(MemoryMappedDataSource<User> friends) {
   ...
   Processor proc = new Processor(config);
   ...
   proc.link(config.getDataSources(), friends, 10);
   ...
}

Where config.getDataSources() returns our database connection and 'friends' 
contains an in-memory data source (essentially a ColumnarDataSource wrapped 
around a java.util Collection).

The idea is that 'friends' contains a small list of external entities which 
should be linked to existing users in the database. The method 'resolveFriends' 
is thus called often. I'd like to build an index of the database beforehand 
('step 1' in proc.link) so that 'proc.link(...)' only performs the second step.
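
As a rough picture of that split, the sketch below builds an index once and then answers many small lookups against it. Everything here is an assumption made for illustration: PrebuiltNameIndex and its methods are hypothetical and are not part of Duke's Processor, and the token-based lookup only stands in for whatever index is really used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not Duke's Processor or its index classes.
final class PrebuiltNameIndex {

    // Inverted index: lowercase token -> known names containing that token.
    private final Map<String, Set<String>> byToken = new HashMap<>();

    // Step 1 (expensive, run e.g. once a day): index every known name.
    PrebuiltNameIndex(List<String> knownNames) {
        for (String name : knownNames) {
            for (String token : tokens(name)) {
                byToken.computeIfAbsent(token, t -> new HashSet<>()).add(name);
            }
        }
    }

    // Splits on whitespace and lowercases; empty fragments are dropped.
    private static List<String> tokens(String s) {
        List<String> out = new ArrayList<>();
        for (String t : s.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Step 2 (cheap, run per import): known names sharing at least one token.
    Set<String> candidates(String incoming) {
        Set<String> result = new HashSet<>();
        for (String token : tokens(incoming)) {
            result.addAll(byToken.getOrDefault(token, Collections.<String>emptySet()));
        }
        return result;
    }

    // Links a small batch against the existing index without rebuilding it.
    Map<String, Set<String>> link(List<String> batch) {
        Map<String, Set<String>> links = new HashMap<>();
        for (String incoming : batch) {
            links.put(incoming, candidates(incoming));
        }
        return links;
    }
}
```

The constructor plays the role of 'step 1' and link(...) the role of the second step, so a method like 'resolveFriends' would only ever call link(...) on an instance built earlier.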

Does something like this fit into Duke's design?

Original comment by FMitzl...@googlemail.com on 4 Nov 2011 at 3:09

GoogleCodeExporter commented 8 years ago
Ok - I just separated index building from record linking in the processor 
(pushed to my clone). Could it be that easy? ;-)

Original comment by FMitzl...@googlemail.com on 4 Nov 2011 at 3:36

GoogleCodeExporter commented 8 years ago
Yes, it probably is. This stuff is easier than it seems. :)

Anyway, it's getting late on a Friday for serious hacking. I'll take a look at 
this tomorrow.

Original comment by lar...@gmail.com on 4 Nov 2011 at 7:22

GoogleCodeExporter commented 8 years ago
Looked through the patch now, and everything seems good. Added to the mainline 
repository. Thank you!

Original comment by lar...@gmail.com on 5 Nov 2011 at 10:07