GoogleCodeExporter closed this issue 8 years ago
I think this is supported, but I'm not sure I fully understand your use case.
I assume you don't want to match entities internally within the index? That is,
you assume there are no duplicates in the index?
Can there be duplicates internally within the set of new entities?
Also, is it OK to invoke Duke on these two datasets, or do you need to "push"
the new entities into Duke?
Original comment by lar...@gmail.com
on 4 Nov 2011 at 9:53
I wasn't clear about my use case, so let me elaborate:
We have a database table containing all of our users. Currently I'm
implementing a feature for importing 'friends' from foreign systems. When a
user imports a set of 'foreign' users, I want to match them to users in our
system.
My idea is to fetch all user names once a day and build an index for Duke.
Whenever a user imports 'friends', I want to link them to our users by matching
the user names (and possibly other features such as e-mail).
So I can assume that the 'offline' index is duplicate free and I don't need to
'push' new entities.
Original comment by FMitzl...@googlemail.com
on 4 Nov 2011 at 11:30
Ah, I see. It looks to me like you have two sets of entities, and you want to
link entities across the two sets, rather than actually deduplicate. Am I
right?
If so, that's supported via the record linkage mode, which for each entity in
one set finds the best match in the other set.
Unfortunately, I now see that this mode isn't documented. I created issue 51 to
cover that.
Original comment by lar...@gmail.com
on 4 Nov 2011 at 11:44
What I'm currently working on looks like this:
    public void resolveFriends(MemoryMappedDataSource<User> friends) {
        ...
        Processor proc = new Processor(config);
        ...
        proc.link(config.getDataSources(), friends, 10);
        ...
    }
Where config.getDataSources() returns our database connection and 'friends'
contains an in-memory data source (essentially a ColumnarDataSource wrapped
around a java.util Collection).
The idea is that 'friends' contains a small list of external entities which
should be linked to existing users in the database. The method 'resolveFriends'
is thus called often. I'd like to build an index of the database beforehand
('step 1' in proc.link) so that 'proc.link(...)' only performs the second step.
Does something like this fit into Duke's design?
Original comment by FMitzl...@googlemail.com
on 4 Nov 2011 at 3:09
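For readers following along, the two-step split described above (build the index once, then link each small batch against it) could be sketched roughly as follows. This is not Duke's API and not the patch discussed in this thread: the class `PrebuiltUserIndex`, its `link` method, the bigram candidate lookup, and the Levenshtein-based score are all hypothetical stand-ins chosen only to illustrate the shape of the design.

```java
import java.util.*;

/** Hypothetical illustration only, not Duke's API: index once, link many times. */
final class PrebuiltUserIndex {
    // Inverted index: character bigram -> user names containing that bigram.
    private final Map<String, Set<String>> byBigram = new HashMap<>();

    /** Step 1: run once (for example daily) over all known user names. */
    PrebuiltUserIndex(Collection<String> userNames) {
        for (String name : userNames) {
            for (String gram : bigrams(normalize(name))) {
                byBigram.computeIfAbsent(gram, g -> new HashSet<>()).add(name);
            }
        }
    }

    /** Step 2: run per import; maps each friend to its best match, if any. */
    Map<String, String> link(Collection<String> friends, double threshold) {
        Map<String, String> links = new LinkedHashMap<>();
        for (String friend : friends) {
            String key = normalize(friend);
            // Candidates come from the prebuilt index only, not the database.
            Set<String> candidates = new HashSet<>();
            for (String gram : bigrams(key)) {
                candidates.addAll(byBigram.getOrDefault(gram, Collections.emptySet()));
            }
            String best = null;
            double bestScore = 0.0;
            for (String candidate : candidates) {
                double score = similarity(key, normalize(candidate));
                if (score >= threshold && (best == null || score > bestScore)) {
                    best = candidate;
                    bestScore = score;
                }
            }
            if (best != null) {
                links.put(friend, best);
            }
        }
        return links;
    }

    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    private static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        if (s.length() < 2) {
            if (!s.isEmpty()) grams.add(s);
            return grams;
        }
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    /** Similarity in [0, 1]: 1 minus edit distance over the longer length. */
    private static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / longest;
    }

    private static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }
}
```

In this sketch the constructor plays the role of 'step 1' and `link` plays the role of 'step 2', so a frequently called method like `resolveFriends` would only pay for the second step. Friends with no candidate above the threshold are simply left out of the returned map.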
Ok - I just separated index building from record linking in the processor
(pushed to my clone). Could it be that easy? ;-)
Original comment by FMitzl...@googlemail.com
on 4 Nov 2011 at 3:36
Yes, it probably is. This stuff is easier than it seems. :)
Anyway, it's getting too late on a Friday for serious hacking. I'll take a look
at this tomorrow.
Original comment by lar...@gmail.com
on 4 Nov 2011 at 7:22
Looked through the patch now, and everything seems good. Added to the mainline
repository. Thank you!
Original comment by lar...@gmail.com
on 5 Nov 2011 at 10:07
Original issue reported on code.google.com by
FMitzl...@googlemail.com
on 4 Nov 2011 at 9:17