alexyuen / duke

Automatically exported from code.google.com/p/duke

Finding more than two duplicates of the same customer in Deduplication- or RecordLinkageMode #144

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Hi, I have a question: can we solve these use cases with Duke?

UseCase1: A single set of customer data with multiple duplicates of the same customer.

One of our customer databases has more than two duplicates per customer, e.g. we 
have 5 "John Smith" records that all refer to one single "John Smith" person.

How could we find out that all 5 records belong together?

UseCase2: We have one set of "clean customer data" and another set 
with e.g. multiple duplicates of a single customer.

In this case we have customer data with a "golden copy" of "John Smith", and 
other "dirty" sources that can contain multiple "John Smiths".

How could we link those multiple "dirty" John Smiths against the "golden" John 
Smith?

Thanks, Vladimir

Original issue reported on code.google.com by vnise...@gmail.com on 9 Oct 2014 at 7:14

GoogleCodeExporter commented 8 years ago
Hi there. Please note that Duke has moved to https://github.com/larsga/Duke

Anyway, the way you would solve UseCase1 with Duke is to run normal 
deduplication. This produces links between pairs of records. If you then 
want to cluster the records into groups in which all records represent the same 
customer, you can feed the links into an EquivalenceClassDatabase using the API. 
Just call db.addLink(id1, id2) for every link. Then call db.getClasses() and 
iterate over the groups.
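
To illustrate why pairwise links are enough to recover whole groups, here is a minimal, self-contained sketch of the clustering step using a union-find structure. It is not Duke's EquivalenceClassDatabase: the class name LinkClusters is hypothetical, and only the method names addLink and getClasses are borrowed from the description above. The return type here is an assumption, so check the real interface before relying on it.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an equivalence-class store; NOT Duke's own class.
class LinkClusters {
    // Maps each record id to its parent in the union-find forest.
    private final Map<String, String> parent = new HashMap<>();

    // Returns the representative id of the group that contains `id`.
    private String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!root.equals(parent.get(root))) {
            root = parent.get(root);
        }
        // Path compression: point every id on the path directly at the root.
        String cur = id;
        while (!cur.equals(root)) {
            String next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    // Records that id1 and id2 were linked, i.e. refer to the same customer.
    void addLink(String id1, String id2) {
        String r1 = find(id1);
        String r2 = find(id2);
        if (!r1.equals(r2)) {
            parent.put(r1, r2);
        }
    }

    // Returns one list of record ids per group of linked records.
    Collection<List<String>> getClasses() {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            groups.computeIfAbsent(find(id), k -> new ArrayList<>()).add(id);
        }
        return groups.values();
    }
}
```

With the links (1,2), (2,3), (3,5) and (4,5), all five "John Smith" records end up in one group even though most pairs were never linked directly, because membership is transitive. The flip side is that a single wrong link merges two otherwise separate customers.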

UseCase2: This is what's known as record linkage. Just link each of the dirty 
data sets individually against the clean dataset. Make sure to use the 
--singlematch setting on the command line, so that each customer only matches 
one customer in the clean dataset.
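
The effect described for --singlematch can be pictured as keeping, for each dirty record, only its best-scoring candidate in the clean set. The sketch below shows that idea only and makes no claim about how Duke implements the option. The SingleMatch and Candidate names and the confidence values are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of single-match filtering; not Duke's code.
class SingleMatch {
    // One candidate link from a record in a dirty source to a clean record.
    static final class Candidate {
        final String dirtyId;
        final String cleanId;
        final double confidence;

        Candidate(String dirtyId, String cleanId, double confidence) {
            this.dirtyId = dirtyId;
            this.cleanId = cleanId;
            this.confidence = confidence;
        }
    }

    // Keeps, for every dirty record, only its highest-confidence candidate.
    static Map<String, Candidate> bestPerDirtyRecord(List<Candidate> candidates) {
        Map<String, Candidate> best = new HashMap<>();
        for (Candidate c : candidates) {
            best.merge(c.dirtyId, c,
                (old, cur) -> cur.confidence > old.confidence ? cur : old);
        }
        return best;
    }
}
```

Several dirty records can still point at the same golden record, which is what this use case needs: every dirty "John Smith" is attached to the one golden "John Smith", but no dirty record is attached to two golden customers.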

Original comment by lar...@gmail.com on 9 Oct 2014 at 7:25