Closed by GoogleCodeExporter 8 years ago
Hi there. Please note that Duke has moved to https://github.com/larsga/Duke
Anyway, the way to solve UseCase1 with Duke is to run normal deduplication.
This produces links between pairs of records. If you then want to cluster the
records into groups where all records in a group represent the same customer,
you can feed the links into an EquivalenceClassDatabase using the API. Just
call db.addLink(id1, id2) for every link, then call db.getClasses() to iterate
over all the groups.
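
To make the clustering step concrete, here is a minimal sketch of the idea in plain Java. It is not Duke's EquivalenceClassDatabase; the class name LinkClusters and its internals are hypothetical stand-ins that borrow the addLink/getClasses naming from the comment above and use a simple union-find to merge linked record IDs into groups.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, NOT Duke's implementation: a union-find sketch of
// how pairwise links can be merged into groups of equivalent records.
class LinkClusters {
    // Maps each record ID to its parent in the union-find forest.
    private final Map<String, String> parent = new HashMap<>();

    // Returns the representative ID of the group that contains id.
    private String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!root.equals(parent.get(root))) {
            root = parent.get(root);
        }
        // Path compression: point every node on the path directly at the root.
        String cur = id;
        while (!cur.equals(root)) {
            String next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    // Records that id1 and id2 refer to the same entity.
    void addLink(String id1, String id2) {
        String r1 = find(id1);
        String r2 = find(id2);
        if (!r1.equals(r2)) {
            parent.put(r1, r2);
        }
    }

    // Returns every group of linked record IDs seen so far.
    Collection<List<String>> getClasses() {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            groups.computeIfAbsent(find(id), k -> new ArrayList<>()).add(id);
        }
        return groups.values();
    }
}
```

With links (a, b), (b, c) and (x, y), the sketch yields two groups: one containing a, b and c, and one containing x and y. Records linked only indirectly (a and c here) end up in the same group, which is the behaviour you want when clustering customers.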
UseCase2: This is what's known as record linkage. Just link each of the data
sets individually against the clean dataset. Make sure to use the --singlematch
setting on the command line, so that each customer matches only one customer in
the clean dataset.
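
The effect described above, where each customer ends up with only one match in the clean dataset, can be sketched as a filter over candidate matches. This is an illustration only and not Duke's code: the Match class, its fields, and the rule of keeping the highest-confidence candidate are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, NOT Duke's implementation: reduce a list of
// candidate matches to a single match per source record.
class SingleMatchSketch {
    // One candidate link from a source record to a clean-dataset record.
    static final class Match {
        final String sourceId;
        final String cleanId;
        final double confidence;

        Match(String sourceId, String cleanId, double confidence) {
            this.sourceId = sourceId;
            this.cleanId = cleanId;
            this.confidence = confidence;
        }
    }

    // Keeps, for each source record, only the candidate with the highest
    // confidence (assumed selection rule for this sketch).
    static List<Match> bestPerSource(List<Match> candidates) {
        Map<String, Match> best = new LinkedHashMap<>();
        for (Match m : candidates) {
            Match current = best.get(m.sourceId);
            if (current == null || m.confidence > current.confidence) {
                best.put(m.sourceId, m);
            }
        }
        return new ArrayList<>(best.values());
    }
}
```

Given candidates (s1, c1, 0.7), (s1, c2, 0.9) and (s2, c3, 0.8), the filter keeps (s1, c2) and (s2, c3), so no source record is linked to more than one clean record.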
Original comment by lar...@gmail.com on 9 Oct 2014 at 7:25
Original issue reported on code.google.com by vnise...@gmail.com on 9 Oct 2014 at 7:14