Yes, I've been wanting to develop something along these lines for a while now.
Duke contains utilities to create clusters containing all matching records, but
for now the code stops there.
I guess what you want is to automatically produce a "gold standard" record for
each cluster.
Yes, weights for the data source, the age of the record, and other measures
could be used for this.
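A minimal sketch of the weighted approach (the record model, field names, and weight values below are illustrative assumptions, not Duke's actual API): for each property, keep the value coming from the source system with the highest configured weight.

```java
import java.util.*;

// Sketch: build a "gold standard" record for a cluster by preferring
// values from higher-weighted source systems. All names and weights
// are hypothetical; Duke's real record model would differ.
public class GoldenRecord {
    // source-system trust weights (illustrative values)
    static final Map<String, Double> SOURCE_WEIGHT =
            Map.of("crm", 1.0, "billing", 0.6);

    // each record is a map of property -> value, plus a "source" property
    public static Map<String, String> merge(List<Map<String, String>> cluster) {
        Map<String, String> golden = new HashMap<>();
        Map<String, Double> bestWeight = new HashMap<>();
        for (Map<String, String> record : cluster) {
            double w = SOURCE_WEIGHT.getOrDefault(record.get("source"), 0.0);
            for (Map.Entry<String, String> e : record.entrySet()) {
                if (e.getKey().equals("source") || e.getValue() == null
                        || e.getValue().isEmpty())
                    continue; // skip metadata and empty values
                if (w > bestWeight.getOrDefault(e.getKey(), -1.0)) {
                    bestWeight.put(e.getKey(), w);
                    golden.put(e.getKey(), e.getValue());
                }
            }
        }
        return golden;
    }
}
```

Record age could be folded in the same way, e.g. as a tie-breaker when two sources carry the same weight.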
My problem is that I have a limited amount of source data to play with to
develop this. Do you have some example data that you could share?
Original comment by lar...@gmail.com
on 12 Jun 2013 at 12:52
> I guess what you want is to automatically produce a "gold standard" record for each cluster.
Yes, that sounds like what would be good to get out of deduplication.
>Do you have some example data that you could share?
I'll work on getting a few samples, but I'm not sure they will be better than
the "limited amount of source data" you have. There is unit-test data in mosaic
which could be more useful:
https://svn.java.net/svn/mosaic~mdm/trunk/open-dm-mi/index-core/src/test/resources/
I'm also thinking that the data source may be the same for all systems, e.g.
one or more CSV files, where some column acts as a discriminator that
determines the source system. So the Duke data source may need filters to
select the proper records.
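A rough sketch of that filtering step (the column layout and system names are hypothetical; Duke's CSV data source would need an equivalent hook):

```java
import java.util.*;
import java.util.stream.*;

// Sketch: filter rows of a shared CSV by a discriminator column that
// identifies the source system. Column index and system names are
// illustrative assumptions.
public class SourceFilter {
    // keep only rows whose discriminator column matches the wanted system
    public static List<String[]> filterBySystem(List<String[]> rows,
                                                int discriminatorColumn,
                                                String system) {
        return rows.stream()
                   .filter(r -> system.equals(r[discriminatorColumn]))
                   .collect(Collectors.toList());
    }
}
```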
Original comment by vasilievip
on 12 Jun 2013 at 1:09
Yes, having a field with an ID for the data source will make a big difference.
It's definitely possible right now (as I use it in some applications).
I was thinking of having real data, if possible, so that I can judge the
effectiveness of the various possible approaches. For example, I came up with
an idea of using clustering techniques to pick the best values for each
property, based on distance calculations between the different values. Knowing
how well this works, and how to combine it with the other alternatives, is
essentially impossible without being able to experiment with real data.
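The clustering idea above can be sketched as a "medoid" pick: among the candidate values for one property, choose the value with the smallest total distance to all the others. Plain Levenshtein distance stands in here for whatever comparator would actually be used; this is an illustration of the approach, not Duke's implementation.

```java
import java.util.*;

// Sketch: pick the best value for a property as the medoid of the
// candidate values under edit distance (Levenshtein used as an
// example comparator).
public class ValuePicker {
    // standard two-row Levenshtein edit distance
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    // the value closest, in total, to all other candidate values
    public static String pickBest(List<String> values) {
        String best = null;
        int bestTotal = Integer.MAX_VALUE;
        for (String v : values) {
            int total = 0;
            for (String w : values) total += levenshtein(v, w);
            if (total < bestTotal) { bestTotal = total; best = v; }
        }
        return best;
    }
}
```

With real data one could then compare this against the simpler source-weight rule and see which wins per property type.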
Anyway, I could make an attempt based on the two real data sets I have right
now, but the result would definitely be better if I could get hold of one or
two more.
Original comment by lar...@gmail.com
on 12 Jun 2013 at 1:21
In the mosaic demos, merging is done by picking fields by data-source weight
and then letting the user adjust this selection if needed.
One way to improve this is to learn from users and adjust the merger based on
what they selected: e.g. if some combination of source systems and fields
contributed to the best record, take this as a pattern (field 1 from system 2,
field X from system Y) and apply it to further merging.
It seems proper merging can be tricky to implement in a completely automated way.
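The learn-from-users idea could start as simply as counting, per field, which source system the user picks, and reusing the most frequent choice as the default. Everything below (class and method names, the field/system strings) is a hypothetical sketch:

```java
import java.util.*;

// Sketch: learn merge preferences from user corrections by counting
// which source system the user picks for each field, then defaulting
// to the most frequently chosen one. All names are illustrative.
public class MergePatternLearner {
    // field -> (source system -> times the user picked it)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    public void recordChoice(String field, String sourceSystem) {
        counts.computeIfAbsent(field, f -> new HashMap<>())
              .merge(sourceSystem, 1, Integer::sum);
    }

    // preferred system for a field, or null if nothing has been learned yet
    public String preferredSource(String field) {
        Map<String, Integer> c = counts.get(field);
        if (c == null) return null;
        return Collections.max(c.entrySet(),
                               Map.Entry.comparingByValue()).getKey();
    }
}
```

A real system would likely want more context per pattern (e.g. conditioning on which systems are present in the cluster), but the counting core would look similar.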
Original comment by vasilievip
on 12 Jun 2013 at 1:34
Sample datasets
https://github.com/open-city/dedupe/tree/master/test/datasets
Original comment by vasilievip
on 12 Aug 2013 at 7:27
Original issue reported on code.google.com by vasilievip
on 12 Jun 2013 at 12:49