abhishek0007 / duke

Automatically exported from code.google.com/p/duke

Feature request: Add merger based on datasource weight #120

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Assuming this use-case:

There are several systems that need to be scanned for duplicates, and after the 
deduplication process a best record needs to be created so that users can update 
the underlying systems. 
By using a datasource for each system one can load rows from each system, but to 
merge properly, each system may have a different priority per field when picking 
values from the duplicates into the best record. In addition, the age of the row 
(date of last modification) in the underlying system must be taken into account.
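The merge rule described above could be sketched roughly as follows. This is not Duke's API; every name here (`Candidate`, `WeightedMerger`, the per-source weight map, the recency decay) is a made-up illustration of ranking duplicate rows by source weight and age, then picking field values from the best-ranked row first. A real implementation might weight per (source, field) pair rather than per source.

```java
import java.util.*;

// Hypothetical sketch: rank duplicate records by (source weight, recency),
// then build the best record by taking each field from the highest-ranked
// candidate that has a value for it. None of these names come from Duke.
public class WeightedMerger {

    // A duplicate record: its source system, last-modified time, field values.
    public record Candidate(String source, long lastModified,
                            Map<String, String> fields) {}

    // Higher weight = more trusted source; newer rows score higher
    // (one possible heuristic, not a prescribed formula).
    static double score(Candidate c, Map<String, Double> sourceWeight, long now) {
        double weight = sourceWeight.getOrDefault(c.source(), 1.0);
        double ageDays = (now - c.lastModified()) / 86_400_000.0;
        return weight / (1.0 + ageDays);
    }

    public static Map<String, String> merge(List<Candidate> dups,
                                            Map<String, Double> sourceWeight,
                                            long now) {
        List<Candidate> ranked = new ArrayList<>(dups);
        ranked.sort(Comparator.comparingDouble(
                (Candidate c) -> score(c, sourceWeight, now)).reversed());
        Map<String, String> best = new HashMap<>();
        for (Candidate c : ranked) {
            c.fields().forEach(best::putIfAbsent); // best-ranked value wins
        }
        return best;
    }
}
```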

Here is some code to take a look:

https://svn.java.net/svn/mosaic~mdm/trunk/open-dm-mi/index-core/src/main/java/com/sun/mdm/index/survivor/

And some info on the project it's used in:
https://www.youtube.com/user/HealthIT2?feature=watch (see the OHMPI-related videos)

Original issue reported on code.google.com by vasilievip on 12 Jun 2013 at 12:49

GoogleCodeExporter commented 8 years ago
Yes, I've been wanting to develop something along these lines for a while now. 
Duke contains utilities to create clusters containing all matching records, but 
for now the code stops there.

I guess what you want is to automatically produce a "gold standard" record for 
each cluster.

Yes, a weight for data source, age of record, and other measures can be used 
for this.

My problem is that I have a limited amount of source data to play with to 
develop this. Do you have some example data that you could share?

Original comment by lar...@gmail.com on 12 Jun 2013 at 12:52

GoogleCodeExporter commented 8 years ago
>I guess what you want is to automatically produce a "gold standard" record for 
each cluster.
Yes, this sounds like what it would be good to get out of deduplication.

>Do you have some example data that you could share?
I'll work on getting a few samples, but I'm not sure they will be better than 
the "limited amount of source data" you had. There is unit test data in mosaic 
which could be more useful: 
https://svn.java.net/svn/mosaic~mdm/trunk/open-dm-mi/index-core/src/test/resources/

I'm also thinking that the datasource may be the same for each system, e.g. one 
or many CSV files, where some column acts as a discriminator that determines the 
source system. So the Duke datasource may need some filters to select the 
proper records. 
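The discriminator idea could look something like this. Duke's actual CSV datasource has no such filter option as far as this thread establishes; the column name `source_system` and the class below are purely illustrative.

```java
import java.util.*;
import java.util.stream.*;

// Illustration only: split rows from one shared CSV by a discriminator
// column (assumed here to be "source_system"), so that each logical data
// source sees only its own records. Not part of Duke's real API.
public class DiscriminatorFilter {

    // Keep only rows whose discriminator column matches the wanted system.
    public static List<Map<String, String>> filterBySource(
            List<Map<String, String>> rows, String column, String system) {
        return rows.stream()
                   .filter(r -> system.equals(r.get(column)))
                   .collect(Collectors.toList());
    }
}
```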

Original comment by vasilievip on 12 Jun 2013 at 1:09

GoogleCodeExporter commented 8 years ago
Yes, having a field with an ID for the data source will make a big difference. 
It's definitely possible right now (as I use it in some applications).

I was thinking of having real data, if possible, so that I can judge the 
effectiveness of the various possible approaches. For example, I came up with 
an idea of using clustering techniques to pick the best values for each 
property, based on distance calculations between the different values. Knowing 
how well this works, and how to combine it with the other alternatives, is 
essentially impossible without being able to experiment with real data.
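One concrete reading of the clustering idea above: for each property, pick the value with the smallest total distance to all the other observed values (the medoid). The sketch below uses Levenshtein edit distance as the distance measure, which is just one plausible choice; nothing here is Duke code.

```java
import java.util.*;

// Sketch: among the values seen for one property across a cluster, pick the
// medoid -- the value with the smallest total edit distance to the others.
public class ValueMedoid {

    // Classic dynamic-programming Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    // The value closest, in total distance, to every other value.
    public static String medoid(List<String> values) {
        String best = null;
        int bestTotal = Integer.MAX_VALUE;
        for (String v : values) {
            int total = 0;
            for (String w : values) total += levenshtein(v, w);
            if (total < bestTotal) { bestTotal = total; best = v; }
        }
        return best;
    }
}
```

Whether medoid selection beats the weight-based alternatives is exactly the question that needs real data to answer.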

Anyway, I could make an attempt based on the two real data sets I have right 
now, but the result would definitely be better if I could get hold of one or 
two more.

Original comment by lar...@gmail.com on 12 Jun 2013 at 1:21

GoogleCodeExporter commented 8 years ago
In the mosaic demos, merging of data is done by picking fields according to 
data source weight and then letting the user adjust this selection if needed. 
One way to improve this is to learn from users and adjust the merger based on 
what they selected, e.g. if some combination of source systems and fields 
contributed to the best record, take this as a pattern (field 1 from system 2, 
field X from system Y) and apply it to further merging.
It seems proper merging can be tricky to implement fully automated. 
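The "learn from users" pattern could be as simple as counting which source the user keeps for each field, then preferring the most-chosen source in later automatic merges. The class and method names below are invented for illustration.

```java
import java.util.*;

// Illustration: every time a user accepts "field F from system S" into the
// best record, bump a counter; later, prefer the system users picked most
// often for that field. Not an existing Duke feature.
public class PatternLearner {
    private final Map<String, Map<String, Integer>> wins = new HashMap<>();

    // Record that the user kept this field's value from this source system.
    public void observe(String field, String source) {
        wins.computeIfAbsent(field, f -> new HashMap<>())
            .merge(source, 1, Integer::sum);
    }

    // The source users have preferred most often for this field, if any.
    public Optional<String> preferredSource(String field) {
        return wins.getOrDefault(field, Map.of()).entrySet().stream()
                   .max(Map.Entry.comparingByValue())
                   .map(Map.Entry::getKey);
    }
}
```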

Original comment by vasilievip on 12 Jun 2013 at 1:34

GoogleCodeExporter commented 8 years ago
Sample datasets
https://github.com/open-city/dedupe/tree/master/test/datasets

Original comment by vasilievip on 12 Aug 2013 at 7:27