Support record clustering (as opposed to just cell clustering)

GoogleCodeExporter commented 8 years ago

Clustering should also be supported at the record level (it is currently only 
done at the cell level). This would also allow duplicate records to be spotted.

Original issue reported on code.google.com by iainsproat on 23 Jun 2010 at 4:09

GoogleCodeExporter commented 8 years ago

Some related discussion from the May 2010 list archives: 
http://lists.freebase.com/pipermail/freebase-discuss/2010-May/001491.html

I guess this could be done manually for now: make a joined column containing 
all the relevant fields, cluster that column with the existing tools, and then 
do a manual GROUP BY on the resulting dataset using the clustered column, which 
will return only the unique rows. 
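
A minimal sketch of the workaround described above, written in Python rather than inside Refine itself. The `fingerprint` function is only a simplified key-collision key (lowercase, strip punctuation, sort unique tokens) and is an assumption for illustration, not Refine's actual clustering code; the sample records and field names are hypothetical.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Simplified key-collision key: lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    return " ".join(sorted(set(cleaned.split())))

def cluster_records(records, fields):
    """Group records whose joined fields share the same fingerprint key."""
    clusters = defaultdict(list)
    for record in records:
        # The "joined column": all relevant fields concatenated into one string.
        joined = " ".join(str(record.get(field, "")) for field in fields)
        clusters[fingerprint(joined)].append(record)
    return dict(clusters)

# Hypothetical sample data; the first two rows describe the same person.
records = [
    {"name": "Ada Lovelace", "city": "London", "born": "1815"},
    {"name": "Lovelace, Ada", "city": "london", "born": ""},
    {"name": "Alan Turing", "city": "London", "born": "1912"},
]

clusters = cluster_records(records, ["name", "city"])

# Equivalent of the manual GROUP BY: keep one row per cluster key.
unique_rows = [group[0] for group in clusters.values()]
```

Keeping only the first row of each group is the simplest possible choice; it discards whatever the other rows in the group contain.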

The remaining issue, however, would be how to make use of any complementary data 
found among the duplicates... Is there an official wiki talk page for discussing 
this feature yet?
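
On the complementary-data point, one possible approach (purely a sketch, not an existing Refine feature) is to merge each group of duplicates field by field, keeping the first non-empty value seen. The `merge_cluster` helper and the sample group below are hypothetical.

```python
def merge_cluster(group):
    """Merge duplicate records, keeping the first non-empty value per field."""
    merged = {}
    for record in group:
        for field, value in record.items():
            if value not in ("", None) and merged.get(field) in ("", None):
                # Fill a field that is still missing or empty.
                merged[field] = value
            else:
                # Make sure the field exists, even if only with an empty value.
                merged.setdefault(field, value)
    return merged

# Hypothetical group of duplicates with complementary data.
group = [
    {"name": "Lovelace, Ada", "city": "london", "born": ""},
    {"name": "Ada Lovelace", "city": "London", "born": "1815"},
]

merged_row = merge_cluster(group)
```

This rule never resolves conflicts between two different non-empty values; it simply keeps whichever came first, so a real feature would probably need to let the user choose.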

Original comment by fredrik....@gmail.com on 29 Oct 2010 at 1:23

GoogleCodeExporter commented 8 years ago

Original comment by tfmorris on 12 Dec 2010 at 7:48

Changed title: Support record clustering (as opposed to just cell clustering)

TSSlade / google-refine

Support record clustering (as opposed to just cell clustering) #90