ddavisqa / google-refine

Automatically exported from code.google.com/p/google-refine

Clustering not finding duplicates when facet is showing groupings #349

GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Download the attached tennis-courts-af.csv.gz file (41,000 rows) and open a
new project using 0 as the starting row
2. Create a text facet on Column2
3. You should see 1,973 choices
4. Click Edit Cells --> Cluster & Edit --> Key collision --> Fingerprint

What is the expected output? What do you see instead?
I would expect to see 1,973 potential clusters; instead I see only 2 clusters.

What version of Google Refine are you using?
Version 2.0 [r1836]

What operating system and browser are you using?
OS X 10.6.6 with Chrome 10.0.648.133
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03-307, mixed mode)

Is this problem specific to the type of browser you're using, or does it happen
in all the browsers you tried?
I also tried Safari and saw the same issue.

Please provide any additional information below.
Love the product, thanks for working on it.

Original issue reported on code.google.com by matt.mac...@gmail.com on 15 Mar 2011 at 12:00

GoogleCodeExporter commented 8 years ago
I think I may have found my own answer: 
http://groups.google.com/group/google-refine/browse_thread/thread/18e64a3e4eedfb09/dc94dbc0d3106441?lnk=gst&q=cluster#dc94dbc0d3106441

I was thinking Refine could do this, but I'm probably using it as a golden
hammer here.
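
For anyone hitting the same confusion: as far as I understand it, key collision clustering only groups values that are spelled differently but reduce to the same key. Rows with identical values are already a single facet choice, so they never show up as a cluster. A minimal Ruby sketch of the idea, assuming a simplified fingerprint (the `fingerprint` and `clusters` helpers and the sample values are made up for illustration and are not Refine's actual implementation):

```ruby
# Simplified approximation of a fingerprint key: trim, lowercase, strip
# punctuation, then de-duplicate and sort the whitespace-separated tokens.
def fingerprint(value)
  value.strip.downcase.gsub(/[[:punct:]]/, "").split(/\s+/).uniq.sort.join(" ")
end

# Group the distinct values by key. Only keys shared by more than one
# distinct value count as a cluster; exact duplicates collapse first.
def clusters(values)
  values.uniq.group_by { |v| fingerprint(v) }.values.select { |group| group.size > 1 }
end

# Hypothetical sample column, not taken from the attached file.
sample = ["Central Park", "Central Park", "park, central", "Riverside"]
p clusters(sample)
```

Here the two identical "Central Park" rows count as one value, so the only cluster is the pair of differently spelled values. A column whose repeats are all exact copies would yield no clusters at all, no matter how many facet choices it has.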

Original comment by matt.mac...@gmail.com on 15 Mar 2011 at 12:41

GoogleCodeExporter commented 8 years ago
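A simple Ruby script solved my problem:

require 'rubygems'
require 'csv'

# Create the output file
CSV.open("courts-deduped.csv", "wb") do |csv|

  # Keep one row per distinct value in the second column (index 1).
  # Later rows overwrite earlier ones, so the last occurrence wins.
  deduped_courts = Hash.new
  CSV.foreach("tennis-courts.csv") do |row|
    deduped_courts[row[1]] = row
  end

  # Write the surviving rows to the output file
  deduped_courts.each do |key, value|
    csv << value
  end
end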

Original comment by matt.mac...@gmail.com on 15 Mar 2011 at 12:59

GoogleCodeExporter commented 8 years ago
Refine now has a separate facet which can be used for identical duplicates.
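
Conceptually, a facet like this marks each row according to whether its value occurs more than once in the column. A rough Ruby sketch of that idea, for illustration only (the `duplicate_flags` helper and the sample column are hypothetical, not Refine code):

```ruby
# Return one boolean per row: true when that row's value occurs
# more than once in the column, false otherwise.
def duplicate_flags(values)
  counts = Hash.new(0)
  values.each { |v| counts[v] += 1 }
  values.map { |v| counts[v] > 1 }
end

# Hypothetical sample column.
column = ["Court A", "Court B", "Court A", "Court C"]
p duplicate_flags(column)
```

Selecting the rows flagged true then isolates the exact duplicates, which is the case key collision clustering does not report.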

Original comment by tfmorris on 8 Oct 2011 at 7:24

GoogleCodeExporter commented 8 years ago

Original comment by dfhu...@google.com on 9 Oct 2011 at 5:30

GoogleCodeExporter commented 8 years ago
This was added by the patch in issue 398 and appeared in Refine 2.1.

Original comment by tfmorris on 12 Dec 2011 at 8:23