clararobayo / google-refine

Automatically exported from code.google.com/p/google-refine

Duplicates Facet Cache Error #567

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
Before performing the steps below, create a duplicate of the key column, fully 
populated with the key column's values (e.g. id and id2). Then perform the 
steps on the second column (id2) so that, after you 'blank down', etc., you 
can see the bug more clearly:
1. Facets > Customized facets > Duplicates facet
2. Click true/include on facet
3. Return to column and Edit cells > Blank down
4. Facets > Customized facets > Facet by blank
5. Click true/include (or false/include, whichever you prefer) on facet
6. All > Edit rows > Remove all matching rows
7. Now remove all facets from the left pane, create a duplicates facet again 
(as in step 1), and click true/include on the facet. Notice that the total is 
exactly the same as BEFORE you deleted the duplicates; this is BAD/WRONG. Also 
look at the records and notice that the duplicates are not there (even though 
the facet count indicates they are). However, the orphans of the deleted 
duplicates ARE still there, which is BAD/WRONG: they are no longer duplicate 
records, because their duplicates were deleted in step 6 above!
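
For reference, the outcome these steps should produce can be modelled outside 
Refine. The sketch below is only an illustration in Python, not Refine's code: 
the helper names are made up, the rows are assumed to be sorted so duplicates 
sit next to each other, and the facet is recomputed from the current rows on 
every call (no cache).

```python
from collections import Counter

def duplicates_facet(rows, column):
    """Hypothetical helper: rows whose value in `column` occurs more than once."""
    counts = Counter(r[column] for r in rows if r[column] is not None)
    return [r for r in rows if r[column] is not None and counts[r[column]] > 1]

def blank_down(rows, column):
    """Hypothetical helper: blank a cell that repeats the last value seen above it."""
    previous = None
    for r in rows:
        if r[column] is not None and r[column] == previous:
            r[column] = None
        else:
            previous = r[column]

def remove_blank_rows(rows, column):
    """Hypothetical helper: drop every row whose `column` cell is blank."""
    return [r for r in rows if r[column] is not None]

# Made-up sample data: id2 is a full copy of the key column id.
rows = [{"id": v, "id2": v} for v in ["a", "a", "b", "c", "c", "c", "d"]]

before = len(duplicates_facet(rows, "id2"))  # steps 1-2
blank_down(rows, "id2")                      # step 3
rows = remove_blank_rows(rows, "id2")        # steps 4-6
after = len(duplicates_facet(rows, "id2"))   # step 7, recomputed from scratch

print(before, after)
```

In this model the second count is 0, because every value left in id2 is 
unique. The bug is that Refine keeps showing the first count.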

This is an error that many people would likely not notice, and it took me 
some time to figure out what was happening to my data... I finally found that 
Refine was the culprit, as one half of a duplicate pair was constantly being 
introduced into my supposedly cleaned records (arg.).  Anyway, this should 
likely be made a high-priority fix imo (along with the entire cache 
problem). 

What version of Google Refine are you using?

What operating system and browser are you using?
OS X 10.7.3 / CHROME 18.0.1025

Is this problem specific to the type of browser you're using or it happens
in all the browsers you tried?
Happens in all browsers.

Thanks,

Eric Jarvies

Original issue reported on code.google.com by ericjarv...@gmail.com on 3 May 2012 at 12:08

GoogleCodeExporter commented 9 years ago
I failed to mention above that this is a cumulative effect: if you have 
multiple Refine projects open and are doing a lot of cross-project data 
transfer (e.g. 
cell.cross('someRefineProject','id')[0].cells['name'].value), those other 
projects will be affected by the bug described above in the other Refine 
project.  So until the bug is fixed, I suggest that anyone who removes 
duplicates stop and restart Refine, reload their project, and verify that the 
duplicates facet reports '0', as it should after duplicate records have been 
deleted/removed.
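
An independent way to double-check the result is to export the cleaned 
project and count the key values outside Refine. This is only a sketch: it 
assumes a CSV export with a key column named id, and the function name and 
sample data are made up for illustration.

```python
import csv
import io
from collections import Counter

def duplicate_values(csv_text, column):
    """Hypothetical helper: map each repeated value in `column` to its count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[column] for row in reader if row[column] != "")
    return {value: n for value, n in counts.items() if n > 1}

# Made-up sample standing in for an exported project.
sample = "id,name\n1,a\n2,b\n2,c\n3,d\n"
print(duplicate_values(sample, "id"))
```

An empty result means no duplicates remain in the exported data, whatever the 
facet in the running instance reports.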

Original comment by ericjarv...@gmail.com on 3 May 2012 at 12:14

GoogleCodeExporter commented 9 years ago
Thanks for the bug report.  Do you have a small test data set that we could use 
to reproduce the problem?  That would help make sure we're doing exactly the 
same things you are.

Original comment by tfmorris on 3 May 2012 at 6:48

GoogleCodeExporter commented 9 years ago
Tom,

You can do this with 'any' data set that has a handful of duplicate records in 
it... no matter how big or small. If you use the duplicates facet, remove the 
duplicate records, remove all facets, and create a new duplicates facet, 
you'll get the error, meaning you'll get a duplicates facet showing you there 
are duplicates... but of course there aren't any, because you just deleted 
them a few steps earlier.

Give it a try and you'll see what I mean.  Now then, if you introduce large 
data sets with duplicates, you then get the above problem, plus a nasty caching 
problem.

Original comment by ericjarv...@gmail.com on 3 May 2012 at 7:03