OpenRefine / OpenRefine

OpenRefine is a free, open source power tool for working with messy data and improving it
https://openrefine.org/
BSD 3-Clause "New" or "Revised" License

with large datasets, moving any column(2, 3, 4, etc.) to position #1 causes irrecoverable crash and burn(no undo, etc.) #562

Open tfmorris opened 11 years ago

tfmorris commented 11 years ago

Original author: ericjarv...@gmail.com (May 01, 2012 17:48:44)

What steps will reproduce the problem?

  1. Large datasets
  2. Move a column from position 2 or greater to position 1 (to the right of the All column)
  3. Data corrupts something serious, and is unrecoverable.

What is the expected output? The expected output should be a simple move of a column from one position to another, which works for 'any' position except the first column position.

What do you see instead? See attached images. My record count went from 100,000 down to 14596, and only one column is displaying. I cannot 'Undo' because it keeps producing the error seen in the attached screenshot.

What version of Google Refine are you using? 2.5

What operating system and browser are you using? OS X Chrome 17.0.x

Is this problem specific to the type of browser you're using or it happens in all the browsers you tried? Happens in any browser. Tis a nasty bug.

Please provide any additional information below.

Thanks,

Eric

Original issue: http://code.google.com/p/google-refine/issues/detail?id=562

tfmorris commented 11 years ago

From tfmorris on May 01, 2012 20:43:38: Hi Eric. The reduction in record count is intentional. Google Refine considers "indented" rows (i.e., rows whose leading cells are empty) to be part of the "record" they're indented under. This can be useful for certain types of processing, but can definitely cause unexpected effects if you don't know it's happening.
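To make the row/record distinction concrete, here is a minimal Python sketch of the grouping rule described above. It is a simplified model, not Refine's actual implementation: the helper names `is_blank` and `group_into_records` are hypothetical, and it assumes a row starts a new record whenever its first-column cell is non-blank.

```python
from typing import List, Optional

Row = List[Optional[str]]


def is_blank(cell: Optional[str]) -> bool:
    """Treat None and whitespace-only strings as blank (an assumption of this sketch)."""
    return cell is None or str(cell).strip() == ""


def group_into_records(rows: List[Row], key_column: int = 0) -> List[List[Row]]:
    """Group rows into records: a non-blank key cell starts a new record,
    and a blank key cell attaches the row to the record above it."""
    records: List[List[Row]] = []
    for row in rows:
        if not is_blank(row[key_column]) or not records:
            records.append([row])
        else:
            records[-1].append(row)
    return records


# Five rows, but only two of them have a value in the first column.
rows: List[Row] = [
    ["alice", "a1"],
    [None, "a2"],
    ["", "a3"],
    ["bob", "b1"],
    [None, "b2"],
]
records = group_into_records(rows)
print(len(rows), "rows ->", len(records), "records")
```

Under this model, moving a sparsely populated column into the first position would shrink the record count in the same way, even though no rows are lost.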

Side effects from this may be contributing to the other problems. Refine (or browsers, depending on your point of view) isn't very good about dealing with records which have large numbers of rows because the default paging is by # of records, so 10 or 25 records could have hundreds or thousands of rows, depending on the organization of the data.
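The paging effect can be illustrated with a short Python sketch. It is illustrative only: `rows_per_page` is a hypothetical helper and the record sizes are made-up numbers, not measurements from the reporter's project.

```python
from typing import List


def rows_per_page(record_sizes: List[int], records_per_page: int) -> List[int]:
    """Return how many rows land on each page when paging counts records, not rows."""
    pages = []
    for start in range(0, len(record_sizes), records_per_page):
        pages.append(sum(record_sizes[start:start + records_per_page]))
    return pages


# Hypothetical record sizes (rows per record); two of the records are very large.
record_sizes = [1, 500, 2, 1200, 3]
pages = rows_per_page(record_sizes, records_per_page=2)
print(pages)
```

Each page holds the same number of records, yet the number of rows the browser has to render varies widely from page to page.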

Not sure about the single column display, but it's almost certainly related. You could try switching to the row display mode (instead of the default record display mode) to see if it shows the other columns.

The error dialog that you're getting is strange. That happens when you attempt to undo? Any chance you are running out of memory on your computer? Does the same thing happen when you restart Refine and/or your computer?

Your project is almost certainly recoverable. If you'd like one of us to take a look at it, attach it here and we'll see what we can do.

tfmorris commented 11 years ago

From ericjarv...@gmail.com on May 01, 2012 20:56:40: Tom,

I understand that moving a column to the first position can be used for creating rows within records, but what I am referring to is that Refine does not handle things gracefully should one accidentally place a column in that first slot, for example. It causes corruption that cannot be recovered: 'Undo' becomes non-functional, and because the columns start disappearing when you attempt the undo, there is no way to get back to normal and no way to output/export. Switching between Row/Record has no effect/benefit when the above occurs. So there is no way for me to output/save the project to send it to you.

I have a top-of-the-line Apple tower with 64GB of RAM, fast RAID drives, etc., so it is not hardware performance related.

tfmorris commented 11 years ago

From thadguidry on May 01, 2012 22:40:15: FYI, switching from Records back to Rows might take quite a while, depending on how much data Refine has to churn through. I have seen it take over 10 minutes in a few of my datasets, but it did eventually return to Rows mode.

tfmorris commented 11 years ago

From tfmorris on May 02, 2012 00:37:36: OK, I thought you were describing three problems. Sounds like it's just two (probably related) problems.

Here are a couple of other things to try: