ddavisqa / google-refine

Automatically exported from code.google.com/p/google-refine

Applying saved operation history to large file fails #331

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Applying JSON code to a file with about 20,000 rows just stops about halfway 
through a reconciliation with standard Freebase. There are no errors and no 
results in the field.

I'm trying to see if there is a row limit for reconciling with Freebase.
I've just finished running this code on a file containing 5,400 rows and will 
keep adding rows to see where the breaking point lies.

Windows 7 64 bit / 4GB Memory
Google Refine version 2.0-r1836
Google Chrome version 9.0.597.84

Original issue reported on code.google.com by vinnygof...@gmail.com on 8 Feb 2011 at 3:54

GoogleCodeExporter commented 8 years ago
What errors, if any, are being reported on the console that the Refine server 
was started from?

There are no hardcoded limits.  It's much more likely that you're hitting an 
error which isn't being handled and reported properly.

Original comment by tfmorris on 8 Feb 2011 at 4:23

GoogleCodeExporter commented 8 years ago
Does the console keep an error log by chance, or do I need to scroll through it 
immediately when it fails?

Original comment by vinnygof...@gmail.com on 8 Feb 2011 at 5:13

GoogleCodeExporter commented 8 years ago
I don't know yet whether Refine keeps any kind of console log history, but I was 
watching the console when it failed during my last test. The line timestamped 
"16:24:18.429" is when the reconcile stopped: the percentage-completed window 
disappeared and no error messages popped up.

Console window contents attached.

Original comment by vinnygof...@gmail.com on 9 Feb 2011 at 4:11

Attachments:

GoogleCodeExporter commented 8 years ago
I don't think any logging is done to a file, although a little log4j 
configuration magic could probably change that pretty easily.
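
For anyone who wants to try, here's a minimal sketch of what that configuration 
might look like, assuming Refine's bundled log4j 1.x; the file path is 
illustrative and the appender names in Refine's actual config may differ:

    # Sketch: send root-logger output to a file instead of (or alongside) the console
    log4j.rootLogger=INFO, file
    log4j.appender.file=org.apache.log4j.FileAppender
    log4j.appender.file.File=refine-server.log
    log4j.appender.file.layout=org.apache.log4j.PatternLayout
    log4j.appender.file.layout.ConversionPattern=%d{HH:mm:ss.SSS} [%t] %-5p %c - %m%n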

At first glance, it looks to me like it thinks things completed successfully.  
Anything special about the data near where the operation stops?

Unless David has some other ideas, it'll probably take a copy of the data file 
and the operation(s) you're attempting for us to be able to debug this.

Original comment by tfmorris on 9 Feb 2011 at 4:22

GoogleCodeExporter commented 8 years ago
How can I tell what row the operation stopped at?

Original comment by vinnygof...@gmail.com on 9 Feb 2011 at 4:49

GoogleCodeExporter commented 8 years ago
I don't see anything in the log that would indicate where it stopped, but I 
thought you had an idea since you said "about halfway through."

Original comment by tfmorris on 9 Feb 2011 at 5:36

GoogleCodeExporter commented 8 years ago
I was just going by the status box at the top of the page during the reconcile; 
it was at 53% complete when it suddenly disappeared. The file I was working 
with at that time was about 20,000 rows, and I was hoping it would have logged 
where it stopped for debugging purposes.

I'm currently trying to apply the exact same JSON code to a project containing 
141,160 rows. This is the largest file I have, so I'm curious to see whether it 
completes successfully.

Thanks for your help with this.

Original comment by vinnygof...@gmail.com on 9 Feb 2011 at 6:45

GoogleCodeExporter commented 8 years ago
Hmm... I'm wondering if he's having the div problem with Chrome 9 described in 
Issue 102, and the reconcile process actually was continuing but he didn't know 
it? I typically just follow along in the command window to be extra sure.

Original comment by thadguidry on 9 Feb 2011 at 7:12

GoogleCodeExporter commented 8 years ago
When the reconciles fail, the command window stays open and there isn't any 
activity.

I needed to stop the process on my biggest file (141,160 rows); it had only 
gotten to about 24% complete, however.

I've decided to be more methodical with my approach. I have 15 files (projects) 
that range from 123 rows to 141,160 rows. I'm going to apply the JSON code to 
the smallest first and work my way up. I've attached the JSON code I've been 
working with.

Below is what I've done so far, how long each run took to complete, and whether 
it completed successfully:

Project containing 123 rows - successful in 4 minutes 
Project containing 1,171 rows - successful in 15 minutes
Project containing 4,797 rows - successful in 1 hour and 5 minutes
Project containing 5,403 rows - successful in 1 hour and 14 minutes
Project containing 9,716 rows - FAILED at 2 hours and 5 minutes (console 
message attached)
Project containing 12,493 rows - currently running (started at 11:54 am on 
2/10/11)

So far the project size really has made a noticeable difference in the 
resources used: CPU usage peaks at about 20% and memory usage at about 1.5 GB.

Thanks again guys.

Original comment by vinnygof...@gmail.com on 10 Feb 2011 at 5:05

Attachments:

GoogleCodeExporter commented 8 years ago
Thanks for the additional info.  The failure looks to be related to HTTP 500 
Server Errors on the reconcile operation.  I wonder if a) they're getting 
retried by Refine and b) this is a transient or hard error.

The processing times strike me as being very long.  I'm guessing that most of 
this is network/reconciliation latency, but still an hour to reconcile 10,000 
items (5,000 rows x 2 columns) seems like a lot.

What Java heap size are you using?  If you've got the memory available, you 
should make it generous enough that it's not an issue.

Original comment by tfmorris on 10 Feb 2011 at 7:06

GoogleCodeExporter commented 8 years ago
The project I was running (12,493 rows) just stopped at about 2 hours and 
3 minutes in. I've attached the contents of the console window.

The Java heap size is whatever the default was for the install. I've got 4 GB 
of memory and only about 1.5 GB is being used during processing, so what do 
you suggest I raise it to? Also, where is that setting changed?

Thanks.

Original comment by vinnygof...@gmail.com on 10 Feb 2011 at 7:20

Attachments:

GoogleCodeExporter commented 8 years ago
The default heap size is 1024M (1 GB).  Try going up to 1536 or 2048, depending 
on what other memory demands you have on the system.

You can find instructions here
http://code.google.com/p/google-refine/wiki/FaqAllocateMoreMemory
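
In short (the wiki page above is authoritative; this is just a sketch), the 
heap is set by passing a larger -Xmx value to the JVM. On Windows that is 
typically a line in refine.ini, something like:

    JAVA_OPTIONS=-Xmx2048M

The exact file and variable name may differ by version, so follow the FAQ if 
this doesn't match your install.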

It definitely should be reporting the out of memory error though (presuming 
that's what's happening), so we'll need to look into that more.

Original comment by tfmorris on 10 Feb 2011 at 8:04

GoogleCodeExporter commented 8 years ago
I installed the Java Development Kit and increased the heap size to 2048M. I 
tried to apply the saved operation to a file that has been failing for me 
(9,716 rows), and the reconciliation stopped again about 2 hours in, on the 
"Title" column. I've attached the contents of the console window; the time of 
failure was 17:01:58.144.

Thanks.

Original comment by vinnygof...@gmail.com on 12 Feb 2011 at 2:16

Attachments:

GoogleCodeExporter commented 8 years ago
Are there any known characters within a data set that can cause issues?

Original comment by vinnygof...@gmail.com on 12 Feb 2011 at 2:19

GoogleCodeExporter commented 8 years ago
tfmorris, looking back at my code, the run() method doesn't catch and log any 
exceptions:

http://code.google.com/p/google-refine/source/browse/trunk/main/src/com/google/refine/operations/recon/ReconOperation.java

So memory exceptions would go unnoticed.

vinnygoffin, if you're comfortable with Java code, would you mind trying to 
modify that file locally, wrapping the run() method's content in a try/catch, 
and seeing if any exception shows up? Thank you for your patience.
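
For reference, a minimal sketch of the kind of change being suggested; the 
class and method names here are placeholders rather than the actual 
ReconOperation source:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Stand-in for the long-running reconciliation process in ReconOperation
    public class ReconProcessSketch implements Runnable {
        private static final Logger logger = LoggerFactory.getLogger("recon");

        @Override
        public void run() {
            try {
                reconcileAllRows(); // stand-in for the existing body of run()
            } catch (Throwable t) {
                // Catch Throwable, not just Exception, so errors such as
                // OutOfMemoryError get logged instead of silently killing the thread
                logger.error("Reconciliation process failed", t);
            }
        }

        private void reconcileAllRows() {
            // the existing reconciliation loop would go here
        }
    }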

Original comment by dfhu...@gmail.com on 12 Feb 2011 at 4:58

GoogleCodeExporter commented 8 years ago
Hi,
I'm not versed in Java, so unfortunately I'm not sure how to do what you're 
asking, sorry.

I have continued with some other testing using a file named RED425 (9,716 
rows); it consistently stops and doesn't complete. I've attached the most 
recent console window contents from this failure.

To see whether it was a piece of data within the file that was giving me the 
headaches, I split the file into two pieces (RED425-1 with 4,858 rows and 
RED425-2, also with 4,858 rows).

Both completed fine, so I'm not thinking it's a weird character or bad data or 
anything like that. Would you agree? I've attached these files as well, in 
case you're interested.

Original comment by vinnygof...@gmail.com on 15 Feb 2011 at 8:52

Attachments:

GoogleCodeExporter commented 8 years ago
Thank you for trying again. I think you're right: it's not a weird character 
that's causing problems. It looks to me (and I'm not surprised) like the 
reconciliation service just got overloaded. So for now, is it OK for you to 
tend the reconciliation process somewhat manually (e.g., splitting it up into 
batches)? By the way, you could do this in a single project by creating a 
custom facet on any column with the expression

ceil(row.index / 4000)

Substitute whatever batch size you want for 4000. Select one value at a time 
in the facet and invoke the reconcile command.
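
To make the batching concrete: since row.index is zero-based, 
ceil(row.index / 4000) evaluates to 0 for row 0, to 1 for rows 1 through 4000, 
to 2 for rows 4001 through 8000, and so on, so selecting each facet value in 
turn reconciles a batch of roughly 4,000 rows.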

Original comment by dfhu...@gmail.com on 15 Feb 2011 at 10:53

GoogleCodeExporter commented 8 years ago
Yup, that's what I'm going to do. I'll split the files into 5000 row segments.

Thanks for all your help with this.

Original comment by vinnygof...@gmail.com on 17 Feb 2011 at 3:37

GoogleCodeExporter commented 8 years ago
Hi again,
One more quick question.

Is there a way to automatically invoke the ceil(row.index / 4000) batches in 
succession?
I was thinking of seeing if I could somehow have it run on the first 4,000 
rows, stop, then continue on the next 4,000, and so on until complete.

Thanks.

Original comment by vinnygof...@gmail.com on 1 Mar 2011 at 3:09

GoogleCodeExporter commented 8 years ago
It's not really possible to say for sure from the information provided, but 
this could be related to issue 440, which would effectively prevent completion 
of any single operation (one step in the undo history) that took longer than 
one hour.

The fix for that is in SVN and will be included in the upcoming v2.5 release.

Original comment by tfmorris on 21 Oct 2011 at 4:09