Support checkpointing of long running operations like Add column by fetching URL

GoogleCodeExporter commented 8 years ago

I'm trying to add a new column based on the values of another column via a web 
service, using the "Add column by fetching URLs" feature. I have about 200,000 
rows and have tried multiple times with different memory options, but every 
time I get an OutOfMemoryError exception after a couple of hours (or days). In 
my case I have a list of Freebase movie IDs (e.g. /m/072x72) and I'm trying to 
fetch the movie descriptions via the Freebase web service (e.g. 
http://api.freebase.com/api/experimental/topic/standard?id=/m/072x72)

What steps will reproduce the problem?
1. Load a file with about 200,000 rows
2. Choose a web service and add a new column using "Add column by fetching URLs" 

What is the expected output? What do you see instead?
The job completes or, in case of an exception, there is at least a resume 
option (this would be very useful for long-running tasks).

Exception in thread "Thread-8" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Unknown Source)
        at java.lang.AbstractStringBuilder.expandCapacity(Unknown Source)
        at java.lang.AbstractStringBuilder.insert(Unknown Source)
        at java.lang.StringBuffer.insert(Unknown Source)
        at com.google.refine.util.ParsingUtilities.readerToString(ParsingUtilities.java:109)
        at com.google.refine.util.ParsingUtilities.inputStreamToString(ParsingUtilities.java:96)
        at com.google.refine.operations.column.ColumnAdditionByFetchingURLsOperation$ColumnAdditionByFetchingURLsProcess.fetch(ColumnAdditionByFetchingURLsOperation.java:283)
        at com.google.refine.operations.column.ColumnAdditionByFetchingURLsOperation$ColumnAdditionByFetchingURLsProcess.run(ColumnAdditionByFetchingURLsOperation.java:223)
        at java.lang.Thread.run(Unknown Source)

What version of Google Refine are you using?
google-refine-2.5-r2407

What operating system and browser are you using?
Windows 7 64 Bit, Firefox, Chrome, Java 64 Bit

Is this problem specific to the type of browser you're using or it happens in 
all the browsers you tried?
Not related to the browser

Please provide any additional information below.
Using 64 Bit Java (JDK)
Google Refine options:
-Xms512M
-Xmx4096M
-XX:PermSize=128m 
-XX:MaxPermSize=192m

Original issue reported on code.google.com by demonsteam on 30 May 2012 at 4:30

GoogleCodeExporter commented 8 years ago

Sorry about the problem.  You don't happen to have the server console log for 
this run, do you?  Refine logs the actual amount of heap that it thinks it's 
using when it starts up.  I'd just like to make sure that it's actually 
recognizing the VM options that you're giving it.

Also, does the amount of time that it runs vary with the amount of memory that 
you give it?  If not, that would seem to indicate some other problem (lack of 
memory and/or page file?).

As an aside, if all you want is the descriptions, the blob API should give you 
that without all the other stuff (although I can't imagine that it's increasing 
the memory requirements all that much).

Resumable operations for something long-running like this are a good 
suggestion, but they would require significant infrastructure support (undo, 
etc.).

Original comment by tfmorris on 30 May 2012 at 7:30

GoogleCodeExporter commented 8 years ago

Hi, and thank you for your fast response. I don't have the server log anymore, 
but there was nothing abnormal in it except the exception posted above. I 
attached jvisualvm during the last run and can confirm that the VM options 
were correctly recognized (you could see the redundant Xms and Xmx options, 
both the default and the custom settings, but they were successfully 
overridden by my custom settings). I can reproduce this exception at any time.
I couldn't find any dependency between the running time and the memory given, 
but I did find one between the memory and the delay between requests. My last 
run was interrupted after about 48 hours at 69% progress (the delay between 
requests was set to 600 milliseconds, with the memory options listed above).
I wasn't aware of the blob API, but this is a good suggestion and I'll look 
into it as well. The descriptions were only an example of the data that I'm 
trying to parse; I need more of the details contained in the JSON response.
In my opinion, implementing resumable operations wouldn't be that hard or 
time-consuming: you only need to store the last operation executed with its 
parameters and the index of the last successfully processed row, so that you 
can continue after a server crash or some other disaster. I would be 
interested to know what happens to the already downloaded temporary data and 
where it is stored (hopefully not in memory?). Thank you in advance.
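
The resume idea described above (remember the last successfully processed 
index and continue from there) can be sketched roughly as follows. This is 
only an illustration under assumptions, not how Refine works internally: the 
checkpoint file, its JSON layout, the `fetch` callback and the `every` 
interval are all hypothetical names chosen for the example.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    return {"next_index": 0, "results": []}


def save_checkpoint(path, state):
    """Write to a temp file and rename it, so a crash mid-write
    never leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def fetch_all(urls, fetch, path, every=100):
    """Fetch each URL in order, starting at the last checkpointed index.

    `fetch` is a hypothetical callable taking a URL and returning a string.
    """
    state = load_checkpoint(path)
    for i in range(state["next_index"], len(urls)):
        state["results"].append(fetch(urls[i]))
        state["next_index"] = i + 1
        if state["next_index"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["results"]
```

After a crash, calling `fetch_all` again with the same checkpoint path 
re-fetches at most the `every - 1` rows processed since the last save. The 
sketch rewrites the whole result list on every save, which would be too slow 
for gigabytes of responses; a real implementation would more likely append 
results to disk, and it says nothing about integrating with undo/redo history.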

Original comment by demonsteam on 31 May 2012 at 12:27

GoogleCodeExporter commented 8 years ago

Refine operates with all data in memory, so you need to have enough heap & 
virtual memory to hold your entire project data, plus any working storage.  
Looking at your numbers makes me think that your heap is underconfigured.  The 
example query that you give returns 18KB of data.  Multiplying that by 200K 
rows is going to use up your entire 4GB heap before taking into account object 
overhead, working storage, etc.
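
To make that estimate concrete, here is the arithmetic as a small Python 
sketch. The 18KB response size, the 200K row count and the 4096M heap come 
from this thread. The factor of 2 bytes per character is an assumption: it 
supposes the responses are mostly ASCII text held as Java strings in UTF-16, 
and it ignores per-object overhead entirely.

```python
RESPONSE_BYTES = 18 * 1000       # ~18KB per response, from the example query
ROWS = 200_000                   # rows in the project
HEAP_BYTES = 4096 * 1024 ** 2    # -Xmx4096M

raw_bytes = RESPONSE_BYTES * ROWS
# Assumption: mostly-ASCII text kept as UTF-16 strings, 2 bytes per character.
in_heap_bytes = raw_bytes * 2


def gib(n):
    """Convert a byte count to GiB."""
    return n / 1024 ** 3


print(f"raw payload: {gib(raw_bytes):.2f} GiB")
print(f"as UTF-16:   {gib(in_heap_bytes):.2f} GiB")
print(f"heap (-Xmx): {gib(HEAP_BYTES):.2f} GiB")
```

Under these assumptions the raw payload alone is about 3.35 GiB and the 
in-memory form is about 6.7 GiB, against a 4 GiB heap.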

I would at least double the size of the heap or split the operation into 
chunks.  Attempting a 3 day long operation without checkpoint/restart is 
probably optimistic, so splitting things up is probably the best choice.
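
Splitting the work into chunks could look like the sketch below. It is a 
generic illustration, not a Refine feature: the `chunk_ranges` helper and the 
chunk size of 25,000 rows are assumptions made for the example, and how each 
slice is actually loaded and processed is left open.

```python
def chunk_ranges(total_rows, chunk_size):
    """Yield (start, end) half-open row ranges that cover total_rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_rows, chunk_size):
        yield start, min(start + chunk_size, total_rows)


# 200,000 rows in slices of 25,000 rows gives 8 independent runs.
ranges = list(chunk_ranges(200_000, 25_000))
print(len(ranges), ranges[0], ranges[-1])
# prints: 8 (0, 25000) (175000, 200000)
```

With the 600 millisecond delay mentioned earlier in this thread, each 
25,000-row slice spends about 4.2 hours on the delay alone, so a failure 
costs one slice rather than days of work.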

BTW, I don't mean to discourage you from implementing resumable operations.  
We're always happy to receive patches from folk.  I think it may be more 
complex than you realize, but if you're up to the task we'll be happy to review 
the results.

Original comment by tfmorris on 31 May 2012 at 3:00

GoogleCodeExporter commented 8 years ago

I understand now why I get the OutOfMemoryError: my data is too large to be 
kept in the heap. But when you restart Google Refine, you are still able to 
open existing projects, so the data is stored somewhere, isn't it? 
But you are right, I'm not familiar with the architecture of the project and 
cannot say for sure whether such a feature is easy to implement or not. I had 
already thought about a strategy of processing the data in chunks, but hoped 
for an easier solution.
I don't think I have the time to work on the project and develop a patch for 
resumable operations, but if I do find the time, I will certainly let you 
know. Thank you for your time!

Original comment by demonsteam on 31 May 2012 at 3:20

GoogleCodeExporter commented 8 years ago

Original comment by tfmorris on 18 Sep 2012 at 8:15

Changed title: Support checkpointing of long running operations like Add column by fetching URL
Changed state: Accepted
Added labels: Type-Enhancement
Removed labels: Type-Defect

Illablu / google-refine

Support checkpointing of long running operations like Add column by fetching URL #580