Do you have 64-bit Java installed and also set as your default? In other
words, does your JAVA_HOME env variable point to it? You'll need 64-bit Java
in order to go beyond 3.5 GB RAM usage. See attached screenshot.
Original comment by thadguidry
on 20 Nov 2010 at 4:24
Addressed the x64 Java issue, set Xmx5120 and tried again. After 5 minutes of
loading, I got:
HTTP ERROR 500
Problem accessing /command/core/create-project-from-upload. Reason:
GC overhead limit exceeded
Caused by:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at au.com.bytecode.opencsv.CSVParser.parseLine(Unknown Source)
at au.com.bytecode.opencsv.CSVParser.parseLineMulti(Unknown Source)
at com.google.refine.importers.TsvCsvImporter.getCells(TsvCsvImporter.java:196)
at com.google.refine.importers.TsvCsvImporter.read(TsvCsvImporter.java:163)
at com.google.refine.importers.TsvCsvImporter.read(TsvCsvImporter.java:74)
Original comment by leebel...@gmail.com
on 20 Nov 2010 at 9:03
Is that a typo in your comment or did you really not have a trailing 'M' on
your size? If you used 5120 instead of 5120M, you probably set your heap to
5120 bytes or perhaps 5120 KBytes, neither of which will work very well.
Original comment by tfmorris
on 20 Nov 2010 at 3:11
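To illustrate the suffix question above: the JVM's -Xmx option reads a bare number as bytes, and the k, m and g suffixes as multiples of 1024. The following is a minimal Python sketch, not anything shipped with Refine or Java; the function name parse_heap_size is made up for illustration. It shows how far apart "5120" and "5120m" are.

```python
def parse_heap_size(value: str) -> int:
    """Return the number of bytes denoted by a JVM-style size string such as '5120m'.

    Hypothetical helper for illustration only; it mirrors the documented
    -Xmx convention (no suffix = bytes, k/m/g = powers of 1024).
    """
    multipliers = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    value = value.strip().lower()
    if value and value[-1] in multipliers:
        return int(value[:-1]) * multipliers[value[-1]]
    return int(value)  # no suffix: plain bytes


if __name__ == "__main__":
    for setting in ("5120", "5120m"):
        print(setting, "->", parse_heap_size(setting), "bytes")
```

A heap of 5120 bytes is far too small for any real import, which is why the missing suffix was worth ruling out first.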
Also...look at the 2nd line in your command window when you start refine with:
C:\your_path_to_refine.bat_file\refine /m 5132m
The 2nd line in the command window log should show how much memory was
successfully allocated to the Java.exe process.
Another thing to check: start Task Manager, click on the Processes tab, and
look at how much memory the Java.exe process is using.
Finally, double-check things against our [FaqAllocateMoreMemory].
Original comment by thadguidry
on 20 Nov 2010 at 3:23
I actually did have "Xmx5120m". I then did another run using:
refine.bat /m 6000m >refine.out
and, after about 30 minutes got
HTTP ERROR 500
Problem accessing /command/core/create-project-from-upload. Reason:
Java heap space
Caused by:
java.lang.OutOfMemoryError: Java heap space
The 'Peak working set memory' got to around 6295076K. I've attached the
refine.out file. Any help would be appreciated (and thank you for the prompt
help that you have given so far - it is appreciated).
Original comment by leebel...@gmail.com
on 21 Nov 2010 at 5:32
Can you split spatialinfo2.csv into smaller files and try again? That's a BIG
file. I've tested with a 1 GB file before and 4 columns, but the data was
interspersed and so Refine absorbed it in about 10 mins. Your file, on the
other hand, might just be TOO BIG for the current architecture. Anyone else
have ideas for this fellow?
Original comment by thadguidry
on 21 Nov 2010 at 5:52
Yes, sorry, it is a big file: 17 million records by 4 columns. I can split it,
but it makes the analysis considerably more complicated. Refine looked like THE
ideal way to analyse these records.
The file contains the complete location records of all species occurrences in
the Australian region for the Atlas of Living Australia (www.ala.org.au: I am
the Spatial Data Manager). Columns 1 and 2 are latitude and longitude in
decimal degrees. Column 3 is spatial accuracy (numeric, text and both) and
column 4 is a text description of the location.
I am trying to analyse all the variations of column 3 so that, in conjunction
with columns 1, 2 and 4, I can develop an estimate for spatial uncertainty.
Needless to say, any ideas would be greatly appreciated. I certainly have
greatly appreciated your help on this one.
I went out Saturday to purchase 4 x 2 GB memory sticks in the hope that 8 GB
would suffice. I did BTW have 1 GB set up as paging on an SSD. I will try to
see if 6500 MB gets me closer.
Original comment by leebel...@gmail.com
on 21 Nov 2010 at 8:28
That should be plenty of memory for this case unless Refine is being grossly
inefficient or the text descriptions are enormous. What is the total raw
(uncompressed) size of the input data?
Thad - for your 1M row case, what was the size of the input data and what was
the resulting virtual size of the Refine process? It will vary by data type,
but I'd expect memory usage to be basically linear in this range (1M-17M rows).
Original comment by tfmorris
on 21 Nov 2010 at 12:43
My test filesize was 1 GB on disk. 1 million rows, interspersed data
along 4 columns - it was an injection of the NFDC data, so my column 1
was REALLY long, like 800 chars at times I recall, 20% blanks in
columns 3 & 4.
The virtual size of the Refine (Gridworks) NFDC test project came out
to around 350 - 400 MB. hmm...maybe the blanks helped reduce that
here?
After 10 minutes of importing (back in the 1.1 days) my Java.exe process
peaked at 800 MB on Windows 7 using 64-bit Java and 8 GB RAM for heap.
During my initial testing...the automatic saving project to disk was a
bit too aggressive and David tuned it a bit in Issue-3.
I'm thinking that he could probably use Rapid Miner instead to handle
his analysis. It is a good match for doing exactly that kind of
analysis as well. You might want to download it and give it a try.
We have the link under RelatedSoftware on wiki.
Still, I think that we probably need to go back and really test
Refine's memory utilization (post 1.1) to make sure that it is within
parameters still. I haven't done it to that capacity in a while.
Original comment by thadguidry
on 21 Nov 2010 at 5:48
Thanks for the reference to Rapid Miner. I will take a look. But I hope you are
not admitting defeat for Refine on my data :) I'd be happy for you guys to take
a look at our data as a test case. To me, it looked like a classic fit for
Refine.
The zipped file is 40mb (attachment limit is 10mb) so I've put it here:
http://dl.dropbox.com/u/8650868/spatialinfo2.zip.
Please let me know when you have it (or if you don't want it). Thanks again for
your support on this issue. Impressive.
Original comment by leebel...@gmail.com
on 21 Nov 2010 at 8:39
I downloaded it. Sure enough, so far I get the same results you do, which is
unfortunate. Using Refine /m 6144m, Java.exe climbed to 6545m usage and
seemed to progress well until the upload got to 66% complete. It then tanked,
rapidly jumped to 100% complete, and returned the Error 500 Java heap space,
all within 5 mins.
Thanks, we'll investigate more and let you know. (Diving into profiling now.)
Original comment by thadguidry
on 22 Nov 2010 at 12:08
With JAVA_OPTIONS="-XX:-UseParallelGC" and 8144m, it got to 82% and then:
java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
at java.lang.StringBuilder.<init>(StringBuilder.java:80)
at au.com.bytecode.opencsv.CSVParser.parseLine(Unknown Source)
at au.com.bytecode.opencsv.CSVParser.parseLineMulti(Unknown Source)
at com.google.refine.importers.TsvCsvImporter.getCells(TsvCsvImporter.java:196)
at com.google.refine.importers.TsvCsvImporter.read(TsvCsvImporter.java:163)
at com.google.refine.importers.TsvCsvImporter.read(TsvCsvImporter.java:74)
at com.google.refine.commands.project.CreateProjectCommand.internalInvokeImporter(CreateProjectCommand.java:478)
at com.google.refine.commands.project.CreateProjectCommand.load(CreateProjectCommand.java:341)
at com.google.refine.commands.project.CreateProjectCommand.internalImportFile(CreateProjectCommand.java:327)
at com.google.refine.commands.project.CreateProjectCommand.internalImport(CreateProjectCommand.java:169)
at com.google.refine.commands.project.CreateProjectCommand.doPost(CreateProjectCommand.java:112)
at com.google.refine.RefineServlet.service(RefineServlet.java:170)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1166)
at org.mortbay.servlet.UserAgentFilter.doFilter(UserAgentFilter.java:81)
at org.mortbay.servlet.GzipFilter.doFilter(GzipFilter.java:132)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:938)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:755)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
Original comment by thadguidry
on 22 Nov 2010 at 1:15
Switching testing on my Win7 system to JDK 6u21 and enabling the experimental
G1 garbage collector to get a profile for comparison.
Original comment by thadguidry
on 22 Nov 2010 at 1:41
The issue appears to be within CSVParser or the wiring to cells. Refine was not
originally designed to handle more than 100,000 rows. This will require a
revisit, with possible underlying architecture changes in later revisions. (In
other words, we're admitting defeat on this issue for now, until we can devote
more time to thinking through an architecture redesign that handles larger
datasets such as this.)
Thanks again for trying Google Refine, and don't lose hope; we'll get there, I'm sure. Like Tom says, this file should be easy cheesy. For instance, I was able to fully open your .csv file in Notepad++ with it taking only about 600MB of memory, so it should be possible in Refine as well, in theory. We just need time to track down the (as yet undiscovered) bugs behind it.
Original comment by thadguidry
on 22 Nov 2010 at 4:29
Thanks, guys. I appreciate the work you put in. It would have been nice to
Refine the file, but if it helps identify and address issues, it's not a total
loss.
Original comment by leebel...@gmail.com
on 22 Nov 2010 at 5:12
I'm getting a 404 from the dropbox url. Was it a one-time download?
Thad - since you have the only copy, can you post the memory profile and the
results of your debugging?
Original comment by tfmorris
on 22 Nov 2010 at 5:53
Sorry, thought you had finished with it. I've put it back up at
http://dl.dropbox.com/u/8650868/spatialinfo2.zip
Original comment by leebel...@gmail.com
on 22 Nov 2010 at 6:03
Thanks for putting it back. I've got my copy, although Stefano or David may
want one as well. We can probably arrange to share among the team members if
you want to take your copy down.
From the back of the envelope calculations that I did, assuming that the file
is relatively homogeneous, you're looking at heap requirements of over 8GB to
import the whole file. If you set your max heap size to, say 9 or 10 GB, and
had a sufficiently sized page file, you should be able to get it imported, but
you'll be hitting disk for paging with every pass through the data, which could
put a significant damper on performance (really depends on the access
characteristics of the algorithms that you end up using for your analysis,
although I'd assume the vast majority of them are just linear sweeps through
the rows).
A typical row contains four cells: two doubles, an empty cell, and an 88
character string, totaling 480 bytes on a 64-bit machine (it'll be slightly
less on a 32-bit processor).
I think we can probably do better than this, but for now that's what you're
dealing with...
Original comment by tfmorris
on 23 Nov 2010 at 6:10
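The back-of-the-envelope figure above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the row count (17 million) and the 480 bytes per row come from this thread, the names are illustrative, and it assumes every row costs the same, so it ignores any other heap overhead Refine may add.

```python
ROWS = 17_000_000      # row count reported earlier in this thread
BYTES_PER_ROW = 480    # per-row estimate quoted above for a 64-bit JVM


def estimated_heap_bytes(rows: int, bytes_per_row: int) -> int:
    """Linear estimate: every row is assumed to cost the same number of bytes."""
    return rows * bytes_per_row


total = estimated_heap_bytes(ROWS, BYTES_PER_ROW)

if __name__ == "__main__":
    print(f"{total:,} bytes")
    print(f"{total / 1e9:.2f} GB (decimal), {total / 1024 ** 3:.2f} GiB (binary)")
```

Under the same assumption, halving the row count halves the estimate, which is consistent with half the file fitting into a heap of around 6 GB.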
Thanks. I took down the file again but happy to put it back up if you need it.
I looked at RapidMiner but it's way too broad a system for me to get into for
this one application.
So, I split the file in half and got the first half into Refine with no
problems. Now I'm starting to come to grips with it. Even in half, many
operations take a while - but that is AOK with me. Slow is fine, busted is
something else.
Thanks again for your support with this one!
Original comment by leebel...@gmail.com
on 23 Nov 2010 at 6:44
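For anyone else taking the split-the-file route, here is a minimal Python sketch. It is not part of Refine. The function name split_csv, the ".partN" naming scheme, the UTF-8 encoding, and the assumption that the first line is a header row are all illustrative choices, so adjust them to your data. It uses the standard csv module so that quoted fields containing commas or line breaks stay intact.

```python
import csv
from pathlib import Path


def split_csv(source, rows_per_part, has_header=True):
    """Split a CSV file into numbered part files and return their paths.

    Hypothetical helper for illustration. Each part gets a copy of the
    header row (if any) followed by at most rows_per_part data rows.
    """
    source = Path(source)
    parts = []
    out = None
    writer = None
    with source.open(newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader, None) if has_header else None
        for index, row in enumerate(reader):
            if index % rows_per_part == 0:
                # Start a new part file every rows_per_part data rows.
                if out is not None:
                    out.close()
                name = f"{source.stem}.part{len(parts) + 1}{source.suffix}"
                part_path = source.with_name(name)
                out = part_path.open("w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                if header is not None:
                    writer.writerow(header)
                parts.append(part_path)
            writer.writerow(row)
    if out is not None:
        out.close()
    return parts
```

For a 17 million row file, something like split_csv("spatialinfo2.csv", 8_500_000) would produce two halves; pass has_header=False if the file has no header row.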
I've updated the header to align better with the actual issue. I'm not sure
it's something that's fixable, but I'll leave it open as a data point for some
future person working on memory performance optimization.
Original comment by tfmorris
on 7 Jan 2011 at 4:28
Thanks. When I get time, I'm still plugging away using half the file (using
"refine /m 6000m"): Refine is a very neat tool.
Original comment by leebel...@gmail.com
on 9 Jan 2011 at 9:18
Issue 346 has been merged into this issue.
Original comment by dfhu...@gmail.com
on 11 Mar 2011 at 7:48
Hello,
I am also a biologist working with large files, and I have the same issue that
is discussed above when trying to load a 3.0G file in Refine. I think that my
database contains many, many more cells than the example above. That is, many
more columns but fewer rows. Has there been any progress on fixing this bug
since March? Thanks!
Original comment by dylan.o....@gmail.com
on 28 Jun 2012 at 8:30
Dylan,
You may want to look at Taverna http://www.taverna.org.uk/ for your specific
needs instead.
Original comment by thadguidry
on 28 Jun 2012 at 3:30
Hello
I was trying to cluster rows using the text clustering feature on 50,000 rows.
Initially my file had around 800,000 rows, but I reduced it to 50,000 rows and
also increased the VM memory to 5120M on my machine. I have a Mac with 8GB of
memory. I wonder if there is a feasible solution for clustering rows? My file
size is 1.1 MB. Has anyone in the past been successful in using the text mining
feature with large files? Any comments or suggestions are greatly appreciated.
Veeresh
Original comment by vthumm...@gmail.com
on 9 Apr 2014 at 3:05
Hello,
I work at a software company in Brazil, and we are currently developing a
tool for data cleansing, aimed only at datasets on the order of millions of
records. I would love to hear about your problems and help with them. We will
have a free version of our tool.
I leave my email: pedro.magalhaes@stoneage.com.br
vthumm, dylan, leebel feel free to contact me.
Original comment by pedror...@gmail.com
on 18 Sep 2014 at 8:09
Original issue reported on code.google.com by
leebel...@gmail.com
on 20 Nov 2010 at 3:59