Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

OutOfMemoryError on fetch stage #300

Closed: popthink closed this issue 8 years ago

popthink commented 8 years ago
java.lang.OutOfMemoryError: Java heap space
        at com.norconex.commons.lang.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:357)
        at com.norconex.commons.lang.io.CachedInputStream.cacheToFile(CachedInputStream.java:527)
        at com.norconex.commons.lang.io.CachedInputStream.realRead(CachedInputStream.java:325)
        at com.norconex.commons.lang.io.CachedInputStream.read(CachedInputStream.java:301)
        at java.io.InputStream.read(InputStream.java:101)
        at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2146)
        at org.apache.commons.io.IOUtils.copy(IOUtils.java:2102)
        at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2123)
        at org.apache.commons.io.IOUtils.copy(IOUtils.java:2078)
        at com.norconex.collector.http.fetch.impl.GenericDocumentFetcher.fetchDocument(GenericDocumentFetcher.java:138)
        at com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.executeStage(DocumentFetcherStage.java:50)
        at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
        at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
        at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:300)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:488)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:378)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:736)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

I wrote code that runs crawlers periodically.

The code starts 50 new crawler threads (one per web site) every hour.

But this error occurs once it has been running for 3~4 days.

Thank you.

essiembre commented 8 years ago

Either reduce the number of threads or increase the memory allocated to the Java virtual machine by adding the -Xmx flag in the collector-http.sh (or .bat) file, like this (put your own value):

java -Xmx2048m ...

I do not know the details of the code you wrote, but is the program that runs crawlers periodically a Java process that is always running? If so, it is recommended you launch each collector instance as an external process. Making sure the collector runs in its own JVM instance each time ensures memory is cleared between runs. Using the OS-native way to schedule runs instead (e.g., cron jobs or Windows Task Scheduler) is a good way to ensure this.
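If your scheduler has to remain a single long-running Java program, one way to follow this advice is to spawn the launch script itself as a child process, so each run gets its own JVM and its heap is fully reclaimed when the process exits. A minimal sketch (the install path, config path, and -a/-c arguments are placeholders for your own setup, not taken from this thread):

import java.io.IOException;

public class ScheduledCollectorRun {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Run the collector in a separate JVM; its memory is released when it exits.
        ProcessBuilder pb = new ProcessBuilder(
                "/opt/norconex/collector-http.sh",          // path to your install
                "-a", "start",                              // action
                "-c", "/opt/norconex/configs/site1.xml");   // your collector config
        pb.inheritIO(); // forward the collector's console output to this process
        int exitCode = pb.start().waitFor();
        System.out.println("Collector run finished with exit code " + exitCode);
    }
}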

popthink commented 8 years ago

I tested it for 7 days.

And it seems I have found a solution. :)

--Env--
Crawler thread count: 25 Instance * 5 Thread
Delay: 100ms
OS: Debian 7
Java: Java 7, -Xmx10G

Cleanup code, run after one cycle has finished:

// Unregister the crawler's MBean so the crawler instance can be garbage collected
MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
mbs.unregisterMBean(new ObjectName(
        "com.norconex.collector.http.crawler:type=" + idOfCrawlerInstance));

I unregistered the crawler's descriptor from the MBeanServer so it can be garbage collected (maybe?).

Now the OOM no longer occurs.

I'm not sure it is a proper solution, but it works in my case.
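In case it helps someone else, here is the same cleanup as a self-contained helper. The isRegistered check and the exception handling are additions beyond what I actually ran, and the id string depends on how you configured your crawlers:

import java.lang.management.ManagementFactory;

import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public final class CrawlerJmxCleanup {

    // Unregisters the JMX MBean registered for a crawler so the crawler
    // instance can be garbage collected between cycles.
    public static void unregisterCrawlerMBean(String idOfCrawlerInstance) {
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        try {
            ObjectName name = new ObjectName(
                    "com.norconex.collector.http.crawler:type=" + idOfCrawlerInstance);
            if (mbs.isRegistered(name)) {
                mbs.unregisterMBean(name);
            }
        } catch (JMException e) {
            // Invalid name or nothing registered under it; nothing to clean up.
        }
    }
}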

Thank you :)

essiembre commented 8 years ago

Thanks for providing this valuable feedback. Since JMX support does not benefit the vast majority of users, it is now disabled by default in the latest snapshot release. It can be enabled by adding the JVM argument -DenableJMX=true. Maybe this will bring a slight performance improvement too.
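For example, combined with the memory setting mentioned above, the launch command in collector-http.sh (or .bat) would then look something like this:

java -DenableJMX=true -Xmx2048m ...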