Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Bug - crawl job hangs #478

Closed danizen closed 3 years ago

danizen commented 6 years ago

In https://github.com/Norconex/collector-http/issues/477, I diagnosed a sequence of problems: my crawl job experienced a fatal OutOfMemoryError, and a later attempt to stop the collector failed because the JVM being stopped would not exit.

It seems likely that the crawler job had already entered a terminal state, but the code was still waiting for it to stop cleanly even though it had in fact crashed.

The exception that produced this state was:

FATAL [JobSuite] Fatal error occured in job: monitor_lessdepth_crawler
INFO  [JobSuite] Running monitor_lessdepth_crawler: END (Tue Feb 06 17:13:41 EST 2018)
Exception in thread "monitor_lessdepth_crawler" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3236)
        at java.lang.StringCoding.safeTrim(StringCoding.java:79)
        at java.lang.StringCoding.encode(StringCoding.java:365)
        at java.lang.String.getBytes(String.java:941)
        at org.apache.http.entity.StringEntity.<init>(StringEntity.java:70)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:589)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
        at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(AbstractBatchCommitter.java:159)
        at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:233)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commit(ElasticsearchCommitter.java:537)
        at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:274)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:228)
        at com.norconex.collector.core.crawler.AbstractCrawler.resumeExecution(AbstractCrawler.java:190)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:51)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
        at com.norconex.jef4.job.group.AsyncJobGroup.runJob(AsyncJobGroup.java:119)
        at com.norconex.jef4.job.group.AsyncJobGroup.access$000(AsyncJobGroup.java:44)
        at com.norconex.jef4.job.group.AsyncJobGroup$1.run(AsyncJobGroup.java:86)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

For me, reducing the Elasticsearch committer's commitSize resolved the problem, but it is still worth preventing crawl job hangs.
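
For reference, a minimal sketch of what that change might look like in the collector's XML configuration. The node URL and index name are placeholders, and the exact tag name for the batch size may vary by committer version:

<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <nodes>http://localhost:9200</nodes>
  <indexName>monitor_lessdepth</indexName>
  <!-- Smaller commit batches mean smaller payloads are built in memory
       before being sent to Elasticsearch. -->
  <commitSize>100</commitSize>
</committer>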

essiembre commented 6 years ago

Often, OOM errors can't be recovered from and as such cannot be handled reliably. The JVM application state is already compromised the moment you get one, and killing/restarting with more memory is usually the best approach.

Still, if you want to prevent hangs, the best option is likely the JVM trick of running a kill command (or equivalent) on OOM, via this option from the Oracle JVM documentation:

-XX:OnOutOfMemoryError="<cmd args>; <cmd args>"
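
For example, a sketch of a launch command that forcefully kills the JVM as soon as an OOM occurs. Here %p is the JVM's placeholder for its own process id; the heap size, classpath, main class invocation, and config path are illustrative, not the collector's actual launch script:

java -XX:OnOutOfMemoryError="kill -9 %p" -Xmx2g -cp "./lib/*" \
    com.norconex.collector.http.HttpCollector -a start -c my-crawler-config.xml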

As of Java 8u92, you can also use these JVM arguments (described in Oracle's Java 8u92 release notes):

-XX:+ExitOnOutOfMemoryError
-XX:+CrashOnOutOfMemoryError
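
For example, the same illustrative launch command with the exit-on-OOM flag. ExitOnOutOfMemoryError terminates the JVM on the first OOM, while CrashOnOutOfMemoryError additionally produces a crash dump before exiting:

java -XX:+ExitOnOutOfMemoryError -Xmx2g -cp "./lib/*" \
    com.norconex.collector.http.HttpCollector -a start -c my-crawler-config.xml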

The next major release will require Java 8, so the launch scripts shipped with the collector may be modified to include one of the Java 8 arguments.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.