Maybe once this happens, that crawler will not stop:
FATAL [JobSuite] Fatal error occured in job: monitor_lessdepth_crawler
INFO [JobSuite] Running monitor_lessdepth_crawler: END (Tue Feb 06 17:13:41 EST 2018)
Exception in thread "monitor_lessdepth_crawler" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.lang.StringCoding.safeTrim(StringCoding.java:79)
at java.lang.StringCoding.encode(StringCoding.java:365)
at java.lang.String.getBytes(String.java:941)
at org.apache.http.entity.StringEntity.<init>(StringEntity.java:70)
at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:589)
at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(AbstractBatchCommitter.java:159)
at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:233)
at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commit(ElasticsearchCommitter.java:537)
at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:274)
at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:228)
at com.norconex.collector.core.crawler.AbstractCrawler.resumeExecution(AbstractCrawler.java:190)
at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:51)
at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
at com.norconex.jef4.job.group.AsyncJobGroup.runJob(AsyncJobGroup.java:119)
at com.norconex.jef4.job.group.AsyncJobGroup.access$000(AsyncJobGroup.java:44)
at com.norconex.jef4.job.group.AsyncJobGroup$1.run(AsyncJobGroup.java:86)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
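As a sanity check before blaming the batch itself, it may be worth confirming how much heap the collector JVM actually has. This is just a diagnostic I assume is available (jcmd ships with the JDK; the PID placeholder is whatever the collector process happens to be), not something from the collector's own logs:

  jcmd <collector-pid> VM.flags | grep -o 'MaxHeapSize=[0-9]*'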
OK - at some point it got stopped and/or canceled with a lot of data pending. I'm not sure whether it is the number of items in the batch or the size of one of them. Assuming the latter, I see this:
mpluscrawl@dvlbmpluscrawladm: ~/crawldata/monitor/reduced/queue/2018$ find . -type f -size 30k -print
./03-26/03/28/44/1522092524919000000-add.cntnt
./02-20/04/37/02/1519162622629000000-add.cntnt
./02-20/04/07/01/1519160821818000000-add.meta
./02-20/04/32/37/1519162357541000000-add.meta
I think 30k of metadata is a lot.
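Note that -size 30k only matches files whose size rounds to exactly 30 KiB. To see the worst offenders instead, something like this (assuming GNU find for -printf) lists the largest queued content and metadata files by size in bytes:

  .../queue$ find . -type f \( -name '*.cntnt' -o -name '*.meta' \) -printf '%s %p\n' | sort -rn | head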
So it looks like the OutOfMemoryError happens while the entire commit batch is converted to JSON: the stack trace shows the StringEntity constructor calling String.getBytes, so the whole request body is built as one String and then encoded into a single byte array. My configuration is as follows:
queueSize - 500
commitSize - 2000
I take that to mean that I messed up, and the commitSize should be smaller than the queueSize. I also have the following:
.../queue$ find . -type f -name '*.ref' | wc -l
1043
So, I guess I will reduce the commitSize to 250 and see what happens.
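Rough back-of-the-envelope math for why 2000 was too big, assuming documents around the ~30 KB seen above (my estimate, not anything the committer reports):

  $ echo "$((2000 * 30)) KB"   # commitSize x ~30 KB per document
  60000 KB

That is roughly 60 MB for the request String alone, and the String.getBytes call in the stack trace then allocates another byte[] copy of it, which is where the heap ran out. At commitSize 250 the same estimate is about 7.5 MB.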
OK - it still failed to commit to Elasticsearch, but this time the job terminated anyway. So the bug exists: when an OutOfMemoryError occurs in a running crawler, the collector will not exit afterwards.
This is finally resolved for me, and I will refile as a more specific issue.
I have a workflow problem. I want to "resume" my crawler every day, let it run for most of the day, and then "stop" it.
However, the collector JVM no longer exits when it is done. I end up here (the output is the thread state when I send SIGQUIT to the JVM):
What log messages can I look for that indicate the important crawler and collector threads are done?
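In the meantime, a couple of things I can check by hand (the log path is a placeholder for wherever JEF writes its logs; jps and jstack come with the JDK): whether the per-job END lines like the one above have all been logged, and which non-daemon threads are still keeping the JVM alive:

  grep ': END (' /path/to/jef-logs/*.log
  jps -l | grep -i collector
  jstack <collector-pid> | grep '^"' | grep -v ' daemon '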