Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystems into various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Question - how to diagnose hung multi-crawl job #477

Closed · danizen closed this 6 years ago

danizen commented 6 years ago

I have a workflow problem. I want to "resume" my crawler every day, let it run for most of the day, and then "stop" it.
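Roughly, the daily cycle would be driven by something like the sketch below (assuming the stock 2.x launcher script and its -a action flag; the install path and config file name are just placeholders):

# Crontab sketch: resume the crawl in the morning, ask it to stop in the evening.
0 6  * * *  /opt/norconex/collector-http.sh -a resume -c /opt/norconex/myconfig.xml
0 18 * * *  /opt/norconex/collector-http.sh -a stop   -c /opt/norconex/myconfig.xml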

However, the collector JVM does not exit when the crawl is done; it just sits there no longer doing any work. I end up here (the output below is the thread state when I send SIGQUIT to the JVM):

...
ERROR [JobSuite] JobSuite was executed successfully but job state does not reflect that: STOPPING
2018-03-30 16:19:36
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.162-b12 mixed mode):

"DestroyJavaVM" #94 prio=5 os_prio=0 tid=0x00007f4e7800d800 nid=0x16d6 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Java2D Disposer" #87 daemon prio=10 os_prio=0 tid=0x00007f4e5003c000 nid=0x174e in Object.wait() [0x00007f4e606db000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
        - locked <0x00000000a478c228> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
        at sun.java2d.Disposer.run(Disposer.java:148)
        at java.lang.Thread.run(Thread.java:748)

"I/O dispatcher 2" #84 prio=5 os_prio=0 tid=0x00007f4e3801e800 nid=0x1735 runnable [0x00007f4e23055000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x00000000a06c50b8> (a sun.nio.ch.Util$3)
        - locked <0x00000000a06c50a8> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00000000a06c4f70> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:257)
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:106)
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:590)
        at java.lang.Thread.run(Thread.java:748)

"I/O dispatcher 1" #83 prio=5 os_prio=0 tid=0x00007f4e38002800 nid=0x1734 runnable [0x00007f4e236f5000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x00000000a06c6610> (a sun.nio.ch.Util$3)
        - locked <0x00000000a06c6600> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00000000a06c64c8> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:257)
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:106)
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:590)
        at java.lang.Thread.run(Thread.java:748)

"pool-5-thread-1" #82 prio=5 os_prio=0 tid=0x0000000001c1f800 nid=0x1733 runnable [0x00007f4e2242d000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x00000000a06c3cd0> (a sun.nio.ch.Util$3)
        - locked <0x00000000a06c3cc0> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00000000a06c3b78> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:342)
        at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:191)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
        at java.lang.Thread.run(Thread.java:748)

"Service Thread" #7 daemon prio=9 os_prio=0 tid=0x00007f4e780d8800 nid=0x16df runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C1 CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00007f4e780bb800 nid=0x16de waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00007f4e780b8800 nid=0x16dd waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x00007f4e780b7000 nid=0x16dc waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007f4e78083800 nid=0x16db in Object.wait() [0x00007f4e7ca23000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
        - locked <0x00000000a0002160> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:212)

"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f4e7807e800 nid=0x16da in Object.wait() [0x00007f4e7cb24000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
        - locked <0x00000000a0002390> (a java.lang.ref.Reference$Lock)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

"VM Thread" os_prio=0 tid=0x00007f4e78077000 nid=0x16d9 runnable

"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f4e78022800 nid=0x16d7 runnable

"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f4e78024000 nid=0x16d8 runnable

"VM Periodic Task Thread" os_prio=0 tid=0x00007f4e780dd800 nid=0x16e0 waiting on condition

JNI global references: 2097

Heap
 PSYoungGen      total 521216K, used 192486K [0x00000000e0000000, 0x0000000100000000, 0x0000000100000000)
  eden space 518144K, 36% used [0x00000000e0000000,0x00000000eba3c7a0,0x00000000ffa00000)
  from space 3072K, 57% used [0x00000000ffa00000,0x00000000ffbbd148,0x00000000ffd00000)
  to   space 3072K, 0% used [0x00000000ffd00000,0x00000000ffd00000,0x0000000100000000)
 ParOldGen       total 1048576K, used 250562K [0x00000000a0000000, 0x00000000e0000000, 0x00000000e0000000)
  object space 1048576K, 23% used [0x00000000a0000000,0x00000000af4b0b60,0x00000000e0000000)
 Metaspace       used 52235K, capacity 53724K, committed 53888K, reserved 1097728K
  class space    used 5529K, capacity 5919K, committed 6016K, reserved 1048576K

What log messages can I look for that would indicate that the important crawler and collector threads are done?
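(For reference, a minimal sketch of how a dump like the one above can be captured non-interactively with standard JDK tools; the jps grep pattern is only an assumption about how the collector JVM shows up and would need adjusting.)

# Locate the collector JVM and dump its threads.
PID=$(jps -l | grep -i collector | awk '{print $1}')
jstack "$PID"              # prints the full thread dump to stdout
jcmd "$PID" Thread.print   # same information via jcmd
kill -3 "$PID"             # SIGQUIT: the dump goes to the JVM's own stdout/log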

danizen commented 6 years ago

Maybe once this happens, that crawler will not stop:

FATAL [JobSuite] Fatal error occured in job: monitor_lessdepth_crawler
INFO  [JobSuite] Running monitor_lessdepth_crawler: END (Tue Feb 06 17:13:41 EST 2018)
Exception in thread "monitor_lessdepth_crawler" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3236)
        at java.lang.StringCoding.safeTrim(StringCoding.java:79)
        at java.lang.StringCoding.encode(StringCoding.java:365)
        at java.lang.String.getBytes(String.java:941)
        at org.apache.http.entity.StringEntity.<init>(StringEntity.java:70)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:589)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
        at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(AbstractBatchCommitter.java:159)
        at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:233)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commit(ElasticsearchCommitter.java:537)
        at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:274)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:228)
        at com.norconex.collector.core.crawler.AbstractCrawler.resumeExecution(AbstractCrawler.java:190)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:51)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
        at com.norconex.jef4.job.group.AsyncJobGroup.runJob(AsyncJobGroup.java:119)
        at com.norconex.jef4.job.group.AsyncJobGroup.access$000(AsyncJobGroup.java:44)
        at com.norconex.jef4.job.group.AsyncJobGroup$1.run(AsyncJobGroup.java:86)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

danizen commented 6 years ago

OK - at some point it got stopped and/or canceled with a lot of data pending. I'm not sure whether the problem is the number of items in the batch or the size of one of them. Assuming the latter, I see this:

mpluscrawl@dvlbmpluscrawladm: ~/crawldata/monitor/reduced/queue/2018$ find . -type f -size 30k -print 
./03-26/03/28/44/1522092524919000000-add.cntnt
./02-20/04/37/02/1519162622629000000-add.cntnt
./02-20/04/07/01/1519160821818000000-add.meta
./02-20/04/32/37/1519162357541000000-add.meta

I think 30k of metadata for a single item is a lot.
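To tell apart "many queued items" from "a few huge ones", a sketch like this ranks the queued files by size (the queue path is illustrative; note that -size 30k matches files of roughly exactly 30 KiB, while -size +30k matches anything larger):

# Largest queued committer files first.
find ~/crawldata/monitor/reduced/queue -type f -printf '%s\t%p\n' | sort -rn | head -20
# How many queued files exceed 30 KiB.
find ~/crawldata/monitor/reduced/queue -type f -size +30k | wc -l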

danizen commented 6 years ago

So, it looks like that OutOfMemoryError comes from converting the entire commit batch to JSON. My configuration is as follows:

queueSize: 500
commitSize: 2000

I take that to mean that I messed up, and the commitSize should be smaller than the queue size. I also have the following:

.../queue$ find . -type f -name '*.ref' | wc -l
1043

So, I guess I will reduce the commitSize to 250 and see what happens.

danizen commented 6 years ago

OK - it still failed to commit to Elasticsearch, but this time the job terminated anyway. So the bug exists: when an OutOfMemoryError occurs in a running crawler, the collector will not exit later.
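As a stop-gap for the hang itself (separate from tuning the committer), the JVM can be told to exit and/or dump the heap on OOM. This sketch assumes the launch script passes JAVA_OPTS (or an equivalent variable) through to java, which may not hold for every install; the flags themselves are standard HotSpot options (available on 8u92 and later), and the paths are placeholders:

# Kill the whole JVM on any OutOfMemoryError and keep a heap dump for analysis.
export JAVA_OPTS="-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/norconex/heapdumps"
/opt/norconex/collector-http.sh -a resume -c /opt/norconex/myconfig.xml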

danizen commented 6 years ago

This is finally resolved for me, and I will refile it as a more specific issue.