Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Collector process not exiting #708

Closed (MathewLeung closed this issue 4 years ago)

MathewLeung commented 4 years ago

Hi

We have observed an intermittent problem where the collector process does not exit even though the log indicates the crawl has completed. The collector (v2.9.1) runs in a docker container based on instructions found here. In the start script we echo a success or error message when the collector process returns, but occasionally that message never appears. Please see the log entries attached below.

A 2020-07-11T16:53:11.095961101Z INFO  [GoogleCloudSearchCommitter] Document (text/html) indexed (213 KB / 1288ms): https://a.domain.com/index.html
A 2020-07-11T16:53:11.096007563Z INFO  [GoogleCloudSearchCommitter] Indexing Service release reference count: 1
A 2020-07-11T16:53:11.096015720Z INFO  [GoogleCloudSearchCommitter] Stopping indexingService: 0 
A 2020-07-11T16:53:11.097438429Z Jul 11, 2020 4:53:11 PM com.google.enterprise.cloudsearch.sdk.BatchRequestService shutDown 
A 2020-07-11T16:53:11.097483701Z INFO: Shutting down batching service. flush on shutdown: true 
A 2020-07-11T16:53:14.808163240Z INFO  [GoogleCloudSearchCommitter] Shutting down (took: 3712ms)! 
A 2020-07-11T16:53:14.808202056Z INFO  [GoogleCloudSearchCommitter] Indexing Service reference count: 0 
A 2020-07-11T16:53:14.946960495Z INFO  [AbstractCrawler] Crawler A: 4310 reference(s) processed. 
A 2020-07-11T16:53:14.947056707Z INFO  [CrawlerEventManager]          CRAWLER_FINISHED 
A 2020-07-11T16:53:14.947143098Z INFO  [AbstractCrawler] Crawler A: Crawler completed. 
A 2020-07-11T16:53:14.948546999Z INFO  [AbstractCrawler] Crawler A: Crawler executed in 5 hours 1 minute 18 seconds. 
A 2020-07-11T16:53:14.948568748Z INFO  [SitemapStore] Crawler A: Closing sitemap store... 
A 2020-07-11T16:53:14.953454126Z INFO  [JobSuite] Running Crawler A: END (Sat Jul 11 11:51:56 UTC 2020) 
<!-- Occasionally the log entries stop at the line above, and the VM process stalls while consuming minimal CPU. We expect the log entries below to appear after the index items are committed and the crawl completes successfully -->
2020-07-15Txx:xx:xxZ INFO  [JobSuite] Running Cloud Search HTTP Collector: END 
2020-07-15Txx:xx:xxZ crawl process exited successfully   <--- echoed from the start.sh after command 'collector-http.sh -a start -c ...', but does not always reach this point

start.sh

#!/bin/bash
#set -x
#set -e

${CRAWLER_HOME}/collector-http.sh -a start -c ${CRAWLER_HOME}/config/crawler-config.xml
if [ $? -eq 0 ]
then
    echo "$(date) crawl process exited successfully";
else
    echo "$(date) - Error occurred running crawler.";
fi

Is there a good explanation for why this happens? The collector uses the GoogleCloudSearchCommitter to create index items for the pages crawled; I wonder whether the committer plays a role here.

The start.sh contains some scheduling logic to kick off the crawl process regularly, so not receiving the return from the collector process breaks the continuous crawl design. Any ideas on how to troubleshoot this problem would be a great help, cheers.
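
As an interim guard (not part of the setup described here), wrapping the collector call with GNU coreutils' timeout would at least keep a stalled run from blocking the schedule; a minimal sketch, assuming an 8-hour upper bound on crawl duration:

#!/bin/bash
# Hypothetical watchdog variant of start.sh: terminate the collector if it has
# not returned within 8 hours (an assumed upper bound; adjust to real crawl times).
# Note this is a blunt guard and will abort a crawl that is genuinely still running.
timeout --signal=TERM --kill-after=5m 8h \
    "${CRAWLER_HOME}/collector-http.sh" -a start -c "${CRAWLER_HOME}/config/crawler-config.xml"
status=$?
if [ "$status" -eq 124 ]
then
    echo "$(date) - crawl timed out and was terminated by the watchdog."
elif [ "$status" -eq 0 ]
then
    echo "$(date) crawl process exited successfully"
else
    echo "$(date) - Error occurred running crawler (exit code $status)."
fi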

essiembre commented 4 years ago

Tough one. The only pointer that comes to mind is a file lock not being released. Can you reproduce it with any other committer (like XMLFileCommitter)? If so, please attach a config that can reproduce it. I would investigate whether there are any file handles (or connections) not being released when you see that behavior. On Linux, given the process id, you can find how many open file descriptors are used by that process with lsof. There are other tools as well, I am sure.

If that is the problem, I recommend using https://file-leak-detector.kohsuke.org/ to troubleshoot which part of the code is responsible.
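
For example, a rough way to check the descriptor count for the collector's JVM (the pgrep pattern used to find the PID is an assumption; adjust it to your setup):

# Find the collector's JVM process id (the pgrep pattern is an assumption).
PID=$(pgrep -f collector-http)

# Total number of open file descriptors held by that process.
lsof -p "$PID" | wc -l

# Most frequently held files/sockets, to spot what is not being released.
lsof -p "$PID" | awk '{print $NF}' | sort | uniq -c | sort -rn | head -20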

MathewLeung commented 4 years ago

Thank you for the idea @essiembre

Would this indicate it is not related to a file lock? (lsof is not available in the container.)

$ docker exec 3b925e21a07d cat /proc/sys/fs/file-nr
2624    0       1634747
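
For reference, the per-process fd directory under /proc gives the same kind of information without lsof (per process, unlike the system-wide file-nr counters); a rough sketch, with the PID lookup being an assumption:

# Per-process descriptor count without lsof, using the /proc fd directory
# (unlike /proc/sys/fs/file-nr, which is system-wide, this is per process).
# The pgrep lookup is an assumption; any other way of finding the JVM PID works too.
PID=$(pgrep -f collector-http)
ls /proc/"$PID"/fd | wc -l

# Show what the descriptors point to (regular files, sockets, pipes).
ls -l /proc/"$PID"/fd | awk 'NR > 1 {print $NF}' | sort | uniq -c | sort -rn | head -20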

I will need to find a way to try to reproduce the problem, which will be quite tricky. The crawler ran in a VM and was configured to restart between each full domain crawl (with an attached disk to store the state). This problem appeared on 2 separate VMs (there are a few others) with different config files, after the 9th and 14th crawl respectively, so I believe they might have hit a similar situation.

Apart from trying the XMLFileCommitter, would you recommend enabling debug logs? I am looking at all possible config changes before re-running, as it took more than a week before this problem showed up on two of the VMs. Thanks again.

essiembre commented 4 years ago

Debug-level logs may help, but I doubt they will tell you which class does not want to let go of a file (if that is the issue). Worth a try.
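
For reference, a sketch of how the existing com.norconex loggers could be bumped to DEBUG; the log4j.properties location is an assumption based on the stock 2.x distribution layout:

# Bump every existing com.norconex logger from INFO to DEBUG before the next run.
# The log4j.properties location is an assumption based on the stock 2.x layout.
sed -i 's/^\(log4j\.logger\.com\.norconex[^=]*\)=INFO/\1=DEBUG/' \
    "${CRAWLER_HOME}/log4j.properties"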

Unfortunately, I am not sure what we can conclude from your file-nr command. If I am not mistaken, the second value is supposed to be the number of free file handles. Zero would be alarming, but according to the online literature it has apparently always been zero since Linux kernel 2.6. You could increase the ulimit to make it unlimited as a test, but I would not expect much from that one.
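
For example, roughly how the limits could be inspected and raised as a test (the image name below is a placeholder for the actual setup):

# Inspect the soft and hard open-file limits the container currently runs with.
docker exec 3b925e21a07d bash -c 'ulimit -Sn; ulimit -Hn'

# As a test, restart the container with a much higher nofile limit
# (the image name and other run options are placeholders for the actual setup).
docker run --ulimit nofile=65535:65535 my-crawler-image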

Is something else running on that VM? Maybe different processes competing for resources caused this corrupted state; that could explain the randomness of it.

Can you monitor resource consumption (CPU, RAM, Disk...) and go back and see what resources look like when this happens?
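
For example, a crude sampling loop on the docker host would keep that history around (the log file path below is arbitrary):

# Sample container CPU/RAM and disk usage every minute from the docker host,
# so the numbers at the moment of a stall can be reviewed afterwards.
# The log file path is arbitrary.
while true
do
    date >> /var/log/crawler-resources.log
    docker stats --no-stream 3b925e21a07d >> /var/log/crawler-resources.log
    df -h >> /var/log/crawler-resources.log
    sleep 60
done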

MathewLeung commented 4 years ago

I think you are right about the output of file-nr, as the other healthy VMs (the docker host is running Linux 4.19.112+) all return 0 for the second value.

I will install the necessary tools (lsof, procps, top, etc.) to help troubleshoot when this problem shows up again in the container, and hope to reproduce the issue quickly, within days rather than weeks.
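
For reference, a sketch of the kind of commands that would cover this (package names assume a Debian/Ubuntu-based image, and jstack requires a JDK rather than a plain JRE):

# Install the diagnostic tools inside the running container
# (assumes a Debian/Ubuntu-based image).
docker exec 3b925e21a07d apt-get update
docker exec 3b925e21a07d apt-get install -y lsof procps

# When the collector stalls, a thread dump shows which non-daemon threads are
# keeping the JVM alive (jstack ships with the JDK, not the JRE).
docker exec 3b925e21a07d sh -c 'jstack $(pgrep -f collector-http) > /tmp/threads.txt'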

We do have headless crawling enabled (we replaced PhantomJS with Puppeteer) and did not notice anything suspicious in the log output. But since you suspect this could be file related, I will revisit and keep a closer eye on that area.
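
For example, a quick check for leftover browser or node processes inside the container (container id reused from the earlier command):

# Look for leftover headless-browser or node processes inside the container;
# if the collector is still waiting on an external process it launched,
# that alone can keep the JVM from exiting.
docker exec 3b925e21a07d ps aux | grep -E 'chrome|chromium|node' | grep -v grep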

I will provide more details the next time this problem happens; without the correct tools I am not able to supply more detailed info right now. Could you please leave this issue open, as it may take some time before the problem shows up again? Thanks again for your help, Pascal, and thanks for linking the other issue I raised against the committer.

MathewLeung commented 4 years ago

Unable to reproduce this, closing issue. Thank you for your time @essiembre

FYI: The only change made while troubleshooting this issue was the upgrade of Puppeteer from 5.0.0 to 5.2.1, along with the Node.js runtime required to support Puppeteer 5.2.1.