Tough one. The only pointer that comes to mind is a file lock not being released. Can you reproduce with any other committer (like XMLFileCommitter)? If so, please attach a config that can reproduce it. I would investigate whether any file handles (or connections) are not being released when you see that behavior. On Linux, given the process id, you can find how many open file descriptors that process is using with lsof. There are other tools as well, I am sure.
If that is the problem, I recommend investigating with https://file-leak-detector.kohsuke.org/ to troubleshoot which part of the code is responsible.
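For example, something along these lines should give you the count (an illustrative sketch only; <PID> stands for the collector's Java process id):
$ lsof -p <PID> | wc -l   # rough count of open file descriptors for the process
$ lsof -p <PID>           # full listing, to see which files or sockets are held open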
Thank you for the idea @essiembre
Would this indicate it is not related to a file lock? (lsof is not available in the container.)
$ docker exec 3b925e21a07d cat /proc/sys/fs/file-nr
2624 0 1634747
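If lsof cannot be installed in the container, I could presumably get a rough per-process count straight from /proc instead (a sketch only; <PID> would be the collector's Java process id inside the container):
$ docker exec 3b925e21a07d sh -c 'ls /proc/<PID>/fd | wc -l'   # one entry per open file descriptor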
I will need to find a way to try to reproduce the problem, which will be quite tricky. The crawler ran in a VM and was configured to restart between each full domain crawl (with an attached disk to store the states). This problem appeared on 2 separate VMs (there are a few others) with different config files, after the 9th and 14th crawl, so I believe they might have hit a similar situation.
Apart from trying with XMLFileCommitter, would you recommend enabling debug logs? I am looking at all possible config changes before re-running, as it took more than a week before this problem showed up on two of the VMs. Thanks again.
Debug-level logs may help, but I doubt they will tell you which class does not want to let go of a file (if that is the issue). Worth a try.
Unfortunately, I am not sure what we can conclude from your file-nr command. If I am not mistaken, the second value is supposed to be the number of free file handles. Zero would be alarming, but from online literature it is apparently always zero since Linux kernel 2.6. You could increase the ulimit to make it unlimited as a test, but I would not expect much from that one.
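For reference, checking and raising the limit could look roughly like this (illustrative only; the container id, image name, and limit values are placeholders):
$ docker exec <container-id> sh -c 'ulimit -n'      # current per-process open-file limit
$ docker run --ulimit nofile=65535:65535 <image>    # start the container with a higher limit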
Is something else running on that VM? Maybe different processes competing for resources caused this corrupted state; that could explain the randomness of it.
Can you monitor resource consumption (CPU, RAM, disk...) and go back to see what resources look like when this happens?
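Even basic Docker tooling should give you a picture over time (illustrative; the container id is a placeholder):
$ docker stats <container-id>          # live CPU, memory, network and block I/O for the container
$ docker exec <container-id> df -h     # disk usage inside the container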
I think you are right about the output of file-nr, as other healthy VMs (the Docker host is running Linux 4.19.112+) all return 0 for the second value.
Will install the necessary tools (lsof, procps, top, etc.) to help troubleshoot when this problem shows up again in the container, and hope to reproduce the issue within days rather than weeks.
We do have headless crawling enabled (replaced phantomjs with puppeteer) and did not notice anything suspicious in the log output. But since you suspect this could be file related, I will revisit and keep a closer eye on that area.
I will provide more details next time this problem happens; without the correct tools I am not able to supply more detailed info. Could you please leave this issue open, as it may take some time before the problem shows up again? Thanks again for your help, Pascal. And thanks for linking the other issue I raised against the committer.
Unable to reproduce this, closing issue. Thank you for your time @essiembre
FYI: the only change made while troubleshooting this issue was the upgrade of Puppeteer from 5.0.0 to 5.2.1, and of the Node.js runtime to support Puppeteer 5.2.1.
Hi
We have observed a periodic problem where the collector process does not exit even though the log indicates the crawl has completed. The collector (v2.9.1) runs in a Docker container based on the instructions found here. In the start script we have a line that echoes a success or error when the collector process returns, but occasionally we do not see it return. Please see the attached log entries below.
start.sh
Is there a good explanation for why this happens? The collector uses the GoogleCloudSearchCommitter to create index items for the crawled pages; I wonder if the committer plays a role here.
The start.sh contains some scheduling logic to kick off the crawl process regularly, so not receiving the return signal from the process breaks the continuous crawl design. Any ideas for troubleshooting this problem would be a great help, cheers.
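To illustrate the shape of the wrapper (a simplified sketch only, not the actual start.sh attachment; paths and the launch command are illustrative):
#!/bin/sh
# Simplified sketch of the scheduling wrapper (illustrative only)
while true; do
  if ./collector-http.sh -a start -c /config/crawler-config.xml; then
    echo "Crawl finished successfully"
  else
    echo "Crawl exited with an error"
  fi
  sleep 3600   # wait before kicking off the next full crawl
done
If the collector process never exits, the script blocks on that call and the next scheduled crawl never starts.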