Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or a filesystem and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Too many open files #99

Closed OkkeKlein closed 9 years ago

OkkeKlein commented 9 years ago

assets: 2015-04-29 09:01:09 FATAL - assets: An error occured that could compromise the stability of the crawler. Stopping excution to avoid further issues...
com.norconex.jef4.JEFException: Cannot persist status update for job: assets
    at com.norconex.jef4.suite.JobSuite$2.statusUpdated(JobSuite.java:359)
    at com.norconex.jef4.status.JobStatusUpdater.setNote(JobStatusUpdater.java:61)
    at com.norconex.collector.core.crawler.AbstractCrawler.setProgress(AbstractCrawler.java:430)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:374)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: /opt/spidex/./progress/latest/status/assets__assets.job (Too many open files)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
    at com.norconex.jef4.status.FileJobStatusStore.write(FileJobStatusStore.java:150)
    at com.norconex.jef4.suite.JobSuite$2.statusUpdated(JobSuite.java:355)
    ... 7 more

essiembre commented 9 years ago

Do you run many threads? Did this happen when using PDFBox to parse PDFs? File handles should be closed properly by the Collector (and Importer), so they should not accumulate over time (unless we missed some). On the other hand, we did find file handles that were not closed in the PDFBox code, hence the question about PDFs.

Can you reproduce it consistently? If so, can you provide your config, details of your OS, the number of docs crawled when this happens, and any pattern you may have seen that triggers this?

If you are running Linux or Unix, you can check whether the maximum number of open file handles per user/process is too low with ulimit and raise that limit with the same tool.
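For reference, a minimal shell sketch of that check; the 65536 value is only an illustrative assumption, so pick a limit that suits your environment:

    # Show the current per-process limit on open file descriptors
    ulimit -n

    # Raise it for the current shell session (illustrative value; persistent changes
    # typically go in /etc/security/limits.conf and may require admin privileges)
    ulimit -n 65536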

OkkeKlein commented 9 years ago

2-4 threads on a Linux machine using PDFBox.

Only seen it once, so doubt I can reproduce.

essiembre commented 9 years ago

OK, we'll assume it is PDFBox-specific then, and we'll either try to reproduce and resolve it or wait for the PDFBox team to stabilize the 2.0.0 version (whichever comes first).

essiembre commented 9 years ago

FYI, a new snapshot release is available with an updated PDFBox snapshot that had quite a few fixes made to it. I saw in the PDFBox commit history that some of the fixes involved closing streams that were previously left open. So it may be worth trying again in case it fixes your "too many open files" error.

OkkeKlein commented 9 years ago

The problem happened again. Re-crawling the PDF that led to the problem, with 2 threads, didn't result in any issues. An OOM also occurred, so maybe things get messed up by the OOM?

I have a ulimit of "unlimited", btw.

essiembre commented 9 years ago

Reporting on latest progress:

The "Too many open files" issue is not linked to the memory error. It is PDFBox creating way too many tmp files when parsing PDF files, especially large ones. It is a problem that occurs on Linux/Unix only and can be bypassed for now by using an external parser, like pdftotext.

The PDFBox team are aware of this issue and they seem to be actively working on it: https://issues.apache.org/jira/browse/PDFBOX-2301

We'll keep investigating for a resolution on our end as well.

essiembre commented 9 years ago

Good news: I got rid of the "Too many open files" issue in a new Importer snapshot release by ensuring no temp files are created on disk anymore (they were previously used to preserve memory). This means entire PDFs and all their parsed resources will be loaded in memory. To make sure no file gets created at all, you have to set high enough values (in bytes) in your <importer> configuration for these settings:

<maxFileCacheSize></maxFileCacheSize>
<maxFilePoolCacheSize></maxFilePoolCacheSize>

With large enough values, you should notice a huge speed difference.
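As an illustration only (the byte values below are assumptions, not recommendations; size them to the largest documents you expect), such a configuration could look like this:

    <importer>
      <!-- Illustrative values only; consult the Importer documentation for the
           exact meaning and sizing guidance of each setting -->
      <maxFileCacheSize>100000000</maxFileCacheSize>          <!-- ~100 MB -->
      <maxFilePoolCacheSize>1000000000</maxFilePoolCacheSize> <!-- ~1 GB -->
    </importer>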

The downside is that this can bring back the OOM error if you process several large PDFs at once (when parsed, a PDF takes much more space in memory than its file representation on disk). Increase the JVM heap size if such a problem occurs.
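For completeness, a hedged sketch of raising the heap; the -Xmx flag is standard, but the exact launch command below (classpath, main class, arguments) is an assumption meant to mirror a typical 2.x collector start script, so adapt it to your own setup:

    # Illustrative only: give the JVM 4 GB of heap (arbitrary example value)
    java -Xmx4g -cp "./lib/*:./classes" \
        com.norconex.collector.http.HttpCollector -a start -c /path/to/your-config.xml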

Also, even if you allocated enough heap to support huge files, when such a file is encountered, you will still see high CPU usage and it may take quite a while to process it. The PDFBox library does not seem optimized to extract text from large files.

I will keep digging for a more elegant solution (one that is fast, does not create tons of files, and is memory conscious). In the meantime though, this solution should get you going.

Make sure to copy all libraries from this new importer snapshot and remove duplicate libs.

OkkeKlein commented 9 years ago

I thought I was using pdftotext, but the OOM error (now in Dropbox) shows PDFBox.

essiembre commented 9 years ago

The latest snapshot version of the Importer includes an updated version of PDFBox that creates only one scratch file per PDF (instead of thousands when processing large PDFs). The Importer was therefore changed to use scratch files again, instead of storing everything in memory. This should resolve the "too many open files" problem and, at the same time, should prevent OOM errors (tracked here: https://github.com/Norconex/importer/issues/9).

essiembre commented 9 years ago

The official Norconex HTTP Collector 2.2.0 release is out, and it includes this fix. You can download it here.