Do you run many threads? Did this happen when using PDFBox to parse PDFs? File handles should be closed properly by the Collector (and Importer), so they should not accumulate over time (unless we missed some). On the other hand, we did find file handles left open in the PDFBox code, hence the question about PDFs.
Can you reproduce this consistently? If so, can you provide your config, details of your OS, the number of docs crawled when this happens, and any pattern you may have seen that triggers it?
If you are running Linux or Unix, you can check whether the maximum number of file handles per user/process is too low with ulimit, and increase that maximum with the same tool.
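For instance, a quick check and a temporary raise could look like this (65536 is an illustrative value; defaults and hard limits vary by distribution):

```sh
# Show the current max number of open file descriptors for this shell/user
ulimit -n

# Raise it for the current session (must not exceed the hard limit; see ulimit -Hn)
ulimit -n 65536
```

To make such a change permanent, the limit is typically set in /etc/security/limits.conf (or an equivalent mechanism on your system).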
2-4 threads on a Linux machine using PDFBox.
Only seen it once, so doubt I can reproduce.
OK, we'll assume it is PDFBox-specific then and we'll try to either reproduce and resolve or wait for the PDFBox team to stabilize the 2.0.0 version (whichever comes first).
FYI, a new snapshot release is available with an updated PDFBox snapshot that includes quite a few fixes. I saw in the PDFBox commit history that some of them involve closing streams that were previously left open. So it may be worth trying again in case it fixes your "too many open files" error.
The problem happened again. With 2 threads, re-crawling the PDF that led to the problem didn't result in any issues. An OOM also occurred, so maybe things get messed up by the OOM?
I have a ulimit of "unlimited", btw.
Reporting on latest progress:
The "Too many open files" issue is not linked to the memory error. It is PDFBox creating way too many tmp files when parsing PDF files, especially large ones. It is a problem that occurs on Linux/Unix only and can be bypassed for now by using an external parser, like pdftotext.
The PDFBox team is aware of this issue and seems to be actively working on it: https://issues.apache.org/jira/browse/PDFBOX-2301
We'll keep investigating for a resolution on our end as well.
Good news: I got rid of the "Too many open files" issue in a new Importer snapshot release by ensuring no temp files are created on disk anymore. This means entire PDFs and all their parsed resources will be loaded in memory. To make sure no file gets created at all, set high enough values (in bytes) for these settings in your <importer> configuration:
```xml
<maxFileCacheSize></maxFileCacheSize>
<maxFilePoolCacheSize></maxFilePoolCacheSize>
```
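For example, to keep documents of up to roughly 100 MB each fully in memory (values are in bytes and purely illustrative; my understanding is that the first setting is per document and the second covers the whole pool across threads, so size them to your largest expected PDFs and your heap):

```xml
<importer>
  <maxFileCacheSize>104857600</maxFileCacheSize>          <!-- ~100 MB per document -->
  <maxFilePoolCacheSize>1073741824</maxFilePoolCacheSize> <!-- ~1 GB total -->
  <!-- rest of your importer configuration -->
</importer>
```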
With large enough values, you should notice a huge speed difference.
The downside is that this can bring back the OOM error if you process several large PDFs at once (when parsed, a PDF takes much more space in memory than its file representation on disk). Increase the JVM heap size if such a problem occurs.
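As a sketch, assuming you launch the Collector with java directly (if you use the provided launch script, add the -Xmx option to the java invocation inside it; the class name and arguments here follow the HTTP Collector 2.x conventions):

```sh
# Start the crawler with a 4 GB heap (illustrative value)
java -Xmx4g -Dfile.encoding=UTF8 -cp "./lib/*:./classes" \
    com.norconex.collector.http.HttpCollector -a start -c myconfig.xml
```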
Also, even if you allocate enough heap to support huge files, you will still see high CPU usage when such a file is encountered, and it may take quite a while to process. The PDFBox library does not seem optimized for extracting text from large files.
I will keep digging for a more elegant solution (one that is fast, does not create tons of files, and is memory-conscious all at once). In the meantime though, this solution should get you going.
Make sure to copy all libraries from this new importer snapshot and remove duplicate libs.
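Something along these lines (paths and version are illustrative; the duplicate check is a manual eyeball of the sorted listing, deleting the older of any library present twice):

```sh
# Copy the snapshot's jars into the collector's lib folder...
cp norconex-importer-SNAPSHOT/lib/*.jar /opt/collector/lib/

# ...then list them sorted to spot the same library in two versions
ls /opt/collector/lib | sort
```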
I thought I was using pdftotext, but the OOM stack trace (now in Dropbox) shows PDFBox.
The latest snapshot version of the Importer includes an updated version of PDFBox that creates only one scratch file per PDF (instead of thousands when processing large PDFs). So the Importer was changed to use scratch files again instead of storing everything in memory. This should resolve the "too many open files" problem and, at the same time, prevent OOM errors (tracked here: https://github.com/Norconex/importer/issues/9).
```
assets: 2015-04-29 09:01:09 FATAL - assets: An error occured that could compromise the stability of the crawler. Stopping excution to avoid further issues...
com.norconex.jef4.JEFException: Cannot persist status update for job: assets
	at com.norconex.jef4.suite.JobSuite$2.statusUpdated(JobSuite.java:359)
	at com.norconex.jef4.status.JobStatusUpdater.setNote(JobStatusUpdater.java:61)
	at com.norconex.collector.core.crawler.AbstractCrawler.setProgress(AbstractCrawler.java:430)
	at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:374)
	at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: /opt/spidex/./progress/latest/status/assets__assets.job (Too many open files)
	at java.io.FileOutputStream.open(Native Method)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
	at com.norconex.jef4.status.FileJobStatusStore.write(FileJobStatusStore.java:150)
	at com.norconex.jef4.suite.JobSuite$2.statusUpdated(JobSuite.java:355)
	... 7 more
```