Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

OOM crawling PDF. #267

Closed OkkeKlein closed 8 years ago

OkkeKlein commented 8 years ago

```
DEBUG [CachedInputStream] Deleted cache file: /tmp/CachedInputStream-4437588446019026996-temp
INFO  [AbstractCrawler] My Crawler Name: Deleting orphan references (if any)...
Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: Java heap space
	at java.awt.image.DataBufferByte.<init>(DataBufferByte.java:92)
	at java.awt.image.ComponentSampleModel.createDataBuffer(ComponentSampleModel.java:445)
	at sun.awt.image.ByteInterleavedRaster.<init>(ByteInterleavedRaster.java:90)
	at sun.awt.image.ByteInterleavedRaster.createCompatibleWritableRaster(ByteInterleavedRaster.java:1281)
	at sun.awt.image.ByteInterleavedRaster.createCompatibleWritableRaster(ByteInterleavedRaster.java:1292)
	at org.apache.pdfbox.filter.DCTFilter.fromBGRtoRGB(DCTFilter.java:245)
	at org.apache.pdfbox.filter.DCTFilter.decode(DCTFilter.java:170)
	at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
	at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
	at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:235)
	at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:147)
	at org.apache.pdfbox.pdmodel.graphics.PDXObject.createXObject(PDXObject.java:70)
	at org.apache.pdfbox.pdmodel.PDResources.getXObject(PDResources.java:385)
	at org.apache.tika.parser.pdf.EnhancedPDF2XHTML.extractImages(EnhancedPDF2XHTML.java:347)
	at org.apache.tika.parser.pdf.EnhancedPDF2XHTML.endPage(EnhancedPDF2XHTML.java:245)
	at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
	at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
	at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
	at org.apache.tika.parser.pdf.EnhancedPDF2XHTML.process(EnhancedPDF2XHTML.java:158)
	at org.apache.tika.parser.pdf.EnhancedPDFParser.parse(EnhancedPDFParser.java:168)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
	at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:432)
	at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:166)
	at com.norconex.importer.Importer.parseDocument(Importer.java:422)
	at com.norconex.importer.Importer.importDocument(Importer.java:318)
	at com.norconex.importer.Importer.doImportDocument(Importer.java:271)
	at com.norconex.importer.Importer.importDocument(Importer.java:195)
	at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
	at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
```

essiembre commented 8 years ago

Try increasing the JVM memory by adding the -Xms and -Xmx arguments to the java command in the Collector execution script. See example here.
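As a hypothetical sketch (the exact script contents, variable names, and main class depend on your Collector version, so adapt it to your own launch script):

```shell
# Excerpt of a Collector launch script (e.g. collector-http.sh) -- illustrative only.
# Adding -Xms/-Xmx raises the initial and maximum JVM heap sizes:
java -Xms512m -Xmx2048m \
     -cp "./lib/*:./classes" \
     com.norconex.collector.http.HttpCollector "$@"
```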

Also, do you have many threads running? It may be that too much is being processed at once at any given time. Try lowering the number of threads.
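For instance, in the crawler section of the collector configuration (a minimal sketch; the value is illustrative):

```xml
<!-- Inside a <crawler> (or <crawlerDefaults>) section of the collector config.
     Fewer threads means fewer documents, and fewer large images, held in
     memory at the same time. -->
<numThreads>2</numThreads>
```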

If it still fails after making these changes, please attach your file for troubleshooting.

OkkeKlein commented 8 years ago

It seems that OCR was causing the OOM.

essiembre commented 8 years ago

When enabling OCR, inline image extraction is performed on PDFs, so it may very well be the parsing of images that causes this. See this thread for more details: https://github.com/Norconex/importer/issues/19.
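For reference, OCR is turned on through the `<ocr>` element of the Importer's GenericDocumentParserFactory; the fragment below is an approximate sketch from the Importer 2.x configuration docs (element names and attributes should be verified against your version). Removing the `<ocr>` block disables OCR, and with it the PDF inline-image extraction:

```xml
<!-- Importer section of the collector config (approximate; verify against
     the Importer documentation for your version). -->
<importer>
  <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
    <!-- The presence of <ocr> enables OCR, which forces image extraction. -->
    <ocr path="/usr/local/bin/"> <!-- path to the Tesseract installation -->
      <languages>eng</languages>
      <contentTypes>application/pdf</contentTypes>
    </ocr>
  </documentParserFactory>
</importer>
```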

In your case, did the memory increase fix the issue so we can close this ticket? Or does it happen no matter how much memory you give it? If the latter, we should investigate how Tika does its image parsing to try to make it more efficient.

OkkeKlein commented 8 years ago

I raised heap to 1024m without improvement. Then I disabled OCR and it worked again.

I can send you the culprit to test if you want.

essiembre commented 8 years ago

It would be great if you could attach your file (or send it via email if confidential).

essiembre commented 8 years ago

I was able to confirm it is the image extraction that causes the problem. Images are not extracted by default when parsing PDFs; OCR enables image extraction (otherwise there would be nothing to OCR).

It turns out to be a tricky one to resolve since the issue is in the third-party library used to parse many file types (Apache Tika). I found this warning about extracting images in the Tika API, on PDFParserConfig.html#setExtractInlineImages(boolean):

> If true, extract inline embedded OBXImages. Beware: some PDF documents of modest size (~4MB) can contain thousands of embedded images totaling > 2.5 GB. Also, at least as of PDFBox 1.8.5, there can be surprisingly large memory consumption and/or out of memory errors. Set to true with caution.

I will investigate, when I get a chance, whether anything can be done to work around this issue, but we may have to open a ticket on the Tika project.

essiembre commented 8 years ago

More troubleshooting shows it is the Java Image IO library that ultimately tries to load the entire uncompressed image in memory. I think the fix needs to happen in the PDFBox DCTFilter#decode(...) method. A few similar memory issues have been reported to Tika/PDFBox, but they have not provided a definitive fix yet.

In the meantime, I tried giving it more memory, and 2GB did it for me (-Xmx2000m) on a 64-bit JVM when using the Importer alone. Do you have that much memory available? Since other activities may take place at the same time, I would set it as high as you can (e.g. 3G) in case you end up processing more than one huge image at the same time (you can eliminate that risk by using only 1 thread -- probably not ideal).

Can you please try with a higher memory setting? If that still does not resolve it, do you agree to have the file you sent me shared with Tika/PDFBox projects?

An approach I am contemplating would be to detect the image dimensions, estimate how big the image would be uncompressed, and skip extracting it from the PDF if it is too big. Before putting such a workaround in place, I would rather see if a fix in the third-party library is possible.
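The idea could be sketched like this (illustrative code, not Norconex or Tika source; the class and threshold are made up for the example). An uncompressed raster takes roughly width × height × bytes-per-pixel, which is why a small JPEG inside a PDF can still blow the heap:

```java
// Sketch of a size guard for embedded images: estimate the decompressed
// footprint from the dimensions before deciding whether to extract.
public class ImageSizeGuard {

    // Rough upper bound in bytes for an uncompressed raster:
    // width * height * bytesPerPixel (3 for 8-bit RGB, 4 with alpha).
    static long estimateUncompressedBytes(int width, int height, int bytesPerPixel) {
        return (long) width * height * bytesPerPixel;
    }

    // Extract only when the estimate fits under a configurable budget.
    static boolean shouldExtract(int width, int height, int bytesPerPixel, long maxBytes) {
        return estimateUncompressedBytes(width, height, bytesPerPixel) <= maxBytes;
    }

    public static void main(String[] args) {
        // A 20000x20000 RGB scan decompresses to ~1.2 GB -- enough to
        // exhaust a default heap even though the PDF itself is small.
        long bytes = estimateUncompressedBytes(20000, 20000, 3);
        System.out.println(bytes);                                              // 1200000000
        System.out.println(shouldExtract(20000, 20000, 3, 256L * 1024 * 1024)); // false
    }
}
```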

OkkeKlein commented 8 years ago

I decided to disable OCR, as the problems with it are not worth the extra text it gathers. It was more of a feature test.

Feel free to share any insights you have about the issue with PDFBox/Tika. Please contact me if you need to share the file and I will ask my client.

essiembre commented 8 years ago

OK then. I will close this for now since giving it enough heap works. If it becomes too frequent/problematic, please re-open, and I suggest we submit your file as a test case to either Tika or PDFBox.