PDF parsing - TikaException: zip bomb detected

jetnet commented 8 years ago

Hi! Some PDFs still cannot be parsed:

www.db.com: 2016-01-18 11:54:17 WARN - Could not import https://www.db.com/ir/en/download/DB_Interim_Report_1Q2015.pdf
com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: Zip bomb detected!
        at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:169)
        at com.norconex.importer.Importer.parseDocument(Importer.java:422)
        at com.norconex.importer.Importer.importDocument(Importer.java:318)
        at com.norconex.importer.Importer.doImportDocument(Importer.java:271)
        at com.norconex.importer.Importer.importDocument(Importer.java:195)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
        at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:298)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:487)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:377)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:723)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.tika.exception.TikaException: Zip bomb detected!
        at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:123)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
        at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:432)
        at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:166)
        ... 14 more
Caused by: org.apache.tika.sax.SecureContentHandler$SecureSAXException: Suspected zip bomb: 100 levels of XML element nesting
        at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:234)
        at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
        at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
        at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
        at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
        at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
        at org.apache.tika.parser.pdf.EnhancedPDF2XHTML.writeParagraphStart(EnhancedPDF2XHTML.java:426)
        at org.apache.pdfbox.text.PDFTextStripper.handleLineSeparation(PDFTextStripper.java:1448)
        at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:674)
        at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
        at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
        at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
        at org.apache.tika.parser.pdf.EnhancedPDF2XHTML.process(EnhancedPDF2XHTML.java:143)
        at org.apache.tika.parser.pdf.EnhancedPDFParser.parse(EnhancedPDFParser.java:168)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        ... 17 more

Is this something that can be configured, such as the number of nested XML elements or the input/output byte ratio? Thank you!

essiembre commented 8 years ago

The issue is as you suspected: there are too many nested elements for the maximum supported by Tika. I increased the maximum in a copy of the faulty class until a more formal fix is provided by the Tika team (Tika currently does not make this maximum configurable).

To try the fix, please replace the norconex-importer-[VERSION].jar in your installation with the one from the latest Importer snapshot release.

I updated the corresponding Tika ticket with this issue: https://issues.apache.org/jira/browse/TIKA-741
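
For anyone wondering what the "100 levels of XML element nesting" check amounts to, here is a minimal sketch of a SAX nesting guard using only JDK classes. It is not Tika's SecureContentHandler: the class name DepthGuardDemo, the helper emitParagraphs, and the exact comparison against the limit are assumptions made for illustration. The trace above shows the limit tripping inside writeParagraphStart, so the sketch contrasts paragraphs that are closed with paragraphs that are only ever opened; whether that is the actual cause here is an inference on my part, not something the trace proves.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a nesting guard; NOT Tika's SecureContentHandler. */
class DepthGuardDemo {

    /** Counts open elements and fails once the assumed limit is reached. */
    static class DepthGuard extends DefaultHandler {
        private final int maxDepth;
        private int depth = 0;

        DepthGuard(int maxDepth) {
            this.maxDepth = maxDepth;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            depth++;
            // Assumption: the guard trips when the depth reaches the limit.
            if (depth >= maxDepth) {
                throw new SAXException("Suspected zip bomb: " + depth
                        + " levels of XML element nesting");
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            depth--;
        }
    }

    /**
     * Emits `count` paragraph elements into a fresh guard.
     * Returns true if the guard rejected the event stream.
     */
    static boolean emitParagraphs(int count, boolean closeEach, int maxDepth) {
        DepthGuard guard = new DepthGuard(maxDepth);
        Attributes none = new AttributesImpl();
        try {
            for (int i = 0; i < count; i++) {
                guard.startElement("", "p", "p", none);
                if (closeEach) {
                    guard.endElement("", "p", "p");
                }
            }
            return false;
        } catch (SAXException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Balanced paragraphs never nest, so the guard stays quiet.
        System.out.println(emitParagraphs(500, true, 100));  // prints "false"
        // Paragraphs that are opened but never closed pile up and trip it.
        System.out.println(emitParagraphs(500, false, 100)); // prints "true"
    }
}
```

The point of the sketch is that raising the limit only delays the failure when the event stream is unbalanced, whereas closing each element keeps the depth constant no matter how long the document is.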

tballison commented 8 years ago

I think this is an issue with Norconex's EnhancedPDF2XHTML class; see TIKA-741 for the recommended modification. Give that a try with the max set to 100 and let us know if you're good to go.

Yes, I just tested removing those lines from our code, and I hit the zip bomb exception.

essiembre commented 8 years ago

That did it, thanks! I committed the fix in our Importer module and will create a new release of Importer a bit later.

essiembre commented 8 years ago

I just made a new HTTP Collector snapshot release with the updated importer. @jetnet, please give it a try and confirm.

jetnet commented 8 years ago

Seems to be working with norconex-collector-http-2.4.0-20160223.174916-28.zip. Thanks! I'm going to start a full crawl to check whether all PDFs can be parsed now.

essiembre commented 8 years ago

2.4.0 has now been officially released with this fix. Please create a new ticket if the issue persists.

PDF parsing - TikaException: zip bomb detected #221