Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

PDF parsing - TikaException: zip bomb detected #221

Closed jetnet closed 8 years ago

jetnet commented 8 years ago

Hi! Some PDFs still cannot be parsed:

www.db.com: 2016-01-18 11:54:17 WARN - Could not import https://www.db.com/ir/en/download/DB_Interim_Report_1Q2015.pdf
com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: Zip bomb detected!
        at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:169)
        at com.norconex.importer.Importer.parseDocument(Importer.java:422)
        at com.norconex.importer.Importer.importDocument(Importer.java:318)
        at com.norconex.importer.Importer.doImportDocument(Importer.java:271)
        at com.norconex.importer.Importer.importDocument(Importer.java:195)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
        at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:298)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:487)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:377)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:723)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.tika.exception.TikaException: Zip bomb detected!
        at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:123)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
        at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:432)
        at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:166)
        ... 14 more
Caused by: org.apache.tika.sax.SecureContentHandler$SecureSAXException: Suspected zip bomb: 100 levels of XML element nesting
        at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:234)
        at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
        at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
        at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
        at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
        at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
        at org.apache.tika.parser.pdf.EnhancedPDF2XHTML.writeParagraphStart(EnhancedPDF2XHTML.java:426)
        at org.apache.pdfbox.text.PDFTextStripper.handleLineSeparation(PDFTextStripper.java:1448)
        at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:674)
        at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
        at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
        at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
        at org.apache.tika.parser.pdf.EnhancedPDF2XHTML.process(EnhancedPDF2XHTML.java:143)
        at org.apache.tika.parser.pdf.EnhancedPDFParser.parse(EnhancedPDFParser.java:168)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        ... 17 more

Is this something that can be configured, such as the number of nested XML elements or the ratio of input to output bytes? Thank you!

essiembre commented 8 years ago

The issue is as you suspected: there are too many nested elements for the maximum supported by Tika. I increased the maximum in a copy of the faulty class until a more formal fix is provided by the Tika team (Tika currently does not make this maximum configurable).
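For readers unfamiliar with this kind of check, the following is a minimal, self-contained sketch of a nesting-depth guard on a SAX handler, the general mechanism behind the "levels of XML element nesting" message in the trace above. It is not Tika's `SecureContentHandler` code: the names `DepthGuardDemo`, `DepthGuardHandler`, `DepthExceededException`, `nested`, and `exceedsLimit` are hypothetical, and the limit of 10 is chosen only to keep the demo small.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class DepthGuardDemo {

    /** Hypothetical exception type, so depth violations can be told apart from other SAX errors. */
    static class DepthExceededException extends SAXException {
        DepthExceededException(String message) {
            super(message);
        }
    }

    /** Hypothetical handler: tracks how many elements are currently open and aborts past maxDepth. */
    static class DepthGuardHandler extends DefaultHandler {
        private final int maxDepth;
        private int depth;

        DepthGuardHandler(int maxDepth) {
            this.maxDepth = maxDepth;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            depth++;
            if (depth > maxDepth) {
                throw new DepthExceededException(
                        "Suspected zip bomb: " + depth + " levels of XML element nesting");
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // An element that is never closed keeps the depth growing, which is
            // how unbalanced start/end events can trip the guard on harmless input.
            depth--;
        }
    }

    /** Builds a well-formed document with the given number of nested <p> elements. */
    static String nested(int levels) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < levels; i++) {
            sb.append("<p>");
        }
        for (int i = 0; i < levels; i++) {
            sb.append("</p>");
        }
        return sb.toString();
    }

    /** Returns true if parsing the XML nests deeper than maxDepth. */
    static boolean exceedsLimit(String xml, int maxDepth) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            parser.parse(new InputSource(new StringReader(xml)), new DepthGuardHandler(maxDepth));
            return false;
        } catch (DepthExceededException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(exceedsLimit(nested(5), 10));
        System.out.println(exceedsLimit(nested(11), 10));
    }
}
```

Raising the limit, as described above, only moves the threshold. If the handler receives more start events than end events, the depth keeps climbing regardless of the document's real structure.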

To try the fix, can you replace the norconex-importer-[VERSION].jar in your installation with the one from the latest Importer snapshot release?

I updated the corresponding Tika ticket with this issue: https://issues.apache.org/jira/browse/TIKA-741

tballison commented 8 years ago

I think this is an issue with Norconex's EnhancedPDF2XHTML class. See TIKA-741 for the recommended modification. Give that a try with the max set to 100 and let us know if you're good to go.

Yes, I just tested removing those lines from our code, and I hit the zip bomb exception.

essiembre commented 8 years ago

That did it, thanks! I committed the fix in our Importer module and will create a new release of Importer a bit later.

essiembre commented 8 years ago

I just made a new HTTP Collector snapshot release with the updated importer. @jetnet, please give it a try and confirm.

jetnet commented 8 years ago

It seems to be working with norconex-collector-http-2.4.0-20160223.174916-28.zip. Thanks! I'm going to start a full crawl to check whether all PDFs can be parsed now.

essiembre commented 8 years ago

2.4.0 has now been officially released with this fix. Please create a new ticket if the issue persists.