PDF files are not extracted

wolverline commented 6 years ago

I think this is a Tika issue. I looked into it and it seems it was resolved before. I wonder if you have ever come across this error. The message I get is:

    WARN [Importer] Could not import https://xxx.com/files/pdf_file.pdf
    com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5f1959c2
        at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:154)
        at com.norconex.importer.Importer.parseDocument(Importer.java:414)
        at com.norconex.importer.Importer.importDocument(Importer.java:313)
        at com.norconex.importer.Importer.doImportDocument(Importer.java:266)
        at com.norconex.importer.Importer.importDocument(Importer.java:190)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:812)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5f1959c2
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
        at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:416)
        at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:150)
        ... 14 more
    Caused by: java.lang.IllegalArgumentException: root cannot be null
        at org.apache.pdfbox.pdmodel.PDPageTree.<init>(PDPageTree.java:75)
        at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getPages(PDDocumentCatalog.java:129)
        at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:1398)
        at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:243)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:154)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 19 more

The PDF has an embedded form. Could that be causing the issue?

essiembre commented 6 years ago

I do not recall experiencing this myself. Can you share your PDF to reproduce?

wolverline commented 6 years ago

Hi Pascal,

It seems the Tika library has issues with PDF files that have user-input forms embedded. I tried this PDF: http://foersom.com/net/HowTo/data/OoPdfFormExample.pdf and got the same result.

essiembre commented 6 years ago

Which version are you using? Have you tried the latest? I was able to parse that PDF without issues and its content was extracted properly.

wolverline commented 6 years ago

I'm using the latest version of the importer (2.8.0). I tried it with another site and got the same result. I don't know whether I am missing some configuration that would avoid this issue. I found an issue on the PDFBox project site: https://issues.apache.org/jira/browse/PDFBOX-3849 (it links to a Tika issue that has the same error message). I thought this was fixed but Norconex is using PDFBox tool v2.0.7 and the fix was made with the version (the latest is 2.0.8). I don't know where to look.

essiembre commented 6 years ago

I just tried again and verified the PDFBox version, which is 2.0.7. It works and I attached the output, using the FilesystemCommitter.

Can you share your full config? Is it possible that you have pre-parse handlers that modify your PDF before it gets parsed? When transformers, taggers, and splitters are used as pre-parse handlers, make sure they are configured with the restrictTo option, or they will mess up your binary files.

wolverline commented 6 years ago

Thanks, Pascal. restrictTo did the trick. I was using a tagger to extract the h1 title from the HTML page without setting the restrictTo option. My new config looks like the following:

`

text/html

`
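For reference, here is a minimal sketch of what a pre-parse tagger restricted to HTML content can look like in an Importer 2.x configuration. Only the `text/html` restriction comes from this thread. The tagger class (`TextBetweenTagger`), the start/end patterns, and the target field name `h1title` are assumptions chosen for illustration, not the exact configuration used here.

```xml
<importer>
  <preParseHandlers>
    <!-- Assumption: TextBetweenTagger stands in for whichever tagger extracts the h1 title. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.TextBetweenTagger">
      <!-- Apply this handler only to HTML documents, so PDFs and other
           binary files reach the parser untouched. -->
      <restrictTo caseSensitive="false" field="document.contentType">
        text/html
      </restrictTo>
      <!-- Hypothetical target field name; the patterns are regular expressions. -->
      <textBetween name="h1title">
        <start>&lt;h1[^&gt;]*&gt;</start>
        <end>&lt;/h1&gt;</end>
      </textBetween>
    </tagger>
  </preParseHandlers>
</importer>
```

The `restrictTo` value is matched as a regular expression against the `document.contentType` field, and several `restrictTo` elements can be listed if the handler should apply to more than one content type.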


PDF files are not extracted #450