Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0
183 stars 68 forks

PDF files are not extracted #450

Closed wolverline closed 6 years ago

wolverline commented 6 years ago

I think this is a Tika issue; I looked into it and it seems it was resolved before. Have you ever come across this error? The message I get is:

WARN [Importer] Could not import https://xxx.com/files/pdf_file.pdf
com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5f1959c2
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:154)
    at com.norconex.importer.Importer.parseDocument(Importer.java:414)
    at com.norconex.importer.Importer.importDocument(Importer.java:313)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:266)
    at com.norconex.importer.Importer.importDocument(Importer.java:190)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:812)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5f1959c2
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:416)
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:150)
    ... 14 more
Caused by: java.lang.IllegalArgumentException: root cannot be null
    at org.apache.pdfbox.pdmodel.PDPageTree.<init>(PDPageTree.java:75)
    at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getPages(PDDocumentCatalog.java:129)
    at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:1398)
    at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:243)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:154)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 19 more

The PDF has an embedded form. Does this cause this issue?

essiembre commented 6 years ago

I do not recall experiencing this myself. Can you share your PDF to reproduce?

wolverline commented 6 years ago

Hi Pascal,

It seems the Tika library has issues with PDF files that contain embedded user-input forms. I tried this PDF: http://foersom.com/net/HowTo/data/OoPdfFormExample.pdf and got the same result.

essiembre commented 6 years ago

Which version are you using? Have you tried the latest? I was able to parse that PDF without issues and its content was extracted properly.

wolverline commented 6 years ago

I'm using the latest version of the importer (2.8.0). I tried it with another site and got the same result. I don't know if I am missing some config that would avoid this issue. I found an issue on the PDFBox project site: https://issues.apache.org/jira/browse/PDFBOX-3849 (it links to a Tika issue that has the error message). I thought this was fixed but Norconex is using PDFBox tool v2.0.7 and the fix was made with the version (the latest is 2.0.8). I don't know where to look.

essiembre commented 6 years ago

I just tried again and verified the PDFBox version, which is 2.0.7. It works and I attached the output, using the FilesystemCommitter.

Can you share your full config? Is it possible that you have pre-parse handlers that modify your PDF before it gets parsed? When used as pre-parse handlers, make sure transformers, taggers, and splitters are configured with the restrictTo option, or they will mess up your binary files.

wolverline commented 6 years ago

Thanks, Pascal. restrictTo did the trick. I was using a tagger to extract the h1 title from the HTML page without setting the restrictTo option. My new config looks like the following:

`

text/html

`
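For readers hitting the same error, here is a minimal sketch of a pre-parse tagger limited to HTML with restrictTo. It is an illustration only, not the exact config from this thread: the choice of DOMTagger, the h1 selector, and the title target field are assumptions, and the syntax follows the Importer 2.x handler format as I understand it.

```xml
<importer>
  <preParseHandlers>
    <!-- Assumed handler: a DOM-based tagger that copies the first h1 into a field. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <!-- Apply this handler only to HTML documents. Without restrictTo, a
           pre-parse handler also runs on binaries such as PDFs and can
           alter them before Tika/PDFBox parses them. -->
      <restrictTo caseSensitive="false" field="document.contentType">
        text/html
      </restrictTo>
      <!-- Hypothetical selector and target field, chosen for illustration. -->
      <dom selector="h1" toField="title" overwrite="true" />
    </tagger>
  </preParseHandlers>
</importer>
```

The same restrictTo element would apply to any transformer or splitter placed under preParseHandlers; handlers that only make sense on extracted text can alternatively be moved to postParseHandlers.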