Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

PDF cannot be parsed: EnhancedPDFParser - NullPointerException #216

Closed jetnet closed 8 years ago

jetnet commented 8 years ago

Hi Pascal,

I've got a lot of PDFs that cannot be imported because of a NullPointerException in EnhancedPDFParser, e.g.:

test: 2016-01-11 10:45:23 DEBUG - Could not import https://japan.db.com/docs/150730_2Q_PRESS_RELEASE_J_Final.pdf
com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@2668c55
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:169)
    at com.norconex.importer.Importer.parseDocument(Importer.java:422)
    at com.norconex.importer.Importer.importDocument(Importer.java:318)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:271)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:298)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:487)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:377)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:723)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@2668c55
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
    at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:432)
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:166)
    ... 14 more
Caused by: java.lang.NullPointerException
    at org.apache.tika.parser.pdf.EnhancedPDFParser.extractMultilingualItems(EnhancedPDFParser.java:404)
    at org.apache.tika.parser.pdf.EnhancedPDFParser.extractMetadata(EnhancedPDFParser.java:296)
    at org.apache.tika.parser.pdf.EnhancedPDFParser.parse(EnhancedPDFParser.java:158)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 18 more

Could you please take a look at this? Thanks!

P.S. It would be helpful if such "could not import" messages were logged as warnings rather than at DEBUG level. Thanks!
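
To illustrate the P.S. above, here is a minimal sketch of what logging an import failure as a warning could look like. It is not Norconex code: the class `ImportFailureLogging` and its methods are hypothetical, and it uses `java.util.logging` only to stay self-contained. It walks the "Caused by" chain (as in the trace above, where the innermost cause is the NullPointerException) and reports the root cause on one WARNING line, keeping the full trace at a finer level.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, for illustration only; not part of Norconex or Tika.
public class ImportFailureLogging {

    private static final Logger LOG =
            Logger.getLogger(ImportFailureLogging.class.getName());

    /** Walks the "Caused by" chain down to the innermost exception. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /** Builds a one-line summary: the reference, the root exception type, and its top frame. */
    static String summarize(String reference, Throwable t) {
        Throwable root = rootCause(t);
        StackTraceElement[] frames = root.getStackTrace();
        String where = frames.length > 0 ? " at " + frames[0] : "";
        return "Could not import " + reference + ": "
                + root.getClass().getName() + where;
    }

    /** Logs the short summary as a warning and the full trace at FINE. */
    static void logImportFailure(String reference, Throwable t) {
        LOG.log(Level.WARNING, summarize(reference, t));
        LOG.log(Level.FINE, "Full stack trace for " + reference, t);
    }

    public static void main(String[] args) {
        // Simulate a wrapped failure similar in shape to the trace above.
        Exception failure = new RuntimeException("parser wrapper",
                new IllegalStateException("tika wrapper",
                        new NullPointerException()));
        // The URL is a placeholder, not a document from this issue.
        logImportFailure("https://example.com/sample.pdf", failure);
    }
}
```

With this shape, a default log level would still surface which document failed and why, while the multi-level stack trace stays available when finer logging is enabled.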

jetnet commented 8 years ago

FYI: the "standard" Tika library can parse the mentioned document without any problem: java -jar tika-app-1.11.jar 150730_2Q_PRESS_RELEASE_J_Final.pdf

essiembre commented 8 years ago

The stock Tika library does not have the error, but it uses PDFBox 1.x to parse PDFs. Norconex Importer uses a release candidate of PDFBox 2.x, which fixes several issues found in 1.x.

The problem you found has been fixed. You can try the latest Importer snapshot, which is also included in the latest HTTP Collector snapshot.

Please test and confirm.

jetnet commented 8 years ago

No more NPE! Thank you! Great support as usual! :smile: