Norconex / importer

Norconex Importer is a Java library and command-line application for parsing and extracting content from files of any format (HTML, PDF, Word, etc.) as plain text. In addition, it lets you perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Importing DocumentParserException on docx and pptx #94

Closed: alex-kozlowski-maven closed this issue 5 years ago

alex-kozlowski-maven commented 5 years ago

When trying to run the crawler on an intranet, I am getting: com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser

I have tried to import a specific file independently and it works fine. However, when I let the crawler try to import the document, it fails. I have also added `<documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher" detectContentType="true" detectCharset="true"/>`, and it seems to have no effect.

Here is my config file (the XML tags were stripped when pasting, so only the text values survive):

```
<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE xml>
./output 20 0 true {URL}.docx .*(gif|jpg|jpeg|png|jpe|pcx|tif|css|js)$ ./configs/gcs/sdk-configuration.properties raw ./committer-queue/default
```
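
For context, a rough skeleton of what that maps onto (only the collector/crawler IDs, the fetcher, the filter regex, and the paths come from the config and log output; everything else is a placeholder, and the Google Cloud Search committer class is omitted):

```xml
<httpcollector id="Config HTTP Collector">
  <logsDir>./output/logs</logsDir>
  <progressDir>./output/progress</progressDir>
  <crawlers>
    <crawler id="Default">
      <startURLs>
        <url>{URL}.docx</url>
      </startURLs>
      <!-- other crawler settings (work dir, threads, depth, etc.) omitted -->
      <documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher"
          detectContentType="true" detectCharset="true"/>
      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="exclude">.*(gif|jpg|jpeg|png|jpe|pcx|tif|css|js)$</filter>
      </referenceFilters>
      <committer class="...">
        <!-- Google Cloud Search committer settings; queue dir: ./committer-queue/default -->
      </committer>
    </crawler>
  </crawlers>
</httpcollector>
```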

Here is the debug text from the crawl:

```
Mar 15, 2019 11:40:01 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.

INFO [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=.*(gif|jpg|jpeg|png|jpe|pcx|tif|css|js)$]
INFO [AbstractCollectorConfig] Configuration loaded: id=Config HTTP Collector; logsDir=./output/logs; progressDir=./output/progress
INFO [JobSuite] JEF work directory is: .\output\progress
INFO [JobSuite] JEF log manager is : FileLogManager
INFO [JobSuite] JEF job status store is : FileJobStatusStore
INFO [AbstractCollector] Suite of 1 crawler jobs created.
INFO [JobSuite] Initialization...
INFO [JobSuite] Previous execution detected.
INFO [JobSuite] Backing up previous execution status and log files.
INFO [JobSuite] Starting execution.
INFO [AbstractCollector] Version: Norconex HTTP Collector 2.8.1 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Collector Core 1.9.1 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Importer 2.9.0 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex JEF 4.1.0 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Committer Core 2.1.2 (Norconex Inc.)
INFO [AbstractCollector] Version: Google Cloud Search Norconex HTTP Collector Indexer Plugin v1-0.0.3 (Google Inc.)
INFO [JobSuite] Running Default: BEGIN (Fri Mar 15 11:40:02 CDT 2019)
INFO [HttpCrawler] Default: RobotsTxt support: true
INFO [HttpCrawler] Default: RobotsMeta support: true
INFO [HttpCrawler] Default: Sitemap support: false
INFO [HttpCrawler] Default: Canonical links support: true
INFO [HttpCrawler] Default: User-Agent:
INFO [GenericHttpClientFactory] SSL: Trusting all certificates.
INFO [SitemapStore] Default: Initializing sitemap store...
INFO [SitemapStore] Default: Done initializing sitemap store.
INFO [HttpCrawler] 1 start URLs identified.
INFO [CrawlerEventManager] CRAWLER_STARTED
INFO [AbstractCrawler] Default: Crawling references...
INFO [CrawlerEventManager] DOCUMENT_FETCHED: {URL}.docx
INFO [CrawlerEventManager] CREATED_ROBOTS_META: {URL}.docx
INFO [CrawlerEventManager] URLS_EXTRACTED: {URL}.docx
INFO [DebugTagger] Keep-Alive=timeout=2, max=99
INFO [DebugTagger] Transfer-Encoding=chunked
INFO [DebugTagger] collector.content-type=application/vnd.openxmlformats-officedocument.wordprocessingml.document
INFO [DebugTagger] document.contentFamily=wordprocessor
INFO [DebugTagger] Server=Apache/2.4.6 (Unix) OpenSSL/1.0.1e PHP/5.5.3
INFO [DebugTagger] collector.content-encoding=windows-1252
INFO [DebugTagger] X-Content-Type-Options=nosniff
INFO [DebugTagger] Connection=Keep-Alive
INFO [DebugTagger] document.contentEncoding=windows-1252
INFO [DebugTagger] binaryContent=
INFO [DebugTagger] Date=Fri, 15 Mar 2019 16:40:00 GMT
INFO [DebugTagger] document.reference={URL}.docx
INFO [DebugTagger] X-Frame-Options=SAMEORIGIN
INFO [DebugTagger] Cache-Control=public
INFO [DebugTagger] collector.is-crawl-new=true
INFO [DebugTagger] X-Drupal-Cache=MISS
INFO [DebugTagger] document.contentType=application/vnd.openxmlformats-officedocument.wordprocessingml.document
INFO [DebugTagger] collector.depth=0
INFO [DebugTagger] Expires=Sun, 19 Nov 1978 05:00:00 GMT
INFO [DebugTagger] Content-Language=en
INFO [DebugTagger] X-Powered-By=PHP/5.5.3
INFO [DebugTagger] Content-Type=application/vnd.openxmlformats-officedocument.wordprocessingml.document
WARN [Importer] Could not import {URL}.docx
com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@2988b4e9
	at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:154)
	at com.norconex.importer.Importer.parseDocument(Importer.java:414)
	at com.norconex.importer.Importer.importDocument(Importer.java:313)
	at com.norconex.importer.Importer.doImportDocument(Importer.java:266)
	at com.norconex.importer.Importer.importDocument(Importer.java:190)
	at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
	at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
	at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
	at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
	at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
	at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
	at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:820)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@2988b4e9
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
	at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:416)
	at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:150)
	... 14 more
Caused by: java.io.IOException: Failed to read zip entry source
	at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:103)
	at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:324)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:80)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	... 19 more
Caused by: java.io.EOFException
	at java.util.zip.ZipInputStream.readFully(Unknown Source)
	at java.util.zip.ZipInputStream.readLOC(Unknown Source)
	at java.util.zip.ZipInputStream.getNextEntry(Unknown Source)
	at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.getNextEntry(ZipSecureFile.java:278)
	at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:52)
	at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:100)
	... 23 more
INFO [AbstractCrawler] Default: Reprocessing any cached/orphan references...
INFO [AbstractCrawler] Default: Crawler finishing: committing documents.
INFO [AbstractCrawler] Default: 1 reference(s) processed.
INFO [CrawlerEventManager] CRAWLER_FINISHED
INFO [AbstractCrawler] Default: Crawler completed.
INFO [AbstractCrawler] Default: Crawler executed in 2 seconds.
INFO [SitemapStore] Default: Closing sitemap store...
INFO [JobSuite] Running Default: END (Fri Mar 15 11:40:02 CDT 2019)
```

I am at a bit of a loss as to what to try next.

Thanks

essiembre commented 5 years ago

In order to reproduce, can you share the faulty document and/or URL to it?
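
Also, the root cause at the bottom of your stack trace is a java.io.EOFException thrown while ZipInputStream reads the OOXML package, which usually means the parser was handed a truncated or otherwise invalid ZIP stream rather than hitting a genuine .docx parsing bug. One quick check is to iterate the ZIP entries of the file the crawler actually fetched, outside the collector entirely (a rough standalone sketch; the file path is only an example):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class DocxZipCheck {
    public static void main(String[] args) throws Exception {
        // A .docx file is an OOXML package, i.e. a plain ZIP archive.
        // If listing its entries fails with EOFException, the bytes were
        // already truncated or corrupted before reaching Tika/POI.
        String path = args.length > 0 ? args[0] : "Sample-Document.docx";
        try (InputStream in = new FileInputStream(path);
             ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            int count = 0;
            while ((entry = zip.getNextEntry()) != null) {
                System.out.println(entry.getName());
                count++;
            }
            System.out.println(count + " entries read without error.");
        }
    }
}
```

If that loop dies with the same EOFException, the document is damaged before it ever reaches the parser.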

alex-kozlowski-maven commented 5 years ago

I am crawling an intranet; however, I found a sample docx on the internet that also exhibits this issue: http://www.dhs.state.il.us/OneNetLibrary/27897/documents/Initiatives/IITAA/Sample-Document.docx Thanks

essiembre commented 5 years ago

I tried with that document and it worked just fine for me. The only difference is I did not have your custom BinaryContentTagger. If you try without it, does it get parsed properly?
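
If it still fails without the tagger, another isolated check is to feed the saved file straight to Tika, outside the collector, to see whether the parser itself has trouble with it (a minimal sketch against Tika's public API, not the Importer's internal code path; the file name is only an example):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaDocxTest {
    public static void main(String[] args) throws Exception {
        // Run the document through the same auto-detecting parser chain the
        // Importer ultimately delegates to. If this succeeds on the saved file,
        // the problem is in how the bytes reach the parser, not in the file.
        try (InputStream in = Files.newInputStream(Paths.get("Sample-Document.docx"))) {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // no character limit
            Metadata metadata = new Metadata();
            parser.parse(in, handler, metadata);
            System.out.println("Detected type: " + metadata.get(Metadata.CONTENT_TYPE));
            String text = handler.toString();
            System.out.println(text.substring(0, Math.min(200, text.length())));
        }
    }
}
```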

alex-kozlowski-maven commented 5 years ago

Looks like the problem was a library dependency version issue: I had bad third-party libs. Thank you so much for the assistance.
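
In case it helps anyone hitting the same TikaException: mismatched or duplicated POI/Tika jars on the classpath (for example, extra jars dropped into the collector's lib folder) can produce exactly this kind of failure, so checking which versions are actually resolved is a good first step. In a Maven build, for example:

```
mvn dependency:tree -Dincludes=org.apache.poi,org.apache.tika
```

The versions reported there should line up with the ones shipped with the Importer release in use.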