Norconex / importer

Norconex Importer is a Java library and command-line application for parsing and extracting content from files of any format (HTML, PDF, Word, etc.) as plain text. In addition, it lets you perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Importing DocumentParserException on docx and pptx #94

Closed: alex-kozlowski-maven closed this issue 5 years ago

alex-kozlowski-maven commented 5 years ago

When trying to run the crawler on an intranet, I am getting: com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser

I have tried to import a specific file independently and it works fine. However, when I let the crawler try to import the document, it fails. I have also added `<documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher" detectContentType="true" detectCharset="true"/>`, and it seems to have no effect.

Here is my config file (the XML tags were stripped when pasting, so only the text values survive):

```
<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE xml>
./output 20 0 true {URL}.docx .*(gif|jpg|jpeg|png|jpe|pcx|tif|css|js)$ ./configs/gcs/sdk-configuration.properties raw ./committer-queue/default
```
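
For context, a rough skeleton of what that maps onto (only the collector/crawler IDs, the fetcher, the filter regex, and the paths come from the config and log output; everything else is a placeholder, and the Google Cloud Search committer class is omitted):

```xml
<httpcollector id="Config HTTP Collector">
  <logsDir>./output/logs</logsDir>
  <progressDir>./output/progress</progressDir>
  <crawlers>
    <crawler id="Default">
      <startURLs>
        <url>{URL}.docx</url>
      </startURLs>
      <!-- other crawler settings (work dir, threads, depth, etc.) omitted -->
      <documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher"
          detectContentType="true" detectCharset="true"/>
      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="exclude">.*(gif|jpg|jpeg|png|jpe|pcx|tif|css|js)$</filter>
      </referenceFilters>
      <committer class="...">
        <!-- Google Cloud Search committer settings; queue dir: ./committer-queue/default -->
      </committer>
    </crawler>
  </crawlers>
</httpcollector>
```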

Here is the debug text from the crawl:

```
Mar 15, 2019 11:40:01 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.

INFO [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=.*(gif|jpg|jpeg|png|jpe|pcx|tif|css|js)$]
INFO [AbstractCollectorConfig] Configuration loaded: id=Config HTTP Collector; logsDir=./output/logs; progressDir=./output/progress
INFO [JobSuite] JEF work directory is: .\output\progress
INFO [JobSuite] JEF log manager is : FileLogManager
INFO [JobSuite] JEF job status store is : FileJobStatusStore
INFO [AbstractCollector] Suite of 1 crawler jobs created.
INFO [JobSuite] Initialization...
INFO [JobSuite] Previous execution detected.
INFO [JobSuite] Backing up previous execution status and log files.
INFO [JobSuite] Starting execution.
INFO [AbstractCollector] Version: Norconex HTTP Collector 2.8.1 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Collector Core 1.9.1 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Importer 2.9.0 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex JEF 4.1.0 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Committer Core 2.1.2 (Norconex Inc.)
INFO [AbstractCollector] Version: Google Cloud Search Norconex HTTP Collector Indexer Plugin v1-0.0.3 (Google Inc.)
INFO [JobSuite] Running Default: BEGIN (Fri Mar 15 11:40:02 CDT 2019)
INFO [HttpCrawler] Default: RobotsTxt support: true
INFO [HttpCrawler] Default: RobotsMeta support: true
INFO [HttpCrawler] Default: Sitemap support: false
INFO [HttpCrawler] Default: Canonical links support: true
INFO [HttpCrawler] Default: User-Agent:
INFO [GenericHttpClientFactory] SSL: Trusting all certificates.
INFO [SitemapStore] Default: Initializing sitemap store...
INFO [SitemapStore] Default: Done initializing sitemap store.
INFO [HttpCrawler] 1 start URLs identified.
INFO [CrawlerEventManager] CRAWLER_STARTED
INFO [AbstractCrawler] Default: Crawling references...
INFO [CrawlerEventManager] DOCUMENT_FETCHED: {URL}.docx
INFO [CrawlerEventManager] CREATED_ROBOTS_META: {URL}.docx
INFO [CrawlerEventManager] URLS_EXTRACTED: {URL}.docx
INFO [DebugTagger] Keep-Alive=timeout=2, max=99
INFO [DebugTagger] Transfer-Encoding=chunked
INFO [DebugTagger] collector.content-type=application/vnd.openxmlformats-officedocument.wordprocessingml.document
INFO [DebugTagger] document.contentFamily=wordprocessor
INFO [DebugTagger] Server=Apache/2.4.6 (Unix) OpenSSL/1.0.1e PHP/5.5.3
INFO [DebugTagger] collector.content-encoding=windows-1252
INFO [DebugTagger] X-Content-Type-Options=nosniff
INFO [DebugTagger] Connection=Keep-Alive
INFO [DebugTagger] document.contentEncoding=windows-1252
INFO [DebugTagger] binaryContent=
INFO [DebugTagger] Date=Fri, 15 Mar 2019 16:40:00 GMT
INFO [DebugTagger] document.reference={URL}.docx
INFO [DebugTagger] X-Frame-Options=SAMEORIGIN
INFO [DebugTagger] Cache-Control=public
INFO [DebugTagger] collector.is-crawl-new=true
INFO [DebugTagger] X-Drupal-Cache=MISS
INFO [DebugTagger] document.contentType=application/vnd.openxmlformats-officedocument.wordprocessingml.document
INFO [DebugTagger] collector.depth=0
INFO [DebugTagger] Expires=Sun, 19 Nov 1978 05:00:00 GMT
INFO [DebugTagger] Content-Language=en
INFO [DebugTagger] X-Powered-By=PHP/5.5.3
INFO [DebugTagger] Content-Type=application/vnd.openxmlformats-officedocument.wordprocessingml.document
WARN [Importer] Could not import {URL}.docx
com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@2988b4e9
	at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:154)
	at com.norconex.importer.Importer.parseDocument(Importer.java:414)
	at com.norconex.importer.Importer.importDocument(Importer.java:313)
	at com.norconex.importer.Importer.doImportDocument(Importer.java:266)
	at com.norconex.importer.Importer.importDocument(Importer.java:190)
	at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
	at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
	at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
	at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
	at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
	at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
	at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:820)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@2988b4e9
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
	at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:416)
	at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:150)
	... 14 more
Caused by: java.io.IOException: Failed to read zip entry source
	at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:103)
	at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:324)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:80)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	... 19 more
Caused by: java.io.EOFException
	at java.util.zip.ZipInputStream.readFully(Unknown Source)
	at java.util.zip.ZipInputStream.readLOC(Unknown Source)
	at java.util.zip.ZipInputStream.getNextEntry(Unknown Source)
	at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.getNextEntry(ZipSecureFile.java:278)
	at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:52)
	at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:100)
	... 23 more
INFO [AbstractCrawler] Default: Reprocessing any cached/orphan references...
INFO [AbstractCrawler] Default: Crawler finishing: committing documents.
INFO [AbstractCrawler] Default: 1 reference(s) processed.
INFO [CrawlerEventManager] CRAWLER_FINISHED
INFO [AbstractCrawler] Default: Crawler completed.
INFO [AbstractCrawler] Default: Crawler executed in 2 seconds.
INFO [SitemapStore] Default: Closing sitemap store...
INFO [JobSuite] Running Default: END (Fri Mar 15 11:40:02 CDT 2019)
```

I am at a bit of a loss as to what to try next.

Thanks

essiembre commented 5 years ago

In order to reproduce, can you share the faulty document and/or URL to it?
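
Also, the root cause at the bottom of your stack trace is a java.io.EOFException thrown while ZipInputStream reads the OOXML package, which usually means the parser was handed a truncated or otherwise invalid ZIP stream rather than hitting a genuine .docx parsing bug. One quick check is to iterate the ZIP entries of the file the crawler actually fetched, outside the collector entirely (a rough standalone sketch; the file path is only an example):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class DocxZipCheck {
    public static void main(String[] args) throws Exception {
        // A .docx file is an OOXML package, i.e. a plain ZIP archive.
        // If listing its entries fails with EOFException, the bytes were
        // already truncated or corrupted before reaching Tika/POI.
        String path = args.length > 0 ? args[0] : "Sample-Document.docx";
        try (InputStream in = new FileInputStream(path);
             ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            int count = 0;
            while ((entry = zip.getNextEntry()) != null) {
                System.out.println(entry.getName());
                count++;
            }
            System.out.println(count + " entries read without error.");
        }
    }
}
```

If that loop dies with the same EOFException, the document is damaged before it ever reaches the parser.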

alex-kozlowski-maven commented 5 years ago

I am crawling an intranet; however, I found a sample docx on the internet that also exhibits this issue: http://www.dhs.state.il.us/OneNetLibrary/27897/documents/Initiatives/IITAA/Sample-Document.docx Thanks

essiembre commented 5 years ago

I tried with that document and it worked just fine for me. The only difference is I did not have your custom BinaryContentTagger. If you try without it, does it get parsed properly?
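
If it still fails without the tagger, another isolated check is to feed the saved file straight to Tika, outside the collector, to see whether the parser itself has trouble with it (a minimal sketch against Tika's public API, not the Importer's internal code path; the file name is only an example):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaDocxTest {
    public static void main(String[] args) throws Exception {
        // Run the document through the same auto-detecting parser chain the
        // Importer ultimately delegates to. If this succeeds on the saved file,
        // the problem is in how the bytes reach the parser, not in the file.
        try (InputStream in = Files.newInputStream(Paths.get("Sample-Document.docx"))) {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // no character limit
            Metadata metadata = new Metadata();
            parser.parse(in, handler, metadata);
            System.out.println("Detected type: " + metadata.get(Metadata.CONTENT_TYPE));
            String text = handler.toString();
            System.out.println(text.substring(0, Math.min(200, text.length())));
        }
    }
}
```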

alex-kozlowski-maven commented 5 years ago

Looks like the problem was a library dependency version issue: I had bad third-party libs. Thank you so much for the assistance.
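
In case it helps anyone hitting the same TikaException: mismatched or duplicated POI/Tika jars on the classpath (for example, extra jars dropped into the collector's lib folder) can produce exactly this kind of failure, so checking which versions are actually resolved is a good first step. In a Maven build, for example:

```
mvn dependency:tree -Dincludes=org.apache.poi,org.apache.tika
```

The versions reported there should line up with the ones shipped with the Importer release in use.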