Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@32acd2c #72

Closed: jmrichardson closed this issue 6 years ago

jmrichardson commented 6 years ago

I am receiving many errors on what look like files that contain an embedded file: in my case, .msg files (Exchange messages) with attachments. .msg files without attachments appear to be imported correctly. The errors I am receiving are similar to:

com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@32acd2c
        at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:154)
        at com.norconex.importer.Importer.parseDocument(Importer.java:414)
        at com.norconex.importer.Importer.importDocument(Importer.java:313)
        at com.norconex.importer.Importer.doImportDocument(Importer.java:266)
        at com.norconex.importer.Importer.importDocument(Importer.java:190)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.fs.crawler.FilesystemCrawler.executeImporterPipeline(FilesystemCrawler.java:224)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:812)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@32acd2c
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
        at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:416)
        at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:150)
        ... 14 more
Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
        at org.apache.poi.util.LittleEndian.readInt(LittleEndian.java:662)
        at org.apache.poi.hmef.CompressedRTF.decompress(CompressedRTF.java:73)
        at org.apache.poi.util.LZWDecompresser.decompress(LZWDecompresser.java:81)
        at org.apache.poi.hmef.attribute.MAPIRtfAttribute.<init>(MAPIRtfAttribute.java:42)
        at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:270)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 19 more
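
Reading the trace from the bottom up, the innermost "Caused by" is the POI BufferUnderrunException raised while decompressing the message's compressed RTF body. Tika reports it as TIKA-198, and the Importer wraps that again in a DocumentParserException. As a rough illustration of pulling the innermost cause out of such a chain, here is a minimal sketch. The class and method names (RootCause, rootCause) are hypothetical, and the exceptions in main are plain stand-ins for the Norconex, Tika and POI types, not the real classes.

```java
import java.io.IOException;

final class RootCause {

    /** Follows getCause() links until the innermost throwable is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins for the three layers seen in the trace above
        // (assumption: these only mimic the nesting, not the real types).
        Exception poi = new IOException("buffer underrun");
        Exception tika = new Exception("TIKA-198: Illegal IOException", poi);
        Exception importer = new RuntimeException(tika);

        // prints "buffer underrun"
        System.out.println(rootCause(importer).getMessage());
    }
}
```

Logging the root cause alongside the failing file name can make it easier to group failures that share the same underlying parser problem.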

Looking through other user issues, this appears to be related to #67. However, the following does not appear to be a valid configuration for the filesystem crawler:

<documentFetcher detectContentType="true" detectCharset="true"/>

Looking at the associated meta.txt in the error directory, the content-type and content-encoding do not appear to have been determined:

collector.content-type = 
document.contentFamily = email
collector.lastmodified = 1307996638000
collector.content-encoding = 
collector.is-crawl-new = true
document.contentType = application/vnd.ms-outlook
collector.filesize = 46592
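
When many files end up in the error directory, it can help to list which metadata fields came back empty. The sketch below assumes the meta.txt file uses the Java properties "key = value" layout shown above. The class and method names (MetaBlankFields, blankKeys) are hypothetical and not part of the Norconex API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

final class MetaBlankFields {

    /** Returns the keys whose value is empty or whitespace, in sorted order. */
    static List<String> blankKeys(String metaText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(metaText));
        List<String> blank = new ArrayList<>();
        // TreeSet gives a stable, alphabetical order for the report.
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            if (props.getProperty(key).trim().isEmpty()) {
                blank.add(key);
            }
        }
        return blank;
    }

    public static void main(String[] args) throws IOException {
        // The metadata lines quoted above, inlined so the sketch is self-contained.
        String meta = String.join("\n",
                "collector.content-type = ",
                "document.contentFamily = email",
                "collector.lastmodified = 1307996638000",
                "collector.content-encoding = ",
                "collector.is-crawl-new = true",
                "document.contentType = application/vnd.ms-outlook",
                "collector.filesize = 46592");

        // prints [collector.content-encoding, collector.content-type]
        System.out.println(blankKeys(meta));
    }
}
```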

I have also tried different variations of the documentParserFactory setting for a .msg containing a .msg, but had no luck there:

<documentParserFactory>
  <embedded>
    <noExtractContainerContentTypes>application/vnd.ms-outlook</noExtractContainerContentTypes>
    <noExtractEmbeddedContentTypes>application/vnd.ms-outlook</noExtractEmbeddedContentTypes>
  </embedded>
</documentParserFactory>

Thank you for your help

essiembre commented 6 years ago

Can you share one of your .msg files with an attachment? You can use email if it is too sensitive.

The same content-type/charset detection feature could be added to the Filesystem Collector document fetcher, but before that, can you try to parse the .msg file directly using only the Importer? (You may have to download the Importer separately to get its launch scripts.)

When the content type and charset are not explicitly provided (the collector does provide them), the Importer should try to detect them.

jmrichardson commented 6 years ago

Sorry for the delay. I am working on getting you a test file, but in the meantime I ran the Importer on a problem file:

./importer.sh -i ./SHSS_0003022_CONFIDENTIAL.msg
Dec 14, 2017 10:29:47 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

WARN  [Importer] Could not import /home/es/elastic/norconex-importer-2.8.0/./SHSS_0003022_CONFIDENTIAL.msg
com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@ae13544
        at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:154)
        at com.norconex.importer.Importer.parseDocument(Importer.java:414)
        at com.norconex.importer.Importer.importDocument(Importer.java:313)
        at com.norconex.importer.Importer.doImportDocument(Importer.java:266)
        at com.norconex.importer.Importer.importDocument(Importer.java:190)
        at com.norconex.importer.Importer.importDocument(Importer.java:179)
        at com.norconex.importer.ImporterLauncher.launch(ImporterLauncher.java:102)
        at com.norconex.importer.Importer.main(Importer.java:118)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@ae13544
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
        at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:416)
        at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:150)
        ... 7 more
Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
        at org.apache.poi.util.LittleEndian.readInt(LittleEndian.java:662)
        at org.apache.poi.hmef.CompressedRTF.decompress(CompressedRTF.java:73)
        at org.apache.poi.util.LZWDecompresser.decompress(LZWDecompresser.java:81)
        at org.apache.poi.hmef.attribute.MAPIRtfAttribute.<init>(MAPIRtfAttribute.java:42)
        at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:270)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 12 more
   ERROR: /home/es/elastic/norconex-importer-2.8.0/./SHSS_0003022_CONFIDENTIAL.msg (org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@ae13544)

It seems that some embedded .msg files work but others don't. I am looking into what could be the cause, but wanted to share what I have so far.

jmrichardson commented 6 years ago

Hi, I am sending you an example problem .msg file via email. Thank you very much for your help.

essiembre commented 6 years ago

I was able to reproduce with your file. I created a ticket with Apache Tika: https://issues.apache.org/jira/browse/TIKA-2530

Until it gets fixed in Tika, I have made a fix in the latest Importer snapshot that I invite you to test. Please confirm whether it works for you.

jmrichardson commented 6 years ago

Thank you so much! I am re-running the crawler with the latest importer snapshot installed. I will keep you posted on the results. Thanks.

jmrichardson commented 6 years ago

The latest update fixed the issue! Thanks again for your help :)

tballison commented 6 years ago

@jmrichardson, would you be willing to share the triggering file with me personally? I'd like to dig in to see if this is a problem with Apache POI or if @essiembre's fix is the best we can do.

jmrichardson commented 6 years ago

@tballison, yes, I can share it with you. Please send me your email address and I will send it over (or share it any other way you prefer).

tballison commented 6 years ago

tallison AT mitre . org

There's a chance I won't have time to look until the new year, but it would be helpful for figuring out if we need to fix something in POI. Thank you!