Closed jmrichardson closed 6 years ago
Can you share one of your .msg files with an attachment? You can send it by email if it is too sensitive.
The same content-type/charset detection feature could be added to the Filesystem Collector document fetcher, but before that, can you try parsing the .msg file directly using only the Importer? (You may have to download the Importer separately to get its launch scripts.)
When the content type and charset are not explicitly provided (the collector normally provides them), the Importer should try to detect them itself.
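For illustration only, here is a minimal sketch of what signature-based content-type detection can look like for .msg files. Outlook .msg files are OLE2 compound documents, which start with a fixed 8-byte signature. The class and method names (`Ole2Sniffer`, `looksLikeOle2`) are made up for this example; this is not how the Importer or Tika implement detection, which involves much more than a single signature check.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class Ole2Sniffer {

    // Signature found at the start of every OLE2 compound file
    // (the container format used by .msg, .doc, .xls, ...).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Hypothetical helper: returns true if the stream starts with the
    // OLE2 signature. It consumes up to 8 bytes from the stream.
    public static boolean looksLikeOle2(InputStream in) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        int off = 0;
        while (off < header.length) {
            int n = in.read(header, off, header.length - off);
            if (n < 0) {
                return false; // stream is shorter than the signature
            }
            off += n;
        }
        return Arrays.equals(header, OLE2_MAGIC);
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeMsg = Arrays.copyOf(OLE2_MAGIC, 16); // signature + padding
        byte[] plainText = "Subject: hello".getBytes("US-ASCII");
        System.out.println(looksLikeOle2(new ByteArrayInputStream(fakeMsg)));
        System.out.println(looksLikeOle2(new ByteArrayInputStream(plainText)));
    }
}
```

A signature match only says the file is an OLE2 container. Telling a .msg apart from a .doc or .xls requires looking inside the container, which is why a real detector does more work than this.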
Sorry for the delay... I am working on getting you a test file, but in the meantime, I ran the Importer on a problem file:
./importer.sh -i ./SHSS_0003022_CONFIDENTIAL.msg
Dec 14, 2017 10:29:47 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
WARN [Importer] Could not import /home/es/elastic/norconex-importer-2.8.0/./SHSS_0003022_CONFIDENTIAL.msg
com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@ae13544
at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:154)
at com.norconex.importer.Importer.parseDocument(Importer.java:414)
at com.norconex.importer.Importer.importDocument(Importer.java:313)
at com.norconex.importer.Importer.doImportDocument(Importer.java:266)
at com.norconex.importer.Importer.importDocument(Importer.java:190)
at com.norconex.importer.Importer.importDocument(Importer.java:179)
at com.norconex.importer.ImporterLauncher.launch(ImporterLauncher.java:102)
at com.norconex.importer.Importer.main(Importer.java:118)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@ae13544
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:416)
at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:150)
... 7 more
Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
at org.apache.poi.util.LittleEndian.readInt(LittleEndian.java:662)
at org.apache.poi.hmef.CompressedRTF.decompress(CompressedRTF.java:73)
at org.apache.poi.util.LZWDecompresser.decompress(LZWDecompresser.java:81)
at org.apache.poi.hmef.attribute.MAPIRtfAttribute.<init>(MAPIRtfAttribute.java:42)
at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:270)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 12 more
ERROR: /home/es/elastic/norconex-importer-2.8.0/./SHSS_0003022_CONFIDENTIAL.msg (org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@ae13544)
It seems that some embedded .msg files work but others don't. I am looking into what could be the cause, but I wanted to share what I have so far.
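As a side note on the root cause in the trace above: the innermost exception is a "buffer underrun" raised while reading a little-endian integer during compressed RTF decompression. The sketch below shows, under simplifying assumptions, how that kind of failure arises when a stream ends before a 4-byte field is complete. It is a standalone illustration using only the JDK; `LittleEndianDemo` and `readIntLE` are hypothetical names, and this is not POI's actual code.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class LittleEndianDemo {

    // Reads a 32-bit little-endian integer (least significant byte first).
    // Throws EOFException if the stream ends before 4 bytes are read,
    // which is the situation a "buffer underrun" describes.
    static int readIntLE(InputStream in) throws IOException {
        int b0 = in.read();
        int b1 = in.read();
        int b2 = in.read();
        int b3 = in.read();
        if ((b0 | b1 | b2 | b3) < 0) { // read() returns -1 at end of stream
            throw new EOFException("buffer underrun");
        }
        return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
    }

    public static void main(String[] args) throws IOException {
        byte[] complete = {0x2A, 0x00, 0x00, 0x00};
        System.out.println(readIntLE(new ByteArrayInputStream(complete)));

        byte[] truncated = {0x2A, 0x00, 0x00}; // one byte short
        try {
            readIntLE(new ByteArrayInputStream(truncated));
        } catch (EOFException e) {
            System.out.println("Failed: " + e.getMessage());
        }
    }
}
```

If the compressed RTF body of a message is empty or truncated, a header read like this one fails before any decompression starts. That would be consistent with only some .msg files triggering the error, though the trace alone does not confirm it.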
Hi, I am sending you an example problem .msg file via email. Thank you very much for your help.
I was able to reproduce with your file. I created a ticket with Apache Tika: https://issues.apache.org/jira/browse/TIKA-2530
Until it gets fixed in Tika, I have made a fix that I invite you to test in the latest Importer snapshot. Please confirm whether it resolves the issue.
Thank you so much! I am re-running the crawler with the latest importer snapshot installed. I will keep you posted on the results. Thanks.
The latest update fixed the issue! Thanks again for your help :)
@jmrichardson, would you be willing to share the triggering file with me personally? I'd like to dig in to see if this is a problem with Apache POI or if @essiembre's fix is the best we can do.
@tballison, yes, I can share it with you. Please send me your email address and I will send it over (or any other way you prefer).
tallison AT mitre . org There's a chance I won't have time to look until the new year, but it would be helpful for figuring out if we need to fix something in POI. Thank you!
I am receiving many errors on what look like files that contain an embedded file. In my case, these are .msg files (Exchange messages) containing attachments. .msg files without attachments appear to be imported correctly. The errors I am receiving are similar to:
Looking through other user issues, it appears to be related to #67. However, this doesn't appear to be a valid configuration for the filesystem crawler:
<documentFetcher detectContentType="true" detectCharset="true"/>
Looking at the error directory and its associated meta.txt, the content-type and content-encoding do not appear to have been determined:
I have also tried different variations of the documentParserFactory setting for a .msg containing a .msg, but had no luck there:
Thank you for your help