Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

TikaException: TIKA-198 #41

Closed: aleha84 closed this issue 7 years ago

aleha84 commented 7 years ago

There are a lot of exceptions of this kind in the log file.

com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: 
TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@6f8fdac3

...

Caused by: org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; 
read 0x615C316674725C7B, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document

Is this the right place to ask, or should I open an issue somewhere else?

essiembre commented 7 years ago

It could be an unsupported file format, a corrupted file, or something else. Can you attach a file causing this exception?

aleha84 commented 7 years ago

I sent you an email. All the files listed there open normally, without errors.

essiembre commented 7 years ago

Which version are you using to test? I just tried with the latest Importer snapshot and I was able to parse the first 3 of the 5 files you sent me. I sent you an email with the parsed output.

I will investigate the other two further (eq-all-en.xls and metod-2013.doc).

essiembre commented 7 years ago

I found out why some are failing. You are using HTTP Collector and the content type from the HTTP response does not match the real content type of the file. For instance, your ...agreement.rtf file is identified as application/msword when returned by your web server. In reality the document is application/rtf.
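The header value logged in the stack trace quoted at the top of this thread is consistent with that diagnosis. The following is a minimal sketch, assuming (as the POI message suggests) that the 8-byte signature is reported as a little-endian long. The class and method names are made up for illustration and are not part of POI, Tika, or the Importer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class SignatureDecoder {

    // Writes the long back out in little-endian order, which recovers
    // the bytes in the order they appear at the start of the file
    // (assuming the value was read as a little-endian long).
    static byte[] toFileBytes(long signature) {
        return ByteBuffer.allocate(Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    // Interprets the 8 leading bytes as ASCII text.
    static String toAscii(long signature) {
        return new String(toFileBytes(signature), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The "read" value from the exception message.
        System.out.println(toAscii(0x615C316674725C7BL)); // prints {\rtf1\a
    }
}
```

The decoded bytes spell `{\rtf1\a`, the usual start of an RTF file, while the "expected" value decodes to the bytes D0 CF 11 E0 A1 B1 1A E1 that open an OLE2 container.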

The Importer module will try to guess the content type when it is not provided. It otherwise uses the provided one (from the HTTP response in your case). So you should really be looking at why your web server does not return the proper content type for some documents. Fixing this will fix many of your errors.

If you do not control the site or it is otherwise impossible for you to do, we can make a feature request to always "guess" the content type instead of trusting the web server. I am not sure whether this could cause additional issues though (when guessing is wrong).
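As a rough illustration of the behaviour described above (this is not the Importer's actual code), the sketch below trusts a provided content type unless none is given or an "always detect" flag is set. The class, the method names, the `alwaysDetect` parameter, and the two-signature detector are all assumptions made for the example. Real detection, such as Tika's, covers far more formats.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, for illustration only.
class ContentTypeResolver {

    // Leading bytes of an OLE2 container (legacy .doc, .xls, .ppt, ...).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};

    // Leading bytes of an RTF document.
    private static final byte[] RTF =
            "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Guesses a content type from the first bytes of the file.
    static String detect(byte[] head) {
        if (startsWith(head, RTF)) {
            return "application/rtf";
        }
        if (startsWith(head, OLE2)) {
            // Placeholder label: the signature alone cannot tell
            // Word, Excel, and PowerPoint files apart.
            return "application/x-ole-storage";
        }
        return "application/octet-stream";
    }

    // Uses the provided type unless it is missing or detection is forced.
    static String resolve(String provided, byte[] head, boolean alwaysDetect) {
        if (alwaysDetect || provided == null || provided.isEmpty()) {
            return detect(head);
        }
        return provided;
    }

    public static void main(String[] args) {
        byte[] rtf = "{\\rtf1\\ansi".getBytes(StandardCharsets.US_ASCII);
        // Trusting the server keeps the wrong type.
        System.out.println(resolve("application/msword", rtf, false));
        // Forcing detection recovers the real one.
        System.out.println(resolve("application/msword", rtf, true));
    }
}
```

With the flag off, an RTF file served as application/msword is handed to the Office parser, which is the failure mode in this ticket. With the flag on, the server's claim is ignored, so a wrong guess by the detector would go uncorrected.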

essiembre commented 7 years ago

Your metod-2013.doc file fails due to a bug that will be fixed in the next Tika release: https://issues.apache.org/jira/browse/TIKA-2198 Until it is released, I have integrated that single fix myself. You can find it in the latest snapshot version of the Importer.

Your eq-all-en.xls file is trickier. I researched this specific exception and it seems to point to an issue with how the file was written, even though Excel can open it properly. A known workaround is to save the file in the newer Microsoft format (.xlsx). I tested this and it works.

aleha84 commented 7 years ago

I am most interested in fixing org.apache.poi.poifs.filesystem.NotOLE2FileException. The rest are isolated cases.

feature request to always "guess" the content type instead of trusting the web server

That would be great. But would it be more flexible to compare the content type sent by the server with the real content type and, if they differ, use the real one instead of the one provided by the server?

essiembre commented 7 years ago

What you're suggesting is equivalent to always using/guessing the real one. I will mark this as a feature request. I plan to add a flag to that effect.
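To spell out the equivalence, here is a small sketch with made-up method names, assuming detection always returns a non-null type. Comparing first and overriding on mismatch yields the detected type in every case, which is exactly what always detecting does.

```java
// Hypothetical sketch, for illustration only.
class OverridePolicy {

    // The suggestion: keep the server's type unless detection disagrees.
    static String compareAndOverride(String serverType, String detectedType) {
        return detectedType.equals(serverType) ? serverType : detectedType;
    }

    // The planned flag: ignore the server and use the detected type.
    static String alwaysDetect(String serverType, String detectedType) {
        return detectedType;
    }

    public static void main(String[] args) {
        String[][] cases = {
            {"application/msword", "application/rtf"}, // server is wrong
            {"application/rtf", "application/rtf"},    // server is right
            {null, "application/rtf"},                 // no header at all
        };
        for (String[] c : cases) {
            // Both policies return the same value for each case.
            System.out.println(
                    compareAndOverride(c[0], c[1]).equals(alwaysDetect(c[0], c[1])));
        }
    }
}
```

When the two types match, returning either one gives the same string. When they differ, both policies return the detected one.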

essiembre commented 7 years ago

A new snapshot release of the HTTP Collector is available. It can now detect the content type and character encoding itself instead of relying on the HTTP response headers. Enable it like this (in your crawler section):

<documentFetcher detectContentType="true" detectCharset="true"/>

FYI, when used standalone, the Importer will always try to detect the content type and character encoding when not specified.

essiembre commented 7 years ago

Stable 2.7.0 was just released with this fix.

aleha84 commented 7 years ago

great, thx

ciroppina commented 5 years ago

For all interested people: the same exception (TikaException TIKA-198) appears with Norconex Collector 2.8.1 when trying to extract RTF and DOCX documents from imported/fetched .zip archives.

SaschaHeyer commented 5 years ago

I can confirm the same behavior (org.apache.tika.exception.TikaException: TIKA-198) also for .jpg in .zip archives.

Norconex Collector 2.8.1

essiembre commented 5 years ago

Did you try <documentFetcher detectContentType="true" detectCharset="true"/>? If so, can you confirm whether you are using pre-parse handlers? If you are, did you make sure you are not performing text operations on binary files (using restrictTo)? If that is not your issue, please open a new ticket (since this one is closed) with details to reproduce.