Closed aleha84 closed 7 years ago
It could be an unsupported file format, a corrupted file, or something else. Can you attach a file causing this exception?
Sent you an email. All the files listed there open normally, without errors.
Which version are you using to test? I just tried with the latest Importer snapshot and I was able to parse the first 3 of the 5 files you sent me. I sent you an email with the parsed output.
I will investigate the other two further (eq-all-en.xls and metod-2013.doc).
I found out why some are failing. You are using HTTP Collector and the content type from the HTTP response does not match the real content type of the file. For instance, your ...agreement.rtf file is identified as application/msword when returned by your web server. In reality the document is application/rtf.
The Importer module will try to guess the content type when it is not provided. It otherwise uses the provided one (from the HTTP response in your case). So you should really be looking at why your web server does not return the proper content type for some documents. Fixing this will fix many of your errors.
If you do not control the site or it is otherwise impossible for you to do, we can make a feature request to always "guess" the content type instead of trusting the web server. I am not sure whether this could cause additional issues though (when guessing is wrong).
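To illustrate the mismatch described above, here is a minimal sketch that compares a declared content type against the first bytes of the file. This is not how the Importer or Tika detect types; the signature table, the family labels and the function names are all assumptions made up for this example, and real detectors know far more formats.

```python
from typing import Optional

# Leading "magic" bytes for the formats mentioned in this thread.
# Illustrative only: a real detector such as Tika knows many more signatures.
SIGNATURES = (
    (b"{\\rtf", "rtf"),                             # RTF documents
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy Office (.doc, .xls)
    (b"PK\x03\x04", "zip"),                         # ZIP containers (.docx, .xlsx)
)

# Which family each declared MIME type should belong to (hypothetical mapping).
EXPECTED_FAMILY = {
    "application/rtf": "rtf",
    "application/msword": "ole2",
    "application/vnd.ms-excel": "ole2",
}


def sniff_family(data: bytes) -> Optional[str]:
    """Return the format family suggested by the leading bytes, if known."""
    for magic, family in SIGNATURES:
        if data.startswith(magic):
            return family
    return None


def declared_type_is_plausible(declared: str, data: bytes) -> bool:
    """Return False only when the declared type contradicts the file's bytes."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    mime = declared.split(";", 1)[0].strip().lower()
    expected = EXPECTED_FAMILY.get(mime)
    actual = sniff_family(data)
    if expected is None or actual is None:
        return True  # nothing known to contradict
    return expected == actual


rtf_bytes = b"{\\rtf1\\ansi Hello}"
print(declared_type_is_plausible("application/msword", rtf_bytes))  # False
print(declared_type_is_plausible("application/rtf", rtf_bytes))     # True
```

An RTF body served as application/msword fails this check, which is plausibly why a parser expecting a legacy Office file rejects it.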
Your metod-2013.doc file fails due to a bug that will be fixed in the next Tika release: https://issues.apache.org/jira/browse/TIKA-2198
Until it is released, I have integrated that single fix myself; it is available in the latest snapshot version of the Importer.
Your eq-all-en.xls file is more tricky. I researched this specific exception and it seems to point to an issue with how the file was written, even if it can be opened properly by Excel. A known workaround is to save the file in the newer Microsoft format (.xlsx). I tested this and it works.
I am most interested in fixing org.apache.poi.poifs.filesystem.NotOLE2FileException. The rest are isolated cases.
feature request to always "guess" the content type instead of trusting the web server
That would be great. But would it be more flexible to compare the content type sent by the server with the real content type, and if they differ, use the real one instead of the one provided by the server?
What you're suggesting is equivalent to always using/guessing the real one. I will mark this as a feature request. I plan to add a flag to that effect.
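A tiny sketch of why the two approaches are equivalent. The function names are hypothetical and only model the decision; they are not part of the Collector or the Importer.

```python
from typing import Optional


def compare_then_override(declared: Optional[str], detected: str) -> str:
    """Proposed policy: keep the server's type unless it differs from the detected one."""
    if declared is None or declared != detected:
        return detected
    return declared


def always_detect(declared: Optional[str], detected: str) -> str:
    """Simpler policy: ignore the server's type and use the detected one."""
    return detected


cases = [
    ("application/msword", "application/rtf"),  # server is wrong
    ("application/rtf", "application/rtf"),     # server is right
    (None, "application/rtf"),                  # server sent nothing
]
for declared, detected in cases:
    # Both policies always pick the same type.
    assert compare_then_override(declared, detected) == always_detect(declared, detected)
```

When the two types agree, returning the declared one is the same as returning the detected one, so comparing first adds nothing. Either way, the result is only as good as the detection itself.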
A new snapshot release of the HTTP Collector was made which now offers to detect the content type and character encoding instead of relying on the HTTP header response. It can be enabled like this (in your crawler section):
<documentFetcher detectContentType="true" detectCharset="true"/>
FYI, when used standalone, the Importer will always try to detect the content type and character encoding when not specified.
Stable 2.7.0 was just released with this fix.
great, thx
For all interested people: the same exception (TikaException TIKA-198) appears with Norconex Collector 2.8.1 when trying to extract RTF and DOCX documents from imported/fetched .zip archives.
I can confirm the same behavior (org.apache.tika.exception.TikaException: TIKA-198) also for .jpg in .zip archives.
Norconex Collector 2.8.1
Did you try <documentFetcher detectContentType="true" detectCharset="true"/>? If so, can you confirm whether you are using pre-parse handlers? If you are, have you made sure you are not performing text operations on binary files (using restrictTo)? If that is not your issue, please open a new ticket (since this one is closed) with details to reproduce.
There are a lot of exceptions of this kind in the log file.
Is this the right place to ask, or should I open an issue somewhere else?