Closed alex-kozlowski-maven closed 5 years ago
In order to reproduce, can you share the faulty document and/or URL to it?
I am crawling an intranet, but I found a sample docx on the internet that shows the same issue: "http://www.dhs.state.il.us/OneNetLibrary/27897/documents/Initiatives/IITAA/Sample-Document.docx" Thanks
I tried with that document and it worked just fine for me. The only difference is that I did not have your custom BinaryContentTagger. If you try without it, does it get parsed properly?
Looks like the problem was a lib dependency version issue. I had bad 3rd party libs. Thank you so much for the assistance.
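For anyone hitting the same TIKA-198 error, one way to look for this kind of dependency mix-up is to scan the collector's lib folder for the same artifact present in more than one version. The following is only a rough sketch, not something from Norconex or Tika: the class name `JarVersionCheck`, the default `lib` folder, and the assumption that jars are named `artifact-version.jar` are all mine, and the thread does not say which libraries were actually at fault.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reports artifacts that appear with more than one version.
public class JarVersionCheck {

    // Assumes the "artifact-version.jar" naming convention; the version
    // is taken to start at the first "-" that is followed by a digit.
    private static final Pattern JAR_NAME =
            Pattern.compile("^(.+?)-(\\d[\\w.\\-]*)\\.jar$");

    // Maps each artifact name to its versions, keeping only artifacts
    // that have at least two distinct versions.
    static Map<String, Set<String>> findConflicts(List<String> jarFileNames) {
        Map<String, Set<String>> versionsByArtifact = new TreeMap<>();
        for (String fileName : jarFileNames) {
            Matcher m = JAR_NAME.matcher(fileName);
            if (m.matches()) {
                versionsByArtifact
                        .computeIfAbsent(m.group(1), k -> new TreeSet<>())
                        .add(m.group(2));
            }
        }
        versionsByArtifact.values().removeIf(versions -> versions.size() < 2);
        return versionsByArtifact;
    }

    public static void main(String[] args) throws IOException {
        // "lib" is an assumed default; pass the real folder as the first argument.
        Path libDir = Paths.get(args.length > 0 ? args[0] : "lib");
        if (!Files.isDirectory(libDir)) {
            System.out.println("Not a directory: " + libDir);
            return;
        }
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : jars) {
                names.add(jar.getFileName().toString());
            }
        }
        Map<String, Set<String>> conflicts = findConflicts(names);
        if (conflicts.isEmpty()) {
            System.out.println("No duplicate artifacts found in " + libDir);
        } else {
            conflicts.forEach((artifact, versions) ->
                    System.out.println(artifact + " has multiple versions: " + versions));
        }
    }
}
```

Run it against the folder that holds the collector's jars, for example `java JarVersionCheck.java /path/to/lib`. An artifact listed with two versions is a candidate for the kind of conflict described above, though a clean result does not rule out a jar that is simply the wrong version.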
When trying to run the crawler on an intranet I am getting:
com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser
I have tried importing a specific file on its own and it works fine. However, when I let the crawler import the same document, it fails. I have also added:
<documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher" detectContentType="true" detectCharset="true"/>
and it seems to have no effect. Here is my config file: ` <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE xml>
`
Here is the debug text from the crawl:
I am at a bit of a loss as to what to try next.
Thanks