Closed by YoungDan 7 years ago
By default, Tika is used to parse your binary documents regardless of which link extractor you use. The GenericLinkExtractor should be preferred. The problem seems to occur at the parsing level. Can you attach a file that causes this problem?
Hi,
These are the configurations we are using. We have one custom tagger that parses values from existing fields. I know that the binaries are parsed incorrectly before they reach the customTagger.
Any ideas?
config-sitemap.txt include-rules.txt post-parse-handlers.txt pre-parse-handlers.txt
Please share a URL or document causing the problem. I cannot reproduce without that. Also, have you tried using the default document fetcher instead of PhantomJS to see if it makes a difference?
Hi, here is an example: 32-22922.pdf
I ran tika-app.jar on the document with great success; everything gets parsed correctly. Thanks for helping me out! :+1:
I tried crawling it on my end and it gets parsed properly. Do you have the actual URL to your document? I suspect your web server returns a bad character encoding for the file. The character encoding used is the one from the HTTP headers when provided. With the real URL to the document, I could confirm this.
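To illustrate the kind of damage a wrong declared encoding can cause, here is a minimal sketch in plain Java. It is not the collector's code; the class name `CharsetMismatchDemo` and the sample word are made up for illustration. It decodes UTF-8 bytes as ISO-8859-1, which is what happens when a consumer trusts a header that does not match the actual bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Norconex or Tika.
class CharsetMismatchDemo {

    // Encode the text as UTF-8, then decode the bytes with the given charset.
    static String roundTrip(String text, Charset decodeWith) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, decodeWith);
    }

    public static void main(String[] args) {
        // "Räksmörgås", written with Unicode escapes to avoid source-encoding issues.
        String swedish = "R\u00e4ksm\u00f6rg\u00e5s";

        // Decoding with the charset the bytes were written in keeps the text intact.
        System.out.println(roundTrip(swedish, StandardCharsets.UTF_8));

        // Decoding with ISO-8859-1 turns each two-byte character into two wrong ones.
        System.out.println(roundTrip(swedish, StandardCharsets.ISO_8859_1));
    }
}
```

Each of the three Swedish letters occupies two bytes in UTF-8, so the mis-decoded string is three characters longer than the original.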
Hi, thanks so much. I checked the HTTP headers and it was just as you said: the encoding used was ISO-8859-1. Where can I force UTF-8 encoding on the incoming URL?
Thank you for helping out!
You have a couple of options.
You can try the CharsetTransformer as a pre-parse handler.
Also, the latest snapshot release has new flags on the document fetcher that can force it to detect the encoding rather than trusting the HTTP response headers. It may give better results, but only testing would tell. Example:
<documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher"
detectContentType="true" detectCharset="true"/>
These new flags are also available on the PhantomJS document fetcher.
Let me know if one of these options solves it for you.
The new flags are available in the stable 2.7.0 release.
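The general idea of detecting the encoding from the bytes rather than trusting the response headers can be sketched in plain Java. This is only an illustration under simplifying assumptions, not how the collector or Tika detects charsets: the hypothetical `CharsetGuessDemo.guess` method checks whether the bytes are strictly valid UTF-8 and otherwise falls back to the declared charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Norconex or Tika.
class CharsetGuessDemo {

    // Return UTF-8 if the bytes decode cleanly as UTF-8, else the declared charset.
    static Charset guess(byte[] bytes, Charset declared) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so keep what the server declared.
            return declared;
        }
    }

    public static void main(String[] args) {
        String swedish = "R\u00e4ksm\u00f6rg\u00e5s";
        byte[] utf8Bytes = swedish.getBytes(StandardCharsets.UTF_8);
        // The header claims ISO-8859-1, but the bytes are valid UTF-8.
        Charset detected = guess(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(detected + ": " + new String(utf8Bytes, detected));
    }
}
```

A real detector uses statistical heuristics and supports many encodings, so this sketch only shows why content-based detection can override a misleading header.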
Hi, we have a problem indexing binary attachments (PDF, Word, PPTX files). The files are in Swedish and need to be parsed by Tika. However, as a standard we use
<base href="/baseurl/">
along with a link to the binary: <a href="/path/to/binary">
When I use the GenericLinkExtractor with the base and a tags, I am able to extract the binaries, but the content is not parsed as UTF-8 and special Swedish characters are not indexed properly. So I tried the TikaLinkExtractor instead. With the TikaLinkExtractor I am not able to index the binaries because I cannot use the base and a tags. Is there any way to index the documents using both Tika and the base href?
Here is my configuration for the extractors.
Am I missing anything? If so, what could it be?
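For reference, the way a base href and a link href combine under standard URL resolution can be sketched with `java.net.URI`. This is a generic illustration, not the GenericLinkExtractor's implementation; the page URL `https://example.com/news/article.html` and the class name `BaseHrefDemo` are assumptions made up for the example.

```java
import java.net.URI;

// Hypothetical demo class, not part of Norconex.
class BaseHrefDemo {

    // Resolve the base href against the page URL, then the link against that base.
    static URI resolveLink(String pageUrl, String baseHref, String linkHref) {
        URI base = URI.create(pageUrl).resolve(baseHref);
        return base.resolve(linkHref);
    }

    public static void main(String[] args) {
        String page = "https://example.com/news/article.html"; // assumed page URL

        // A link starting with "/" resolves from the site root, whatever the base path is.
        System.out.println(resolveLink(page, "/baseurl/", "/path/to/binary"));

        // A relative link is resolved under the base path.
        System.out.println(resolveLink(page, "/baseurl/", "path/to/binary"));
    }
}
```

Note that with a root-relative href such as `/path/to/binary`, the base path itself does not appear in the resolved URL; only relative hrefs pick it up.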