Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

TikaLinkExtractor does not handle base href tag. #335

Closed. YoungDan closed this issue 7 years ago.

YoungDan commented 7 years ago

Hi, we have a problem indexing binary attachments (PDF, Word, PPTX). The files are in Swedish and need to be parsed by Tika. However, our pages use a `<base href="/baseurl/">` tag along with relative links to the binaries, such as `<a href="/path/to/binary">`.

When I use the GenericLinkExtractor with the `base` and `a` tags, I am able to extract the binaries, but the content is not parsed as UTF-8, so special Swedish characters are not indexed properly. I then tried the TikaLinkExtractor instead, but with it I am not able to index the binaries because I cannot configure the `base` and `a` tags. Is there any way of indexing the documents using both Tika and the base href?

Here is my configuration for the extractors:

```xml
<linkExtractors>
    <!--extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor" charset="UTF-8" >
        <tags>
            <tag name="base" attribute="href" />
            <tag name="a" attribute="href" />
        </tags>
    </extractor-->
</linkExtractors>
```

Am I missing anything? What could it be?

essiembre commented 7 years ago

By default, Tika is used to parse your binary documents regardless of which link extractor you use. The GenericLinkExtractor should be preferred. The problem seems to occur at the parsing level. Can you attach a file causing this problem?

YoungDan commented 7 years ago

Hi, these are the configurations we are using. We have one custom tagger that parses values from existing fields. I know the binaries are parsed incorrectly before they reach the custom tagger.
Any ideas?

config-sitemap.txt include-rules.txt post-parse-handlers.txt pre-parse-handlers.txt

essiembre commented 7 years ago

Please share a URL or document causing the problem. I cannot reproduce without that. Also, have you tried using the default document fetcher instead of PhantomJS to see if it makes a difference?

YoungDan commented 7 years ago

Hi, Here is an example 32-22922.pdf

I have tried running tika-app.jar on the document with great success; everything gets parsed correctly. Thanks for helping me out! :+1:

essiembre commented 7 years ago

I tried crawling it on my end and it gets parsed properly. Do you have the actual URL to your document? I suspect your web server returns the wrong character encoding for the file: when provided, the character encoding from the HTTP response headers is the one used. With the real URL to the document, I could confirm this.
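To illustrate why a wrong header charset corrupts Swedish characters, here is a small standalone Java sketch (not part of the crawler itself): it takes UTF-8 encoded bytes, as the server likely stores them, and decodes them with the charset the headers claim (ISO-8859-1), producing the typical mojibake.

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    public static void main(String[] args) {
        // A Swedish string containing non-ASCII characters.
        String original = "Välkommen till vår sida";

        // Bytes as the server actually stores the document (UTF-8)...
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // ...but decoded with the charset the HTTP headers claim (ISO-8859-1).
        String misdecoded = new String(utf8Bytes, StandardCharsets.ISO_8859_1);

        // Each two-byte UTF-8 sequence becomes two Latin-1 characters.
        System.out.println(misdecoded); // prints "VÃ¤lkommen till vÃ¥r sida"
    }
}
```

This is exactly the kind of corruption you would see in the index when the fetcher trusts an incorrect `Content-Type` charset.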

YoungDan commented 7 years ago

Hi, thanks so much. I checked the HTTP headers and it was just as you said: the encoding used was ISO-8859-1. Where can I force UTF-8 encoding on the incoming URL?

Thank you for helping out!

essiembre commented 7 years ago

You have a couple of options.

You can try the CharsetTransformer as a pre-parse handler.
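A minimal pre-parse handler sketch along those lines, assuming the CharsetTransformer class and attribute names from the Norconex Importer module (verify the exact attributes against your Importer version; leaving `sourceCharset` out should let it be detected):

```xml
<preParseHandlers>
    <transformer class="com.norconex.importer.transformer.impl.CharsetTransformer"
        sourceCharset="ISO-8859-1" targetCharset="UTF-8" />
</preParseHandlers>
```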

Also, the latest snapshot release has new flags on the document fetcher that can force it to detect the encoding rather than trusting the HTTP response headers. It may give better results, but only testing would tell. Example:

```xml
<documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher"
    detectContentType="true" detectCharset="true"/>
```

These new flags are also available on the PhantomJS document fetcher.
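Assuming the PhantomJS fetcher accepts the same attributes, that would look something like this (class name taken from the collector's `fetch.impl` package; verify against your release):

```xml
<documentFetcher class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher"
    detectContentType="true" detectCharset="true"/>
```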

Let me know if one of these options solves it for you.

essiembre commented 7 years ago

The new flags are available in stable 2.7.0 release.