Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Encoding correctly detected but not taken into account when committing #202

Closed: niels closed this issue 8 years ago

niels commented 8 years ago

When I crawl a non-Unicode document (or more precisely: a document in a charset other than my platform default), the crawler correctly detects the document's encoding (by inspecting the "Content-Type" HTTP response header) and adds it to the document's metadata. During further processing, however, the per-document encoding is ignored and the platform default is used instead.

This often leads to an incorrect encoding being applied, as e.g. the filesystem and Elasticsearch committers persist documents in the platform default encoding.
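To illustrate the effect in plain Java (independent of any Norconex code): non-ASCII characters stored as ISO-8859-1 bytes turn into replacement characters as soon as they are read back as UTF-8.

import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    public static void main(String[] args) {
        // "ö" is the single byte 0xF6 in ISO-8859-1.
        byte[] isoBytes = "Nörconex".getBytes(StandardCharsets.ISO_8859_1);

        // 0xF6 does not start a valid UTF-8 sequence here, so decoding as
        // UTF-8 yields the replacement character: "N�rconex".
        System.out.println(new String(isoBytes, StandardCharsets.UTF_8));
    }
}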

For example, consider https://herimedia.com/norconex-test.html which is sent over the wire ISO-8859-1 encoded. The following configuration will crawl the document and save the raw HTML into ./crawled-files:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="test-collector">
  <crawlers>
    <crawler id="test-crawler">
      <maxDocuments>1</maxDocuments>
      <robotsTxt class="com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider" ignore="true" />
      <sitemapResolverFactory class="com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory" ignore="true" />

      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./crawled-files</directory>
      </committer>

      <importer>
        <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
          <ignoredContentTypes>.*</ignoredContentTypes>
        </documentParserFactory>
      </importer>

      <startURLs stayOnDomain="true">
        <url>https://herimedia.com/norconex-test.html</url>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>

If your platform / Java default encoding is UTF-8, this will mangle the file as follows:

[screenshot: the committed file viewed as UTF-8, with the non-ASCII characters replaced by �]

By contrast, if we modify FileSystemCommitter slightly to take the previously detected encoding into account, the file gets written out correctly (irrespective of the platform encoding!):

[screenshot: the same file written out correctly after the FileSystemCommitter modification]

While a good proof-of-concept, the above commit probably isn't suitable for merging as it pulls the entire document into memory (whereas previously it was streamed). It would also be better to apply this conversion in an earlier step so as to avoid having to fix each committer individually.
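For anyone following along, the gist of the change is roughly the following (a sketch only, not the actual commit; how the detected charset is looked up from the document metadata is omitted here):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.IOUtils;

public class CharsetAwareWriteSketch {

    // Decode the document with the charset detected earlier and re-encode
    // it as UTF-8 before writing. Note this loads the whole document into
    // memory, which is the drawback mentioned above.
    public static void write(InputStream content, OutputStream out,
            String detectedCharset) throws IOException {
        String text = IOUtils.toString(content, Charset.forName(detectedCharset));
        out.write(text.getBytes(StandardCharsets.UTF_8));
    }
}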

Unfortunately, I am not terribly familiar with Java IO handling and couldn't immediately come up with a proper solution myself. Discussion welcome!

Thanks.

essiembre commented 8 years ago

I figured out what is going on here. The conversion of content to UTF-8 is done by the Importer module when parsing documents to extract the text. In your case you are skipping that step with your <ignoredContentTypes> tag. This means the document content is saved as-is, without converting it to anything. Because the document is originally ISO-8859-1, that is how it is saved (it's saved as a binary stream). If you open it as an ISO-8859-1 document, the characters will be just fine (I tested with an editor that detects encoding -- Notepad++). That's not a bad thing, since it matches the declaration in your HTML.

If you enable the import module, the extracted content will be saved as UTF-8. One of the primary functions of the importer module is to extract text from files (including binary files). If you want to keep the original HTML, that's when you have the problem. I am not sure how best to address this. Would you want to keep the original file but convert its encoding when it is a text file, and not when it is binary (because we cannot convert the charset used in all binary files... we have to extract the text first)? In that case the HTML-declared encoding would no longer match the actual encoding, causing other issues for other people.

Ideas are welcome, but since 99% of users will likely not skip the importing phase, I am not sure a "fix" would be appropriate. Maybe some optional feature somewhere?

What you are doing in the meantime definitely works, as long as you can trust the charset specified in the HTML (provided it is always there).

niels commented 8 years ago

Well, that makes sense. I was so fixated on our use case (plain-text HTML) that I completely forgot that the collector could just as well be used to pull in e.g. PDFs or other binaries.

As you said, a change earlier in the pipeline thus seems inappropriate. (Though perhaps the documentation could be expanded slightly to touch on the issue of character encoding. I am thinking of an addition to the GenericDocumentParserFactory documentation expanding on the warnings under the "Ignoring content types" heading).

I will instead look into using a customized GenericDocumentParserFactory / IDocumentParserFactory implementation. Perhaps I can still pipe the document through Tika but somehow get the entire HTML back. I'll check it out and report back here in case anyone is following along.

essiembre commented 8 years ago

Good idea to enhance the documentation. Because you may not be the only one in the universe to bypass parsing, would there be value in creating a new document transformer (in the Importer module) that would detect the encoding (using Tika) of text files and convert them to UTF-8? It would be relevant only when used as a pre-parse handler or when parsing is skipped for a text file. If you think it could be a good approach, should it take over the feature request for #194?
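Something along these lines, perhaps (a rough sketch of the idea only, using Tika's CharsetDetector; not an actual implementation):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.IOUtils;
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

public class Utf8ConversionSketch {

    // Detect the charset of text content and return it re-encoded as UTF-8.
    public static InputStream toUtf8(InputStream content) throws IOException {
        byte[] bytes = IOUtils.toByteArray(content);

        // Tika guesses the charset from the raw bytes (and any declaration
        // it recognizes in them).
        CharsetDetector detector = new CharsetDetector();
        detector.setText(bytes);
        CharsetMatch match = detector.detect();
        Charset source = Charset.forName(match.getName());

        return new ByteArrayInputStream(
                new String(bytes, source).getBytes(StandardCharsets.UTF_8));
    }
}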

niels commented 8 years ago

We would definitely use such a feature, but I don't know if it is worth your time to implement it. It might be one of the more exotic use cases.

In addition, however, I wonder how the encoding is handled by the "keep downloaded files" feature. If this behaves the same way as committing unparsed files – i.e. if it ignores the Content-Type header – then that would perhaps be a more important reason to add the transformer you have mentioned.

While I fully appreciate why, in our specific configuration (committing unparsed files), the encoding instruction provided in the meta information is not taken into account, from a feature such as "keep downloaded files" I would in fact expect behaviour comparable to downloading the files straight from a browser. That is, I would expect them to be saved as per the encoding information provided in the HTTP headers. In my opinion, when looking at a "kept" downloaded file, it should not be necessary to also take the HTTP headers into account; the kept file should be readable sensibly on its own.

Regarding the feature request in #194 (force an encoding per-crawler), that might still make sense for servers that don't behave correctly, but I haven't yet personally run into one of those. So I'd say: if there is a more dynamic way to handle encoding (such as through a new transformer), a config override isn't needed unless someone can provide a real-world example of where it would be required.

essiembre commented 8 years ago

I have implemented a new CharsetTransformer in the Importer module that will detect the current character encoding of a page and convert it (if needed) to the target encoding you specify. It is available in the latest snapshot.

The javadoc mentions when best to use it (or not use it). Because you skip parsing, this new class is probably what you need. I would restrict it to relevant content types only. For example:

<transformer class="com.norconex.importer.handler.transformer.impl.CharsetTransformer"
    targetCharset="UTF-8">
    <restrictTo field="document.contentType">
        text/.*
    </restrictTo>
</transformer>

Let me know how that works for you.

niels commented 8 years ago

This looks promising, but for small documents the transformer's output is not picked up by the importer:

DEBUG [CharsetTransformer] Detected encoding: ISO-8859-1
DEBUG [Importer] Transformer "class com.norconex.importer.handler.transformer.impl.CharsetTransformer" did not return any content for: https://herimedia.com/norconex-test.html
$ cat crawled-files/**/*.cntnt
<!DOCTYPE html>
<html lang="en" class=" is-copy-enabled">
  <head>
    <meta charset="ISO-8859-1">

    <title>N�rc���x Test Site</title>
  </head>

  <body>
    N�rc���x Test Site
  </body>
</html>

I ran the test using the configuration from the top of this issue, with the transformer added as follows:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="test-collector">
  <crawlers>
    <crawler id="test-crawler">
      <maxDocuments>1</maxDocuments>
      <robotsTxt class="com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider" ignore="true" />
      <sitemapResolverFactory class="com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory" ignore="true" />

      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./crawled-files</directory>
      </committer>

      <importer>
        <preParseHandlers>
          <transformer class="com.norconex.importer.handler.transformer.impl.CharsetTransformer" />
        </preParseHandlers>

        <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
          <ignoredContentTypes>.*</ignoredContentTypes>
        </documentParserFactory>
      </importer>

      <startURLs stayOnDomain="true">
        <url>https://herimedia.com/norconex-test.html</url>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>

This seems to be an issue of output stream buffering / flushing / caching, because for larger documents the transformation is actually applied. The following configuration illustrates the problem (at least on my system; I am not sure if / to what degree the buffering is platform-dependent):

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="test-collector">
  <crawlers>
    <crawler id="test-crawler">
      <maxDocuments>2</maxDocuments>
      <robotsTxt class="com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider" ignore="true" />
      <sitemapResolverFactory class="com.norconex.collector.http.sitemap.impl.StandardSitemapResolverFactory" ignore="true" />

      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./crawled-files</directory>
      </committer>

      <importer>
        <preParseHandlers>
          <transformer class="com.norconex.importer.handler.transformer.impl.CharsetTransformer" />
        </preParseHandlers>

        <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
          <ignoredContentTypes>.*</ignoredContentTypes>
        </documentParserFactory>
      </importer>

      <startURLs stayOnDomain="true">
        <url>https://herimedia.com/norconex-test.html</url>
        <url>https://herimedia.com/norconex-test-large.html</url>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>

https://herimedia.com/norconex-test-large.html will be transformed correctly, whereas https://herimedia.com/norconex-test.html won't be transformed at all ("class com.norconex.importer.handler.transformer.impl.CharsetTransformer" did not return any content for: https://herimedia.com/norconex-test.html).

I did try adding an output.flush(); after https://github.com/Norconex/importer/blob/e7e650b15f6d2d8bed8ede70dc337a3789e75e1f/norconex-importer/src/main/java/com/norconex/importer/handler/transformer/impl/CharsetTransformer.java#L134 but it made no difference.

essiembre commented 8 years ago

It is using Apache Tika's charset detection mechanism, which, like any charset detector, cannot address 100% of use cases. The smaller the text, the higher the chance of a false detection. Because it is not rare for all HTML from the same site to share the same encoding, I will modify it so it is possible to specify an explicit source charset. That is probably the best that can be done when the charset is known. If the charset is mixed across a site and many pages are small, I am not sure what other alternative there might be to increase accuracy.

essiembre commented 8 years ago

I fixed it, and charset detection now works fine with short content (tested using your sample). I added a new optional sourceCharset option in case you know the charset of your documents upfront. Please try the latest snapshot and let me know.
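For example, assuming you know upfront that the pages are ISO-8859-1, it could presumably be combined with the earlier configuration like this (the exact usage is described in the javadoc; this is only to illustrate):

<transformer class="com.norconex.importer.handler.transformer.impl.CharsetTransformer"
    sourceCharset="ISO-8859-1" targetCharset="UTF-8">
    <restrictTo field="document.contentType">
        text/.*
    </restrictTo>
</transformer>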

niels commented 8 years ago

Oh, I see. I didn't know that the CharsetTransformer was using heuristics and instead expected it to (1) use Tika's HtmlEncodingDetector, (2) look at the Content-Type response header or (3) fall back to the platform default (in that order) when trying to determine the source encoding.

Re (1), I must again have been overly fixated on the HTML use case.

Regardless, I will test the latest snapshot on Wed. Thank you!

essiembre commented 8 years ago

If I am not mistaken, it does use Tika's encoding detector, which relies on the content type and header information when supplied. The problem was not there; it was with a specific InputStream implementation I used, which was not handling the reset() method properly. This caused short content not to be read at all by the detector (as you reported). The issue still exists in that InputStream implementation, but I was able to find a workaround so the document content is now always read properly, no matter what.
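For reference, the pattern the detection relies on is mark/reset on the content stream, roughly like this (a generic illustration only, not the actual CharsetTransformer code):

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

import org.apache.tika.parser.txt.CharsetDetector;

public class DetectThenResetSketch {

    private static final int READ_AHEAD = 64 * 1024;

    // Detect the charset of a stream without consuming it for later readers.
    // The stream must support mark/reset (e.g. a BufferedInputStream).
    public static String detectCharset(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("Stream must support mark/reset.");
        }
        in.mark(READ_AHEAD);

        byte[] buffer = new byte[READ_AHEAD];
        int read = in.read(buffer, 0, buffer.length);
        if (read <= 0) {
            in.reset();
            return null; // nothing to detect
        }

        CharsetDetector detector = new CharsetDetector();
        detector.setText(Arrays.copyOf(buffer, read));
        String charset = detector.detect().getName();

        // Rewind so whatever reads the stream next sees the full content.
        // A reset() that misbehaves at this point is exactly the kind of
        // thing that makes short documents come out empty.
        in.reset();
        return charset;
    }
}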

niels commented 8 years ago

Sorry for the delay in re-testing this. The issue has indeed been resolved! :sparkles:

essiembre commented 8 years ago

Thanks for confirming. Thanks to you, I just discovered GitHub supports emojis! :stuck_out_tongue_winking_eye: