Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Best way to handle incorrect encoding information #110

Closed: jetnet closed this issue 4 years ago

jetnet commented 4 years ago

Hello Pascal,

We have to crawl many web servers, and some of them provide incorrect encoding information, e.g.:

> GET / HTTP/1.1
> Host: www.zeichentrickserien.de
> User-Agent: HTTP-Collector
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 23 Apr 2020 15:41:46 GMT
< Server: Apache/2.4.38 (Debian) mod_fastcgi/mod_fastcgi-SNAP-0910052141 mod_fcgid/2.3.9 OpenSSL/1.1.1d mod_wsgi/4.6.5 Python/2.7
< Vary: Accept-Encoding
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=ISO-8559-1
<
<!DOCTYPE html>
<html lang="de">

<head>
   <meta charset="utf-8">

The Importer module cannot parse it:

INFO -            REJECTED_ERROR: http://www.zeichentrickserien.de/ (java.nio.charset.UnsupportedCharsetException: ISO-8559-1)

Or it could be the other way around: the HTTP header sets the correct encoding and the page itself a wrong one.

I was able to work around one case by removing the wrong header during the pre-parse stage, but that was rather a hack. I'm wondering if you could suggest a proper way to handle this.
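For illustration, the hack was along these lines (a rough sketch assuming the Importer 2.x DeleteTagger handler; the exact field name may differ in other setups):

<preParseHandlers>
  <!-- Sketch of the hack: drop the bogus Content-Type metadata field
       so it no longer overrides charset handling. -->
  <tagger class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
    <fields>Content-Type</fields>
  </tagger>
</preParseHandlers>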

Thank you!

essiembre commented 4 years ago

Yes, there are a few places where you can rely on detection by the crawler instead of relying on the HTTP response. The earliest place I would recommend is right when you fetch the document: have a look at GenericDocumentFetcher and set its detectCharset attribute to true (this goes in your HTTP Collector crawler config).
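In the crawler config, that would look something like this (a minimal sketch; detectCharset is the attribute mentioned above, and detectContentType is its companion per the GenericDocumentFetcher documentation):

<!-- Sketch: have the crawler detect the charset (and content type) itself
     instead of trusting the HTTP response headers. -->
<documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher"
    detectContentType="true" detectCharset="true" />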

If you want more control and need an explicit conversion from one charset to another for only some documents, I invite you to look at the Importer's CharsetTagger and CharsetTransformer.
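For example, a pre-parse conversion could look roughly like this (a sketch only; class name per Importer 2.x, and the restrictTo filter with its example URL pattern is an illustrative assumption):

<preParseHandlers>
  <!-- Sketch: force the body from Latin-1 to UTF-8 for matching documents only. -->
  <transformer class="com.norconex.importer.handler.transformer.impl.CharsetTransformer"
      sourceCharset="ISO-8859-1" targetCharset="UTF-8">
    <restrictTo field="document.reference">https://www\.example\.com/.*</restrictTo>
  </transformer>
</preParseHandlers>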

jetnet commented 4 years ago

Thanks for the hint about detectCharset; I was not aware of that feature. I tried it, but got the same error as above.

essiembre commented 4 years ago

Can you share your config? There is something strange here, as ISO-8859-1 is part of Java's standard charsets, so it should be supported: https://docs.oracle.com/javase/8/docs/api/java/nio/charset/StandardCharsets.html

jetnet commented 4 years ago

uploaded: transfer.sh/q1lZb/zt.xml

essiembre commented 4 years ago

I was able to reproduce with your config and found the cause: I had misread your detected charset. The page appears to be UTF-8, but the server returns Content-Type: text/html; charset=ISO-8559-1.

I first thought that was the ISO Latin 1 standard, which Java supports, but that one is rather ISO-8859-1 (close, but with two 8s instead of two 5s).
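To illustrate the difference, a quick standalone Java check (not collector code):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetCheck {
    public static void main(String[] args) {
        // ISO Latin 1 is a standard JDK charset, always available:
        System.out.println(StandardCharsets.ISO_8859_1);        // ISO-8859-1
        System.out.println(Charset.isSupported("ISO-8859-1"));  // true

        // The name from the HTTP header is well-formed but unregistered:
        System.out.println(Charset.isSupported("ISO-8559-1"));  // false

        // And this is exactly what failed during the crawl:
        Charset.forName("ISO-8559-1"); // throws UnsupportedCharsetException
    }
}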

If you know the site owner, you may want to suggest fixing the charset in the HTTP response header.

In any case, I just created a new snapshot version of the HTTP Collector, which logs a warning instead of rejecting the document. If you do not like the warnings, you can change the log level to ERROR in log4j.properties for com.norconex.collector.http.pipeline.importer.MetadataFetcherStage and com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.
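In log4j.properties that would be (standard log4j 1.x logger syntax, with the class names above):

# Hide the charset warnings from the two fetch stages:
log4j.logger.com.norconex.collector.http.pipeline.importer.MetadataFetcherStage=ERROR
log4j.logger.com.norconex.collector.http.pipeline.importer.DocumentFetcherStage=ERROR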

Please confirm this new version works for you.

jetnet commented 4 years ago

Thank you! I'm testing it: the crawler does not stop anymore, but it does not extract links from the root page. I'm looking into this and will let you know how it goes. Thanks again!

jetnet commented 4 years ago

The new snapshot has no problems with other websites, but it does not extract the links from the root page of this website only. There is no URLS_EXTRACTED crawl event at all. The root page contains some malformed <a ...> tags, but that should not be a problem: I tested the crawler with similar broken tags without issues. So the only difference I see here is the warning:

WARN  [DocumentFetcherStage] Unsupported character encoding "ISO-8559-1" defined in "Content-Type" HTTP response header.

Setting the log level to debug for all events does not give any hints. The crawler just does not extract links from the root page.

Could you please take a look when you have time? I uploaded a simplified config: transfer.sh/cfraT/zt-simple.xml

essiembre commented 4 years ago

I get a Not Found error when trying to download your file. Can you please attach it here instead?

jetnet commented 4 years ago

uploaded again: 0x0.st/ipfR.xml

essiembre commented 4 years ago

Try the latest snapshot. The issue here was that both charset AND content-type detection were failing, because they are in the same HTTP header that has the wrong charset. By default, the URL extractor expects text/html and a few equivalents. I updated it so that auto-detection is done on the content when parsing it from the HTTP headers fails. I tested with your page and it worked. Please confirm.
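As a side note, if you ever want to control the accepted content types yourself, the link extractor can be configured along these lines (a sketch; class and element names per the HTTP Collector 2.x GenericLinkExtractor documentation, and the list shown is illustrative):

<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
    <!-- Broaden the accepted types beyond the text/html defaults: -->
    <contentTypes>text/html, application/xhtml+xml, text/plain</contentTypes>
  </extractor>
</linkExtractors>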

jetnet commented 4 years ago

Great support again! Thank you very much!