buriy / python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!
https://github.com/buriy/python-readability
Apache License 2.0

Try UnicodeDammit instead of using chardet library (or maybe combine the approaches). #42

Open · buriy opened this issue 11 years ago

Telofy commented 11 years ago

> Can UnicodeDammit guess the incorrect encoding specified in the headers?

You mean guess the correct encoding if the wrong encoding or no encoding was specified in the HTTP headers? The problem with the encoding specification in the HTTP header is that people often forget to set it. The default according to the HTTP specification is ISO-8859-1, which is rarely the actual encoding, especially when you’re dealing with non-English pages.

I’m using the requests library in just about all of my programs, which “correctly” assumes ISO-8859-1 in such cases. To actually decode a page, my programs check the raw Content-Type header to see whether the charset was set explicitly or merely inferred; if it was inferred, they ignore it. Then they call UnicodeDammit with the encoded response as markup and the HTTP-level encoding in override_encodings. That library then tries countless sources of explicit encoding specifications in the (HTML or XML) content.
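In outline, that flow might look like the sketch below (assuming the requests and bs4 packages; the fetch_unicode name and the None check are illustrative, not the code from the project linked further down):

```python
import requests
from bs4 import UnicodeDammit

def fetch_unicode(url):
    response = requests.get(url)
    # requests falls back on ISO-8859-1 when the Content-Type header carries
    # no charset, so only trust response.encoding if it was declared explicitly.
    content_type = response.headers.get('Content-Type', '')
    override = [response.encoding] if 'charset' in content_type.lower() else []
    dammit = UnicodeDammit(response.content, override_encodings=override,
                           is_html=True)
    if dammit.unicode_markup is None:
        raise ValueError('could not determine the encoding of %s' % url)
    return dammit.unicode_markup
```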

Two problems remain, however. If none of the tried encodings can decode the response, I’m still in luck, because the program at least knows that the decoding was unsuccessful; if, however, one of the incorrect encodings happens to decode the page without error, no exception is raised but the text is garbled. Barring comparisons against language-specific character n-gram models, I can’t think of any automatic method to detect such errors.

After all the sources of explicit encoding specifications have been tried, UnicodeDammit will try cchardet if it is available, or chardet if it is not. If neither is available, it’ll skip that step, which is what I’m currently doing for mostly German and English content. So far no one has alerted me to any decoding errors, and we’re fetching some 100k pages per day, so it must be working pretty okay.
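That optional-dependency chain amounts to something like this (an illustrative sketch of the behaviour described above, not the actual bs4 internals):

```python
try:
    import cchardet as chardet_module  # fast C implementation, preferred
except ImportError:
    try:
        import chardet as chardet_module  # pure-Python fallback
    except ImportError:
        chardet_module = None  # neither installed: skip byte-level guessing

def guess_encoding(data):
    if chardet_module is None:
        return None
    return chardet_module.detect(data).get('encoding')
```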

Here’s some similar code from a private project of mine: https://bitbucket.org/Telofy/resyndicator/src/b0fdce864919bbbf68561142442428e09fb26112/resyndicator/fetchers.py?at=master#cl-52. The wrapper (by now only needed to raise the exception): https://bitbucket.org/Telofy/utilofies/src/4a1852218b7fe59fbb28fb1bad5a7b26be2a5a46/utilofies/bslib.py?at=master#cl-14.

ghost commented 11 years ago

I use this package to parse content in 20 languages, and had to write my own shim to ensure that I only feed unicode to the Document class.

I tried many approaches to developing this shim and finally found something that works across all the languages I tested.

Attempt 1: Using chardet: Chardet worked very well for European languages, but fell short when it came to CJK encodings, many of which have superset encodings. Also, the large amount of ASCII markup at the start of HTML content throws it off.

Attempt 2: Reading headers and charset declarations: I would look for these flags in the response headers and the text, then decode. Unfortunately, people lie, especially on CJK sites. Many Chinese and Korean sites state that they use big5/utf-8 but don't actually respond with content in that encoding. That, or the headers say 'utf-8' while the in-page declaration says charset='big5'.

Attempt 3: Using UnicodeDammit: UnicodeDammit is pretty cool. It's aware that HTML tags/cruft need to be stripped before guessing the encoding, and it tries many encodings before giving up. Unfortunately, it is useless for Korean pages, which often feature broken encodings.

Final Solution So Far: I used the code in encoding.py to strip tags with a regex. Then I run cchardet (it's a bit faster) and keep a list of superset encodings I have encountered (for instance, 'gb18030' is a superset of 'gb2312') that are commonly "lied about" in the CJK space.

Because of this, I think the existing encoding support is a sound algorithm. A common problem seems to be these "lookalike" encodings. A call to cchardet instead of chardet could save time, and a list of alternate encodings could be used to correct the detected encoding.

For instance, just replace the detected encodings:

- 'big5' should be decoded as 'big5hkscs'
- 'gb2312' should be decoded as 'gb18030'
- 'ascii' should be decoded as 'utf-8'
- 'iso-8859-1' should be decoded as 'cp1252'
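Put together, the strip-then-detect approach with such a replacement table might look roughly like this (a sketch assuming cchardet is installed; the function name and regex are illustrative, not the actual encoding.py code):

```python
import re

import cchardet  # assumed installed; otherwise fall back on chardet

# Detected encoding -> superset encoding that sites commonly "lie" about.
SUPERSET_ENCODINGS = {
    'big5': 'big5hkscs',
    'gb2312': 'gb18030',
    'ascii': 'utf-8',
    'iso-8859-1': 'cp1252',
}

TAG_RE = re.compile(br'<[^>]+>')  # crude tag stripping, good enough for detection

def guess_page_encoding(raw_bytes):
    # Strip markup first so the detector sees mostly natural-language bytes.
    stripped = TAG_RE.sub(b' ', raw_bytes)
    detected = (cchardet.detect(stripped).get('encoding') or 'utf-8').lower()
    # Upgrade "lookalike" encodings to their supersets.
    return SUPERSET_ENCODINGS.get(detected, detected)
```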

Telofy commented 11 years ago

Thanks, that’s very interesting! I should run a few more tests with cchardet and preprocessed HTML (with tags stripped); that’ll probably restore my confidence in that method. I’ve only been working with English and German sites so far (apart from some that may have found their way in by accident), and the only kind of “lie” I often encountered there was that no HTTP encoding was specified at all.

Telofy commented 10 years ago

There was another problem with my approach, which forced me to reimplement some small parts of UnicodeDammit: https://bitbucket.org/Telofy/utilofies/src/0d8cdc3ae5a0a08e7fb5906d96f0d8e2284751d1/utilofies/bslib.py?at=master#cl-15.

The encoding problem was reported to me, which usually means that it must’ve occurred on a number of pages, and I vaguely remember that I already ran into it and solved it for the old UnicodeDammit sometime in 2011.

When a page is declared, say, UTF-8 consistently everywhere (or anywhere) but contains a single illegal byte sequence—for example in an HTML comment like this one—then the “correct” encoding, UTF-8, is discarded. Moreover, Windows-1252 is somehow able to decode it, so that all umlauts and ligatures are mucked up.

When all declared encodings fail, I now immediately fall back on forcing the first one of them. Only if no encodings at all were declared anywhere do I fall back on UTF-8. I hope this will alleviate the problem.
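A rough sketch of that fallback order (the helper name is made up; declared_encodings stands for whatever was collected from the HTTP header and the in-document declarations):

```python
def decode_with_fallback(raw_bytes, declared_encodings):
    # Try every declared encoding strictly first.
    for encoding in declared_encodings:
        try:
            return raw_bytes.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    if declared_encodings:
        # All declarations failed (e.g. because of a single illegal byte
        # sequence): force the first declared encoding rather than letting
        # a lookalike such as windows-1252 "win".
        return raw_bytes.decode(declared_encodings[0], errors='replace')
    # Nothing declared anywhere: fall back on UTF-8.
    return raw_bytes.decode('utf-8', errors='replace')
```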

ghost commented 10 years ago

One thing that I have done recently in my implementation is to use the HTTP headers to try to catch the encoding before sending anything to readability. I check the 'Content-Type' header and use a regex to extract the charset. If that works, I decode to unicode, then encode the text as UTF-8 and send it to Readability. The trick for detecting UTF-8 then catches it really quickly, and all is well.
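Something along these lines (a sketch; the regex, function name, and error handling are illustrative, not the actual implementation):

```python
import re

CHARSET_RE = re.compile(r'charset=["\']?([\w-]+)', re.I)

def normalise_to_utf8(raw_bytes, content_type_header):
    """Return UTF-8-encoded bytes if the header declares a usable charset, else None."""
    match = CHARSET_RE.search(content_type_header or '')
    if not match:
        return None  # no explicit charset: fall through to other detection
    try:
        text = raw_bytes.decode(match.group(1))
    except (UnicodeDecodeError, LookupError):
        return None  # the declared charset was wrong or unknown
    return text.encode('utf-8')
```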

FWIW, here are some URLs that I use to test Readability:

Of note are the following:

The others have various other issues with Readability, and I keep adding to this list so that I can test any changes I make to readability against it.

buriy commented 10 years ago

I believe this now looks more like a standalone package that deals only with document encoding detection, based on the document text, meta encoding, and HTTP responses, which the readability module could import and reuse. What do you think? Could you make such a package?