aaronsw / html2text

Convert HTML to Markdown-formatted text.
http://www.aaronsw.com/2002/html2text/
GNU General Public License v3.0

Unicode Decode Error: #18

Open GgVvTt opened 13 years ago

GgVvTt commented 13 years ago

When I tried: python3.2 html2text.py "http://salonkritik.net/"

I got an error:

    Traceback (most recent call last):
      File "html2text.py", line 478, in <module>
        data = text.decode(encoding)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xed in position 2624: invalid continuation byte

The website is in Spanish and its encoding is supposedly iso-8859-1, which may be the cause of the issue.

I added 'ignore' to line 478 to ignore errors, and it seems to work. But I'm still a newbie at Python, and I'm not sure what chaos (if any) might be wrought by ignoring encoding errors.

    if encoding == 'us-ascii':
        encoding = 'utf-8'
    data = text.decode(encoding, 'ignore')
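
For what it's worth, here is a minimal sketch (mine, not code from html2text) of what the error handlers do with exactly this kind of byte: 0xed is 'í' in iso-8859-1, but in UTF-8 it starts a three-byte sequence, so a following ASCII byte raises the "invalid continuation byte" error from the traceback. 'ignore' simply drops such bytes, so the output silently loses characters instead of crashing:

    #!/usr/bin/python3
    data = 'crítica'.encode('iso-8859-1')    # b'cr\xedtica'

    print(data.decode('iso-8859-1'))         # 'crítica' -- the correct decoding
    print(data.decode('utf-8', 'ignore'))    # 'crtica'  -- the bad byte is dropped
    print(data.decode('utf-8', 'replace'))   # 'cr�tica' -- U+FFFD replacement character
    # data.decode('utf-8') raises UnicodeDecodeError, as in the report above.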
stefanor commented 12 years ago
$ curl -s http://salonkritik.net/ | isutf8
stdin: line 85, char 1, byte offset 30: invalid UTF-8 code
$ curl -s -I http://salonkritik.net/ | grep Content-Type
Content-Type: text/html
$ curl -s http://salonkritik.net/ | grep Content-Type
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

So the site doesn't claim to be UTF-8; we are just assuming it because feedparser thinks it's us-ascii. The bug here is in feedparser (or in html2text trusting feedparser to guess the content type). Passing "iso-8859-1" as the optional encoding parameter will easily work around this for this particular site.

leesei commented 11 years ago

It is a little bit more complicated:

In response to GgVvTt's question, as stefanor mentioned, you HAVE TO specify an encoding to get rid of the error:

    python html2text.py "http://salonkritik.net/" "iso-8859-1"
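
For library use, a hedged sketch of the same workaround: fetch the bytes yourself, decode them with the charset the page declares, and hand the resulting text to the converter (html2text.html2text() as the entry point is an assumption; adjust to your copy of the script):

    #!/usr/bin/python3
    import urllib.request
    import html2text  # aaronsw's html2text.py, importable from the current directory

    raw = urllib.request.urlopen('http://salonkritik.net/').read()
    # Decode explicitly with the declared charset instead of letting
    # a guessed 'utf-8' blow up on byte 0xed.
    print(html2text.html2text(raw.decode('iso-8859-1')))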

mcepl commented 10 years ago

There are some things I would note about this issue:

  1. feedparser._getCharacterEncoding is gone, and we should get rid of it (and we will have one dependency fewer, yay!). Using a private method of any package was never a bright idea anyway.
  2. The default character encoding for HTTP is latin-1, not us-ascii. In this, I am afraid, Aaron presented himself as a typical American.
  3. I don't know whether the stdlib has an equivalent of the get_char_encoding function shown below. If there is one, we should certainly use it. Otherwise, I believe my function could be a pretty reasonable resolution of the situation (by the way, if you run the script, you will find that utf-8 has mostly won already, so this issue is going to matter less and less).
  4. If checking the charset parameter is not enough, we can go all the way to http://www.w3.org/International/questions/qa-html-encoding-declarations (a sketch of that fallback follows the script below).
#!/usr/bin/python3

import urllib.request
import cgi
import logging
# switch to level=logging.DEBUG to see the header traces below
logging.basicConfig(format='%(levelname)s:%(funcName)s:%(message)s',
                    level=logging.INFO)

def get_char_encoding(in_url):
    # a HEAD request is enough; we only want the response headers
    req = urllib.request.Request(url=in_url, method='HEAD')
    f = urllib.request.urlopen(req)

    if f.status == 200:
        ct_header = f.getheader('Content-Type')
        logging.debug('raw Content-Type header: {}'.format(ct_header))
        if ct_header is not None:
            # cgi.parse_header splits 'text/html; charset=utf-8' into the
            # media type and a dict of parameters
            _, params = cgi.parse_header(ct_header)
            logging.debug('Content-Type parameters = {}'.format(params))

            if 'charset' in params:
                return params['charset'].lower()

    # the HTTP default (RFC 2616), see point 2 above
    return 'iso-8859-1'

for url in ['http://salonkritik.net/',
            'http://www.ihned.cz',
            'http://www.w3.org',
            'http://www.yandex.ru/',
            'https://www.microsoft.co.jp',
            'http://hk.qq.com/',
            'http://www.haaretz.co.il/',
            'https://th.wikipedia.org/',
            'http://www.maxboard.co.kr/']:
    print('{} has encoding {}'.format(url, get_char_encoding(url)))
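
And a hedged sketch of the fallback from point 4 above: when the Content-Type header carries no charset (as with salonkritik.net), scan the start of the document for the HTML encoding declarations the W3C page describes. The class name, the helper, and the 1024-byte window are my choices, not anything already in html2text:

    #!/usr/bin/python3
    from html.parser import HTMLParser

    class CharsetSniffer(HTMLParser):
        """Collects the first charset declaration found in <meta> tags."""
        def __init__(self):
            super().__init__()
            self.charset = None

        def handle_starttag(self, tag, attrs):
            if tag != 'meta' or self.charset:
                return
            attrs = dict(attrs)
            if 'charset' in attrs:
                # HTML5 style: <meta charset="utf-8">
                self.charset = attrs['charset'].lower()
            elif attrs.get('http-equiv', '').lower() == 'content-type':
                # HTML4 style: <meta http-equiv="Content-Type"
                #              content="text/html; charset=iso-8859-1" />
                content = attrs.get('content', '')
                if 'charset=' in content:
                    self.charset = (content.split('charset=')[1]
                                    .split(';')[0].strip().lower())

    def sniff_charset(html_bytes):
        # The declaration is supposed to appear within the first 1024
        # bytes, which are safe to inspect as ASCII-compatible text.
        sniffer = CharsetSniffer()
        sniffer.feed(html_bytes[:1024].decode('ascii', 'replace'))
        return sniffer.charset

    sample = (b'<head><meta http-equiv="Content-Type" '
              b'content="text/html; charset=iso-8859-1" /></head>')
    print(sniff_charset(sample))    # iso-8859-1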