aaronsw / html2text

Convert HTML to Markdown-formatted text.
http://www.aaronsw.com/2002/html2text/
GNU General Public License v3.0

ValueError: need more than 1 value to unpack #25

Closed eadmaster closed 12 years ago

eadmaster commented 12 years ago
> python html2text.py http://en.wikipedia.org/wiki/Python
Traceback (most recent call last):
  File "C:\SharedPrograms\html2text\html2text.py", line 758, in <module>
    wrapwrite(html2text(data, baseurl))
  File "C:\SharedPrograms\html2text\html2text.py", line 691, in html2text
    return optwrap(html2text_file(html, None, baseurl))
  File "C:\SharedPrograms\html2text\html2text.py", line 686, in html2text_file
    h.feed(html)
  File "C:\PortableApps\Python\Python26\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\PortableApps\Python\Python26\lib\HTMLParser.py", line 142, in goahead
    if i < j: self.handle_data(rawdata[i:j])
  File "C:\SharedPrograms\html2text\html2text.py", line 671, in handle_data
    self.style_def.update(dumb_css_parser(data))
  File "C:\SharedPrograms\html2text\html2text.py", line 177, in dumb_css_parser
    elements = dict([(a.strip(), dumb_property_dict(b)) for a, b in elements])
  File "C:\SharedPrograms\html2text\html2text.py", line 165, in dumb_property_dict
    return dict([(x.strip(), y.strip()) for x, y in [z.split(':') for z in style.split(';')]]);
ValueError: need more than 1 value to unpack

pazz commented 12 years ago

I'll second that. It happens on Ubuntu 11.04 with Python 2.7. The version linked at http://www.aaronsw.com/2002/html2text/ doesn't break on the same input, but it has the same version string as current master.
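
For anyone reading the traceback: the last frame splits each `;`-separated chunk of a style string on `:` and unpacks the result into exactly two names, so any chunk without a `:` (for example the empty chunk left by a trailing `;`) raises the `ValueError`. Below is a minimal sketch of that failure and one tolerant alternative. `naive_property_dict` is modeled on the line shown in the traceback; `tolerant_property_dict` is a hypothetical name and is not necessarily how the project fixed it. The sketch is written for Python 3, where the error message is worded differently but the exception type is the same.

```python
def naive_property_dict(style):
    # Modeled on the one-liner in the traceback: every ';'-separated chunk
    # must split on ':' into exactly two parts, or the unpacking fails.
    return dict((x.strip(), y.strip())
                for x, y in [z.split(':') for z in style.split(';')])


def tolerant_property_dict(style):
    """Hypothetical variant: ignore chunks without ':' and split only once."""
    props = {}
    for chunk in style.split(';'):
        if ':' not in chunk:
            # Empty chunk after a trailing ';', or other junk: skip it.
            continue
        # maxsplit=1 keeps any further ':' (e.g. inside a URL) in the value.
        name, value = chunk.split(':', 1)
        props[name.strip()] = value.strip()
    return props


if __name__ == "__main__":
    print(tolerant_property_dict("color: red;"))
    try:
        naive_property_dict("color: red;")
    except ValueError as exc:
        print("naive parser failed:", exc)
```

Skipping malformed chunks seems reasonable here because the styles are only used as formatting hints, but that is a judgment call rather than something stated in this thread.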

stefanor commented 12 years ago

Two things: If I run the command you suggested, I get ERR_ACCESS_DENIED from the wikimedia squids. If I save that page with curl, and run the Debian/Ubuntu html2text (the binary is called html2markdown), it works fine.

The current git HEAD fails, though:

$ python html2text.py /tmp/python.html utf-8
Traceback (most recent call last):
  File "html2text.py", line 758, in <module>
    wrapwrite(html2text(data, baseurl))
  File "html2text.py", line 691, in html2text
    return optwrap(html2text_file(html, None, baseurl))
  File "html2text.py", line 686, in html2text_file
    h.feed(html)
  File "/usr/lib/python2.7/HTMLParser.py", line 109, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 145, in goahead
    if i < j: self.handle_data(rawdata[i:j])
  File "html2text.py", line 671, in handle_data
    self.style_def.update(dumb_css_parser(data))
  File "html2text.py", line 177, in dumb_css_parser
    elements = dict([(a.strip(), dumb_property_dict(b)) for a, b in elements])
ValueError: need more than 1 value to unpack

aaronsw commented 12 years ago

Fixed in d0ac3cf. Thanks.

stefanor commented 12 years ago

Aaron: FWIW The issue in my comment is still present.

aaronsw commented 12 years ago

@stefanor: Fixed in 075f9b2.

stefanor commented 12 years ago

thanks :)

eadmaster commented 12 years ago

Any workaround for the ERR_ACCESS_DENIED error (besides downloading the page with curl)? I guess it may happen with other pages too...

stefanor commented 12 years ago

I'd assume they don't like the default user agent. It could use a user agent of 'html2text' or something...
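
A minimal sketch of that suggestion, assuming the denial really is caused by the default user agent (nobody in this thread confirmed it). It uses Python 3's `urllib.request` rather than the Python 2 modules the script ran on at the time. `build_request` and `fetch` are hypothetical helper names, not part of html2text.

```python
import urllib.request


def build_request(url, user_agent="html2text"):
    # Attach an explicit User-Agent header instead of the library default.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent="html2text", timeout=30):
    # Download the page body as bytes; decoding is left to the caller.
    with urllib.request.urlopen(build_request(url, user_agent),
                                timeout=timeout) as resp:
        return resp.read()


if __name__ == "__main__":
    html = fetch("http://en.wikipedia.org/wiki/Python")
    print(len(html), "bytes fetched")
```

If the server still refuses the request, saving the page with curl and passing the local file to the script, as described earlier in the thread, remains the fallback.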