aaronsw / html2text

Convert HTML to Markdown-formatted text.
http://www.aaronsw.com/2002/html2text/
GNU General Public License v3.0

Not catching HTMLParser.HTMLParseError from unclosed tag #10

Open turbodog opened 13 years ago

turbodog commented 13 years ago

>>> import html2text
>>> html2text.__version__
'3.02'
>>> html2text.html2text('<p')
Traceback (most recent call last):
  File "", line 1, in
  File "html2text.py", line 450, in html2text
    return optwrap(html2text_file(html, None, baseurl))
  File "html2text.py", line 447, in html2text_file
    return h.close()
  File "html2text.py", line 185, in close
    HTMLParser.HTMLParser.close(self)
  File "C:\dev\python27\lib\HTMLParser.py", line 112, in close
    self.goahead(1)
  File "C:\dev\python27\lib\HTMLParser.py", line 164, in goahead
    self.error("EOF in middle of construct")
  File "C:\dev\python27\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: EOF in middle of construct, at line 1, column 1

Is it possible to append a '>' or drop the whole tag and retry without passing the exception up?

FWIW, first parsing the HTML with BeautifulSoup eliminates the unclosed tag and html2text succeeds.
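
For illustration, the "append a '>' or drop the whole tag and retry" idea could be sketched as a caller-side wrapper. This is only a rough sketch, not anything html2text provides: convert_with_retry and drop_unclosed_tag are hypothetical helper names, and the converter and the exception types to catch are passed in by the caller.

```python
def drop_unclosed_tag(html):
    """Remove a trailing tag that was opened with '<' but never closed with '>'."""
    last_open = html.rfind('<')
    if last_open != -1 and '>' not in html[last_open:]:
        return html[:last_open]
    return html


def convert_with_retry(convert, html, errors=(Exception,)):
    """Call convert(html), repairing a dangling tag if the parser rejects it.

    `convert` is any callable taking an HTML string; `errors` is the tuple
    of exception types treated as parse failures (both are assumptions
    supplied by the caller, not part of html2text).
    """
    try:
        return convert(html)
    except errors:
        pass
    try:
        # First repair attempt: close the dangling tag.
        return convert(html + '>')
    except errors:
        # Second repair attempt: drop the dangling tag entirely.
        # If this also fails, the exception propagates to the caller.
        return convert(drop_unclosed_tag(html))
```

On Python 2 with the versions in the traceback above, this would presumably be called as convert_with_retry(html2text.html2text, '<p', errors=(HTMLParser.HTMLParseError,)). It only handles an unclosed tag at the very end of the input; other malformed markup would still raise.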

jlward commented 11 years ago

I saw this as well. This issue is fixed in Python 2.7 because of an update in HTMLParser. You should be able to backport the 2.7 version to fix this issue.

jlward commented 11 years ago

I created a backport and put it on PyPI: https://pypi.python.org/pypi/HTMLParser