Not catching HTMLParser.HTMLParseError from unclosed tag

turbodog commented 13 years ago

>>> import html2text
>>> html2text.__version__
'3.02'
>>> html2text.html2text('<p')
Traceback (most recent call last):
  File "", line 1, in <module>
  File "html2text.py", line 450, in html2text
    return optwrap(html2text_file(html, None, baseurl))
  File "html2text.py", line 447, in html2text_file
    return h.close()
  File "html2text.py", line 185, in close
    HTMLParser.HTMLParser.close(self)
  File "C:\dev\python27\lib\HTMLParser.py", line 112, in close
    self.goahead(1)
  File "C:\dev\python27\lib\HTMLParser.py", line 164, in goahead
    self.error("EOF in middle of construct")
  File "C:\dev\python27\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: EOF in middle of construct, at line 1, column 1

Is it possible to append a '>' or drop the whole tag and retry without passing the exception up?

FWIW, first parsing the HTML with BeautifulSoup eliminates the unclosed tag, and html2text then succeeds.
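
For illustration, here is a minimal sketch of the "drop the whole tag and retry" idea asked about above. It is not part of html2text: the names `drop_unclosed_trailing_tag` and `convert_with_retry` are hypothetical, `convert` stands in for a callable such as `html2text.html2text`, and `parse_error` stands in for the exception class to catch (`HTMLParser.HTMLParseError` on the Python 2 setup in the traceback). The repair is deliberately naive and only looks at the last `<` in the input.

```python
def drop_unclosed_trailing_tag(html):
    """Remove a trailing '<...' fragment that never reaches a closing '>'.

    Naive by design: a literal '<' in trailing text (e.g. 'a < b') is
    treated the same way as an unfinished tag.
    """
    last_open = html.rfind("<")
    if last_open != -1 and ">" not in html[last_open:]:
        return html[:last_open]
    return html


def convert_with_retry(html, convert, parse_error=Exception):
    """Call convert(html); on a parse error, repair the input once and retry.

    `convert` and `parse_error` are assumptions supplied by the caller,
    for example html2text.html2text and HTMLParser.HTMLParseError.
    """
    try:
        return convert(html)
    except parse_error:
        repaired = drop_unclosed_trailing_tag(html)
        if repaired == html:
            # Nothing was removed, so a retry would fail the same way.
            raise
        return convert(repaired)
```

Because the repair only runs after the first attempt fails, well-formed input is passed through untouched; only input that already raised loses its dangling fragment.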

jlward commented 11 years ago

I saw this as well. This issue is fixed in Python 2.7 because of an update to HTMLParser. You should be able to backport the 2.7 version to fix it.

jlward commented 11 years ago

I created a backport and put it on PyPI: https://pypi.python.org/pypi/HTMLParser

aaronsw / html2text

Not catching HTMLParser.HTMLParseError from unclosed tag #10