aaronsw / html2text

Convert HTML to Markdown-formatted text.
http://www.aaronsw.com/2002/html2text/
GNU General Public License v3.0

Improve "r_unescape" regular expression to skip invalid HTML entities #98

Open stkao05 opened 9 years ago

stkao05 commented 9 years ago

Some invalid HTML entities (e.g. &#a;) are still matched by the regular expression r_unescape, which results in an error when the matched text is passed to int().

Example scenario

import html2text

html = "<html><body><input name='opt in for&#a;todoist.com&#a;new site' /><p>hihi</p><body></html>"

plaintext = html2text.html2text(html)

Error traceback:

  File "todoist/scripts/test.py", line 16, in <module>
    plaintext = html2text.html2text(html)
  File "/home/vagrant/todoist/libs/ist_libs/python/html2text.py", line 812, in html2text
  File "/home/vagrant/todoist/libs/ist_libs/python/html2text.py", line 252, in handle
  File "/home/vagrant/todoist/libs/ist_libs/python/html2text.py", line 249, in feed
  File "/usr/lib/python2.7/HTMLParser.py", line 117, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 161, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 308, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/home/vagrant/todoist/libs/ist_libs/python/html2text.py", line 715, in unescape
  File "/home/vagrant/todoist/libs/ist_libs/python/html2text.py", line 710, in replaceEntities
  File "/home/vagrant/todoist/libs/ist_libs/python/html2text.py", line 685, in charref
ValueError: invalid literal for int() with base 10: 'a'
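
One possible direction, sketched below under assumptions: tighten the pattern so that a numeric reference matches only when its digits are valid for its base (decimal after "#", hexadecimal after "#x" or "#X"). The names STRICT_ENTITY and strict_unescape are hypothetical and are not part of html2text. The sketch targets Python 3 (chr, html.entities), whereas the traceback above comes from Python 2.7. It is meant to illustrate the idea, not to serve as a patch for the library.

```python
import re
from html.entities import name2codepoint

# Hypothetical stricter pattern; NOT the library's actual r_unescape.
# Alternatives: decimal reference, hexadecimal reference, named entity.
STRICT_ENTITY = re.compile(
    r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]{0,7});"
)


def strict_unescape(text):
    """Decode well-formed entities; leave anything else untouched."""

    def replace(match):
        name = match.group(1)
        try:
            if name[:2] in ("#x", "#X"):
                return chr(int(name[2:], 16))
            if name[0] == "#":
                return chr(int(name[1:]))
            if name in name2codepoint:
                return chr(name2codepoint[name])
        except (ValueError, OverflowError):
            # Code point outside the valid Unicode range.
            pass
        # Unknown name or out-of-range number: keep the original text.
        return match.group(0)

    return STRICT_ENTITY.sub(replace, text)
```

With this pattern, the string from the example scenario is not matched at all, so "&#a;" passes through unchanged instead of reaching int(), while references such as "&#65;" and "&#x41;" are still decoded.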