aaronsw / html2text

Convert HTML to Markdown-formatted text.
http://www.aaronsw.com/2002/html2text/
GNU General Public License v3.0

lrm character and rlm character throw exception #119

Open Insutanto opened 5 years ago

Insutanto commented 5 years ago

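When the code parses HTML in which an LRM (left-to-right mark) character reference precedes text, for example directly before "June, 2016", the program throws an IndexError exception. I traced this bug to the implementation of handle_charref.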

In handle_data, the code indexes the first character of the data (data[0]), but the lrm and rlm characters are defined as '' (empty). So when handle_data receives the empty string produced for an lrm or rlm character, data[0] raises an IndexError in this condition:

+++++++++++++++++++++++++++++++++++
elif (self.preceding_stressed and re.match(r'[^\s.!?]', data[0]) and not hn(self.current_tag) and self.current_tag not in ['a', 'code', 'pre']):
+++++++++++++++++++++++++++++++++++

This is the traceback:

Traceback (most recent call last):
  File "get_email.py", line 37, in <module>
    text = h.handle(mail_content_string)  # convert HTML format to Markdown format
  File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/__init__.py", line 149, in handle
    self.feed(data)
  File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/__init__.py", line 146, in feed
    HTMLParser.HTMLParser.feed(self, data)
  File "/usr/lib64/python3.4/html/parser.py", line 165, in feed
    self.goahead(0)
  File "/usr/lib64/python3.4/html/parser.py", line 268, in goahead
    self.handle_charref(name)
  File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/__init__.py", line 186, in handle_charref
    self.handle_data(self.charref(c), True)
  File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/__init__.py", line 802, in handle_data
    and re.match(r'[^\s.!?]', data[0])
IndexError: string index out of range
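
The failure described above can be reproduced in isolation, without html2text. The sketch below is only an illustration: the function names `first_char_matches_unsafe` and `first_char_matches_safe` are hypothetical and are not part of html2text. It assumes, as reported in this issue, that the string passed to handle_data is empty for lrm and rlm. One possible guard is to slice (`data[:1]`) instead of indexing (`data[0]`), because slicing an empty string returns an empty string rather than raising.

```python
import re


def first_char_matches_unsafe(data):
    # Hypothetical helper mirroring the indexing pattern quoted in the issue:
    # data[0] raises IndexError when data is the empty string.
    return bool(re.match(r'[^\s.!?]', data[0]))


def first_char_matches_safe(data):
    # Hypothetical guard: ''[:1] is '', which simply fails to match,
    # so an empty string yields False instead of an exception.
    return bool(re.match(r'[^\s.!?]', data[:1]))


if __name__ == "__main__":
    try:
        first_char_matches_unsafe('')
    except IndexError as exc:
        print("unsafe version failed:", exc)
    print("safe version on empty string:", first_char_matches_safe(''))
    print("safe version on 'June, 2016':", first_char_matches_safe('June, 2016'))
```

Whether an empty string should be skipped earlier (for example by returning from handle_data when the data is empty) is a design decision for the maintainers; the slice-based check only shows that the exception itself is avoidable.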