When the code parses HTML like:
June, 2016
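the program throws an IndexError exception.
I traced this bug to the implementation of handle_charref, which passes the result of charref to handle_data.
handle_data may index the first element of its data argument, but the lrm and rlm characters are defined as '' (empty).
So when handle_data receives the lrm or rlm character data, data[0] in the following condition indexes an empty string and raises the exception: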
+++++++++++++++++++++++++++++++++++
elif (self.preceding_stressed
and re.match(r'[^\s.!?]', data[0])
and not hn(self.current_tag)
and self.current_tag not in ['a', 'code', 'pre']):
+++++++++++++++++++++++++++++++++++
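The failure mode can be reproduced outside html2text. The sketch below is only an illustration, not the library's code: the helper names `starts_with_word_char_unsafe` and `starts_with_word_char_safe` are hypothetical, and only the regular expression is taken from the condition above. It shows that indexing an empty string raises IndexError, and that one possible guard is to pass the whole string to re.match, which anchors at the start and returns None for ''.

```python
import re

# Same character class as in the condition quoted above.
PATTERN = re.compile(r'[^\s.!?]')


def starts_with_word_char_unsafe(data):
    # Mirrors the reported check: data[0] raises IndexError when data == ''.
    return bool(PATTERN.match(data[0]))


def starts_with_word_char_safe(data):
    # re.match only matches at the start of the string, so passing the whole
    # string tests the same first character and yields None for ''.
    return bool(PATTERN.match(data))


try:
    starts_with_word_char_unsafe('')
except IndexError as exc:
    print('unsafe check failed:', exc)

print(starts_with_word_char_safe(''))       # False
print(starts_with_word_char_safe('June'))   # True
print(starts_with_word_char_safe(' June'))  # False
```

An equivalent guard would be to test `data` for truthiness before indexing it; which form fits best depends on how the surrounding condition is structured.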
This is the traceback:
Traceback (most recent call last):
File "get_email.py", line 37, in
text = h.handle(mail_content_string) # convert HTML format to Markdown format
File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/init.py", line 149, in handle
self.feed(data)
File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/init.py", line 146, in feed
HTMLParser.HTMLParser.feed(self, data)
File "/usr/lib64/python3.4/html/parser.py", line 165, in feed
self.goahead(0)
File "/usr/lib64/python3.4/html/parser.py", line 268, in goahead
self.handle_charref(name)
File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/init.py", line 186, in handle_charref
self.handle_data(self.charref(c), True)
File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/init.py", line 802, in handle_data
and re.match(r'[^\s.!?]', data[0])
IndexError: string index out of range