aaronsw / html2text

Convert HTML to Markdown-formatted text.
http://www.aaronsw.com/2002/html2text/
GNU General Public License v3.0

Extra '\' backslash appears before '-' and '.' #111

Open Jerry-Ku opened 6 years ago

Jerry-Ku commented 6 years ago

An extra backslash is added at the start of the output when the input contains two or more consecutive '-' characters, e.g.:

echo '<p>-</p>' | html2text -> '-'
echo '<p>--</p>' | html2text -> '\--'

Also, if the input string has the form '[0-9].[space]', the output will be '[0-9]\.[space]', e.g.:

echo '<p>.</p>' | html2text -> '.'
echo '<p>..</p>' | html2text -> '..'
echo '<p>2.</p>' | html2text -> '2.'
echo '<p>2. </p>' | html2text -> '2\. '
echo '<p>a. </p>' | html2text -> 'a. '

bubalopetar commented 2 years ago

The issue is in the package file utils.py (Python37\Lib\site-packages\html2text\utils.py), at lines 210, 211, and 212. Here are versions of those lines that work:

text = config.RE_MD_DOT_MATCHER.sub(r"\1\2", text)
text = config.RE_MD_PLUS_MATCHER.sub(r"\1\2", text)
text = config.RE_MD_DASH_MATCHER.sub(r"\1\2", text)

These lines originally have two extra backslashes, so replacing just these three lines should fix the issue. I'm not sure whether it could break something else.
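To make the effect of the two replacement strings concrete, here is a minimal standalone sketch. It does not import html2text. The two patterns below are stand-ins written to match the behavior reported in this issue, not copies of the library's config.RE_MD_DOT_MATCHER and config.RE_MD_DASH_MATCHER. The escaping replacement r"\1\\\2" is an assumption about what the original lines contained, namely the quoted fix with two backslashes restored.

```python
import re

# Stand-in patterns (assumptions, not taken from html2text's config module).
# A number at the start of a line, then a dot that is followed by whitespace.
DOT_MATCHER = re.compile(r"^(\s*\d+)(\.)(?=\s)", re.MULTILINE)
# A dash at the start of a line that is followed by whitespace or another dash.
DASH_MATCHER = re.compile(r"^(\s*)(-)(?=\s|-)", re.MULTILINE)

# Assumed original replacement: group 1, a literal backslash, group 2.
ESCAPING = r"\1\\\2"
# Replacement from the comment above: group 1 then group 2, nothing inserted.
PLAIN = r"\1\2"


def rewrite(text, replacement):
    """Apply both stand-in substitutions with the given replacement template."""
    text = DOT_MATCHER.sub(replacement, text)
    text = DASH_MATCHER.sub(replacement, text)
    return text


if __name__ == "__main__":
    for sample in ["-", "--", "2.", "2. ", "a. "]:
        print(repr(sample), repr(rewrite(sample, ESCAPING)), repr(rewrite(sample, PLAIN)))
```

With the plain replacement the substitutions leave the text unchanged, which is why the extra backslashes disappear. The likely trade-off is that a line such as "2. foo" or "- foo" would then be emitted unescaped, and a Markdown renderer may treat it as a list item rather than literal text, which could be the kind of breakage the comment above is unsure about.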