Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

Semicolon in Text with &#. #369

Open radze90 opened 2 years ago

radze90 commented 2 years ago

html2text version: 2020.1.16
Python version: 3.9.5

import html2text

# Convert a paragraph containing a bare "&" that is not part of an HTML entity
test = html2text.HTML2Text()
text = test.handle("<p>Sample text K&N. Sample text.</p>")
print(text)

output: Sample text K&N.; Sample text.

Hi,

I noticed that the module inserts a semicolon into the text when converting certain strings, which I don't want. It doesn't matter which character comes after the &. Is this intentional? If so, is there a way to work around it, or is this a bug?
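
For anyone trying to narrow down where the semicolon comes from, here is a minimal diagnostic sketch using only the standard library. It assumes (not confirmed in this thread) that html2text builds on `html.parser.HTMLParser` with `convert_charrefs=False` and re-emits unrecognised entity names as `&name;`. The class `RefLogger` and the function `reported_entity_names` are hypothetical names for illustration.

```python
from html.parser import HTMLParser


class RefLogger(HTMLParser):
    """Records every named entity reference the stdlib parser reports."""

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_entityref
        # instead of silently decoding references inside text.
        super().__init__(convert_charrefs=False)
        self.names = []

    def handle_entityref(self, name):
        self.names.append(name)


def reported_entity_names(markup):
    """Return the entity names the parser reports for the given markup."""
    parser = RefLogger()
    parser.feed(markup)
    parser.close()
    return parser.names


# The sample from this report, plus a URL with a query string as plain text.
print(reported_entity_names("<p>Sample text K&N. Sample text.</p>"))
print(reported_entity_names("<p>index.php?r=a&code=b</p>"))
```

If the parser reports `N.` and `code` as entity names even though no `;` follows them, then any handler that writes unknown names back as `&` + name + `;` would produce exactly the `K&N.;` and `&code;=` seen here. That would point to a bug in how bare ampersands are handled rather than intended behaviour.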

MonkzCode commented 4 months ago

I confirm this issue.

My sample:

ZZZ
ZZ&Z
ZZ#Z
https://some.site.com/index.php?r=billMail/confirmNewBillMail&code=pYgJeYbpnSsaGdSRoKgfa9bd0fb4248dbb437c745afbb6d1b29tvPsONXEQApNxxswCSZ

Output after html2text:

ZZZ
ZZ&Z;
ZZ#Z
https://some.site.com/index.php?r=billMail/confirmNewBillMail&code;=pYgJeYbpnSsaGdSRoKgfa9bd0fb4248dbb437c745afbb6d1b29tvPsONXEQApNxxswCSZ

@Alir3z4 please fix this. We can't use html2text to parse URLs, since html2text adds a semicolon to the URL.
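
Until this is fixed upstream, one possible workaround is to escape bare ampersands before handing the markup to html2text. The sketch below is not an official fix: `escape_bare_ampersands` is a hypothetical helper, and it assumes that html2text decodes `&amp;` back to a plain `&`. Only `HTML2Text().handle()` is taken from the report above.

```python
import re
from html.entities import html5

# An "&" optionally followed by a complete character reference:
# decimal (&#38;), hexadecimal (&#x26;) or named (&amp;).
_AMP = re.compile(r"&(#[0-9]+;|#[xX][0-9a-fA-F]+;|[a-zA-Z][a-zA-Z0-9]*;)?")


def escape_bare_ampersands(markup: str) -> str:
    """Turn every '&' that does not start a known reference into '&amp;'."""

    def fix(match):
        ref = match.group(1)
        if ref is None:
            # Bare ampersand, e.g. "K&N." or "&code=".
            return "&amp;"
        if ref.startswith("#") or ref in html5:
            # Numeric reference or a known named entity: keep as written.
            return match.group(0)
        # Looks like an entity but is not a known name, e.g. "&bogus;".
        return "&amp;" + ref

    return _AMP.sub(fix, markup)


sample = "<p>Sample text K&N. Sample text.</p>"
escaped = escape_bare_ampersands(sample)
print(escaped)

try:
    import html2text  # third-party; only needed for the final conversion
except ImportError:
    html2text = None

if html2text is not None:
    print(html2text.HTML2Text().handle(escaped))
```

The helper is used instead of `html.escape` so that references already written correctly, such as `&amp;` or `&#38;`, are not escaped a second time. It is regex-based and does not parse the HTML, so it will also rewrite ampersands inside `<script>` or `<style>` content, which may not be wanted.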