Open ahtimsir opened 4 years ago
Thanks for the report.
Do you have a complete HTML document that renders correctly in a browser? Or are you expecting this to render as � (replacement character)?
It does render as the replacement character in the browser. However, I'm using html2text to extract content from third-party websites, and I do run into cases where the input contains a character like that.
-TC
On Jan 9, 2020, 5:44 PM -0800, Jon Dufresne notifications@github.com, wrote:
Thanks for the report. Do you have a complete HTML document that renders correctly in a browser? Or are you expecting this to render as � (replacement character)?
Note that this affects the command line interface: echo '��' | html2text
Browsers immediately parse �� into two replacement characters (source); we should do the same.
If the number is a surrogate, then this is a surrogate-character-reference parse error. Set the character reference code to 0xFFFD.
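As a rough illustration of the rule quoted above, here is a minimal Python sketch. The helper name `charref_to_text` is made up for this example and is not part of html2text. It covers only the surrogate and out-of-range cases, not the rest of the spec's numeric character reference handling (for example the null and C1 control mappings). Python's standard `html.unescape` is included for comparison.

```python
import html

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def charref_to_text(code: int) -> str:
    """Hypothetical helper (not html2text API): text for a numeric character reference."""
    # Surrogates (U+D800..U+DFFF) and numbers above U+10FFFF are parse errors;
    # the spec sets the character reference code to 0xFFFD in both cases.
    if 0xD800 <= code <= 0xDFFF or code > 0x10FFFF:
        return REPLACEMENT
    return chr(code)


# Each half of a surrogate pair becomes its own replacement character.
pair = charref_to_text(0xD83D) + charref_to_text(0xDE00)
print(pair == REPLACEMENT * 2)  # True

# The standard library handles surrogate references the same way.
print(html.unescape("&#xD83D;&#xDE00;") == REPLACEMENT * 2)  # True
```

A plain `chr(code)` on a surrogate value would instead produce a lone surrogate in the string, which cannot be encoded as UTF-8 and so fails when printed or written out.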
html2text --version: 2019.9.26
python --version: 3.7.3
Somewhat related to https://github.com/Alir3z4/html2text/issues/288 ...
When the HTML uses UTF-16 encoding for emoji, the result from handle() is not printable: