Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

UnicodeEncodeError "surrogates not allowed" raised when converting utf-16-encoded emoji #310

Open ahtimsir opened 4 years ago

ahtimsir commented 4 years ago

Somewhat related to https://github.com/Alir3z4/html2text/issues/288 ...

When the HTML uses UTF-16 encoding for emoji, the result from handle() is not printable:

>>> import html2text
>>> plead_face_html = '&#55358;&#56698;'
>>> html = html2text.HTML2Text()
>>> html.handle(plead_face_html)
'\ud83e\udd7a\n\n'
>>> print(html.handle(plead_face_html))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
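
The traceback above comes from Python refusing to encode lone surrogate code points as UTF-8. As a caller-side workaround (not a fix in html2text itself), the returned string can be sanitized before printing. This is a minimal sketch; `strip_surrogates` is a hypothetical helper name, not part of html2text:

```python
import re

# Matches any UTF-16 surrogate code point left in a Python str.
_SURROGATES = re.compile("[\ud800-\udfff]")


def strip_surrogates(text: str) -> str:
    """Replace each surrogate code point with U+FFFD so the text can be encoded."""
    return _SURROGATES.sub("\ufffd", text)


raw = "\ud83e\udd7a\n\n"  # the value handle() returned in the report
safe = strip_surrogates(raw)
safe.encode("utf-8")  # no longer raises UnicodeEncodeError
```

Each surrogate becomes its own replacement character, which mirrors how a browser displays this input.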

jdufresne commented 4 years ago

Thanks for the report.

Do you have a complete HTML document that renders correctly in a browser? Or are you expecting this to render as � (replacement character)?

ahtimsir commented 4 years ago

It does render as the replacement character in the browser. However, I'm using html2text to process third-party website content, and I do run into cases like this where the input contains such a character.

-TC

andersk commented 1 month ago

Note that this affects the command line interface: echo '&#55358;&#56698;' | html2text.

Browsers immediately parse &#55358;&#56698; into two replacement characters (per the HTML Standard's rules for numeric character references); we should do the same.

If the number is a surrogate, then this is a surrogate-character-reference parse error. Set the character reference code to 0xFFFD.
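
The rule quoted above can be sketched as follows. This is an illustration only, not html2text's actual code: `charref_to_text` and `replace_numeric_refs` are hypothetical names, the regular expression only handles references that end in a semicolon, and the spec's other special cases (such as the C1 control remapping) are left out.

```python
import re

REPLACEMENT = "\ufffd"

# Decimal (&#55358;) or hexadecimal (&#x1F97A;) numeric character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def charref_to_text(code_point: int) -> str:
    """Map a numeric character reference to text (hypothetical helper)."""
    # Surrogates and values beyond the Unicode range both become U+FFFD.
    if 0xD800 <= code_point <= 0xDFFF or code_point > 0x10FFFF:
        return REPLACEMENT
    return chr(code_point)


def replace_numeric_refs(html: str) -> str:
    """Expand numeric character references found in a string."""

    def repl(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return charref_to_text(code_point)

    return _NUMERIC_REF.sub(repl, html)


result = replace_numeric_refs("&#55358;&#56698;")
result.encode("utf-8")  # encodes cleanly: two U+FFFD characters
```

Python's standard-library `html.unescape` applies the same surrogate rule, so `html.unescape("&#55358;&#56698;")` also yields two replacement characters.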