UnicodeEncodeError "surrogates not allowed" raised when converting utf-16-encoded emoji

ahtimsir commented 4 years ago

html2text version (html2text --version): 2019.9.26
Python version (python --version): 3.7.3
Test script:

Somewhat related to https://github.com/Alir3z4/html2text/issues/288 ...

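When the HTML encodes an emoji as a pair of numeric character references to UTF-16 surrogate code points, handle() returns a string that still contains those surrogates. Such a string cannot be encoded as UTF-8, so it is not printable: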

>>> import html2text
>>> plead_face_html = '&#55358;&#56698;'
>>> html = html2text.HTML2Text()
>>> html.handle(plead_face_html)
'\ud83e\udd7a\n\n'
>>> print(html.handle(plead_face_html))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
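
A possible workaround on the caller's side, as a minimal sketch: post-process the string returned by handle() so that surrogate pairs are merged into the characters they encode. The helper name merge_surrogates is hypothetical and not part of html2text; it relies only on the standard "surrogatepass" and "replace" codec error handlers.

```python
def merge_surrogates(text: str) -> str:
    """Combine UTF-16 surrogate pairs found in a str into real characters.

    Hypothetical helper, not part of html2text.
    """
    # "surrogatepass" lets the encoder write surrogate code points
    # as raw UTF-16 code units instead of raising UnicodeEncodeError.
    raw = text.encode("utf-16-le", errors="surrogatepass")
    # Decoding joins each high/low pair into one code point; any
    # unpaired surrogate is replaced with U+FFFD.
    return raw.decode("utf-16-le", errors="replace")


broken = "\ud83e\udd7a\n\n"  # the value handle() returned in the report
fixed = merge_surrogates(broken)
print(fixed.encode("utf-8"))  # encoding no longer raises
```

Whether merging the pair is the right behavior is a judgment call, since it recovers the emoji the page author probably intended rather than what a browser displays.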

jdufresne commented 4 years ago

Thanks for the report.

Do you have a complete HTML document that renders correctly in a browser? Or are you expecting this to render as � (replacement character)?

ahtimsir commented 4 years ago

It does render as the replacement character in the browser. However, I'm using html2text to extract content from third-party websites, and I do run into pages that contain characters like this.

-TC
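
For comparison, the browser behavior described above can be reproduced with the standard library: html.unescape follows the HTML5 rules and turns a numeric reference to a surrogate code point into U+FFFD. The sketch below also shows one way to get the same result from html2text output after the fact. The helper name replace_surrogates is hypothetical, and this is an assumption about what a fix might look like, not html2text's actual behavior.

```python
import html
import re

# html.unescape maps each surrogate reference to U+FFFD,
# which matches what a browser renders for this input.
print(ascii(html.unescape("&#55358;&#56698;")))

# Matches any surrogate code point left in a str.
_SURROGATES = re.compile("[\ud800-\udfff]")


def replace_surrogates(text: str) -> str:
    """Replace every surrogate with U+FFFD (hypothetical helper)."""
    return _SURROGATES.sub("\ufffd", text)


# Applied to the value handle() returned in the report.
printable = replace_surrogates("\ud83e\udd7a\n\n")
print(printable.encode("utf-8"))  # encoding no longer raises
```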


Alir3z4 / html2text

UnicodeEncodeError "surrogates not allowed" raised when converting utf-16-encoded emoji #310