Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

utf-8 encoding error in output text #288

Closed tommilligan closed 4 years ago

tommilligan commented 5 years ago

In the currently released version, the output text is manually encoded with .encode("utf-8") before being written to stdout. This occasionally raises UTF-8 errors such as:

Traceback (most recent call last):
  File "/env/bin/html2text", line 10, in <module>
    sys.exit(main())
  File "/env/lib/python3.6/site-packages/html2text/cli.py", line 306, in main
    wrapwrite(h.handle(data))
  File "/env/lib/python3.6/site-packages/html2text/utils.py", line 191, in wrapwrite
    text = text.encode("utf-8")
UnicodeEncodeError: 'utf-8' codec can't encode character '\udda5' in position 43659: surrogates not allowed

This was fixed by b361467 but is not yet released. Any idea of when a new version might be cut?
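
For readers who want to see the failure mode in isolation, here is a minimal sketch of the error class shown in the traceback. It does not call html2text at all and is not a reproduction of the original input, which is not given in this thread. The sample string and the `try_encode` helper are assumptions made up for illustration; only the code point `\udda5` and the error message come from the traceback above. The two alternative error handlers at the end are standard `str.encode` options, not necessarily what commit b361467 does.

```python
# A string containing a lone low surrogate, the same code point that
# appears in the traceback. The surrounding text is made up.
text = "caf\u00e9 \udda5"


def try_encode(s: str) -> str:
    """Hypothetical helper: return "ok" or the reason strict UTF-8 encoding failed."""
    try:
        s.encode("utf-8")
        return "ok"
    except UnicodeEncodeError as exc:
        return exc.reason


reason = try_encode(text)
print(reason)

# Two standard ways to encode without raising. Which one is appropriate
# depends on what the consumer of the bytes expects.
replaced = text.encode("utf-8", errors="replace")       # surrogate becomes b"?"
passed = text.encode("utf-8", errors="surrogatepass")   # surrogate encoded as-is
print(replaced)
print(passed)
```

Strict UTF-8 rejects unpaired surrogates because they are not valid Unicode scalar values, so any code path that calls `.encode("utf-8")` on arbitrary converted text can hit this error.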

tommilligan commented 4 years ago

@jdufresne ping

jdufresne commented 4 years ago

Do you have an example command or script to demonstrate this bug?

> Any idea of when a new version might be cut?

I don't have permission to do releases, only merge PRs. Releases are handled by @Alir3z4.

tommilligan commented 4 years ago

Thanks anyway - I'll wait on a new release then.

I don't have a minimal example, but I can confirm it's fixed in master anyway, so this issue can be closed after cutting a new release.

Alir3z4 commented 4 years ago

> I don't have permission to do releases, only merge PRs. Releases are handled by @Alir3z4.

If you could ping me, I'd be happy to set the permission on PyPI ;)

jdufresne commented 4 years ago

@Alir3z4 That'd be great. My PyPI username is jdufresne. Thanks.

Alir3z4 commented 4 years ago

> @Alir3z4 That'd be great. My PyPI username is jdufresne. Thanks.

Awesome, done ;)

jdufresne commented 4 years ago

I have released a new version with this fix included. If you continue to experience issues let us know. Thanks for the report!