Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

'latin-1' codec can't encode character ... #314

Open glendeni opened 4 years ago

glendeni commented 4 years ago

Using the version downloaded today with Python 3.4.3, I get

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 59766: ordinal not in range(256)

from

https://www.co.monterey.ca.us/government/departments-i-z/resource-management-agency/public-works/road-conditions-closures

Adding --decode-errors=ignore gives the same result.
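One possible reason the flag makes no difference: the exception is a UnicodeEncodeError, which is raised when text is encoded for output, while a decode error handler only applies when input bytes are turned into text. The sketch below uses only the standard library and assumes (this is not confirmed anywhere in the thread) that the failing stream behaves like a stdout configured for latin-1. The sample string and the `write_latin1` helper are hypothetical and are not part of html2text.

```python
import io

# Hypothetical sample text containing U+2019 (right single quotation mark),
# the character named in the reported error.
text = "Monterey\u2019s road closures"

# Decoding step (bytes -> str): this is where a decode error handler applies.
raw = text.encode("utf-8")
decoded = raw.decode("utf-8", errors="ignore")  # valid UTF-8, nothing dropped


def write_latin1(s, errors="strict"):
    """Write `s` to an in-memory stream that encodes as latin-1."""
    buf = io.BytesIO()
    stream = io.TextIOWrapper(buf, encoding="latin-1", errors=errors)
    stream.write(s)
    stream.flush()
    return buf.getvalue()


# Encoding step (str -> bytes): U+2019 has no latin-1 representation,
# so a strict latin-1 stream raises regardless of how the input was decoded.
failed = False
message = ""
try:
    write_latin1(decoded)
except UnicodeEncodeError as exc:
    failed = True
    message = str(exc)

# With an encode-side error handler the write succeeds, at the cost of
# replacing the unencodable character with "?".
replaced = write_latin1(decoded, errors="replace")
```

If this reading is right, the fix would lie on the output side (the encoding of the terminal or pipe), not in how the page is decoded.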

jdufresne commented 4 years ago

Thanks for the report.

Python 3.4 is end of life and no longer supported by the project. However, I tested anyway. The following command works for me using the latest commit on master:

$ curl https://www.co.monterey.ca.us/government/departments-i-z/resource-management-agency/public-works/road-conditions-closures | python -m html2text 

Are you using the same command or something else? Can you retest using the master branch?

glendeni commented 4 years ago

Thanks for your reply. I'm using version 2020.1.16, which is what I got by running 'pip install html2text' yesterday, so I assume it is the latest version. Using the curl command with that html2text, as in your example, I still get the same error. [FWIW, I do not get that error for other webpages.]

If it works for you, the problem must be on my end, so you might as well move on. I'm not a Python user, so I don't have the knowledge to figure out what is wrong on my end. I have moved on to using the Debian html2text program instead (which, despite the same name, I assume has a different source).
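For anyone hitting the same error, a small standard-library check can show whether the local output encoding is able to represent the character at all. This is only a sketch under the assumption that the difference between the two machines is the stdout encoding; the thread never establishes that. The helper names `can_encode` and `utf8_writer` are hypothetical.

```python
import io
import locale
import sys


def can_encode(text, encoding):
    """Return True if `text` survives a strict encode to `encoding`."""
    try:
        text.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        return False


def utf8_writer(binary_stream):
    """Wrap a binary stream so text is always written as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", write_through=True)


if __name__ == "__main__":
    # Report what Python thinks the output encoding is on this machine.
    enc = sys.stdout.encoding or locale.getpreferredencoding(False)
    print("stdout encoding:", enc)
    print("can print U+2019:", can_encode("\u2019", enc))
```

If the check reports an encoding such as latin-1 or ASCII, setting the CPython environment variable PYTHONIOENCODING=utf-8 before running the command, or writing through something like `utf8_writer(sys.stdout.buffer)`, may avoid the error.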