buriy / python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!
https://github.com/buriy/python-readability
Apache License 2.0
2.67k stars, 349 forks

Issue with utf8 and HTML entities #175

Closed. uuencode closed this issue 1 year ago

uuencode commented 1 year ago

I am using the latest version from pip, readability-lxml 0.8.1, and I found a curious issue. When the input contains both non-ASCII UTF-8 characters and HTML entities, the output is not properly UTF-8 encoded.

<!DOCTYPE html>
<html><head>
<title>title</title>
<meta charset="utf-8" />
</head><body>
This is déjà vu &hellip;
</body></html>

doc.summary() returns This is déjà vu …


Because &hellip; has already been converted to a Unicode character by that point, the workaround doc.summary().encode('raw_unicode_escape').decode('utf-8') returns This is déjà vu \u2026


It is very common for a page to contain both non-ASCII UTF-8 characters and HTML entities. As the output is HTML anyway, leaving entities unprocessed could be a solution.

Thank you.

uuencode commented 1 year ago

I found another issue mentioning that Document(response.content) should be used instead of Document(response.text), and that fixed it.

#163

It would be a good idea to update the readme.
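
For anyone landing here later, this is a rough standard-library sketch of what seems to be going on. It assumes the text handed to Document had been decoded as ISO-8859-1 rather than UTF-8 (which response.text can do when the server's Content-Type header names no charset), and it uses html.unescape purely as a stand-in for the parser resolving entities; readability itself relies on lxml, so the details there may differ.

```python
import html

# The page body as UTF-8 bytes, roughly what response.content would hold.
raw = "This is déjà vu &hellip;".encode("utf-8")

# Assumption: the HTTP client guessed ISO-8859-1 when building the text,
# so every UTF-8 byte became its own character (mojibake).
wrong_text = raw.decode("iso-8859-1")

# Stand-in for the parser resolving entities: &hellip; becomes U+2026.
parsed = html.unescape(wrong_text)

# The workaround from the report: it restores the mis-decoded bytes, but
# U+2026 has no single-byte form and is written out as a literal escape.
workaround = parsed.encode("raw_unicode_escape").decode("utf-8")

# Decoding the original bytes with the right charset avoids the problem.
correct = html.unescape(raw.decode("utf-8"))

print(repr(parsed))
print(repr(workaround))
print(repr(correct))
```

Passing the raw bytes (response.content) leaves charset detection to the parser, which can honour the meta charset tag in the page, so the mis-decoding step never happens.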

buriy commented 1 year ago

Thanks! Updated readme!