I am using the latest version on pip, readability-lxml 0.8.1, and I found a curious issue. When the input contains both non-ASCII UTF-8 characters and HTML entities, the output is not properly UTF-8 encoded.
<!DOCTYPE html>
<html><head>
<title>title</title>
<meta charset="utf-8" />
</head><body>
This is déjà vu …
</body></html>
doc.summary() returns: This is déjà vu …
(the HTML entity for … having been converted to the literal character beforehand)
doc.summary().encode('raw_unicode_escape').decode('utf-8') returns: This is déjà vu \u2026
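To make the round trip above easier to follow, here is a minimal standard-library sketch that does not call readability-lxml at all. It assumes (this is a guess about the cause, not something confirmed here) that the summary string contains UTF-8 bytes that were decoded as Latin-1, next to a genuine U+2026 produced from the entity. The names mojibake, round_trip and repair_latin1_runs are hypothetical and exist only for this illustration.

```python
import re

# Assumption: the summary holds UTF-8 bytes mis-decoded as Latin-1 ("deja vu"
# with accents, written with escapes here), plus a real U+2026 that came from
# the already-converted entity.
mojibake = "This is d\u00e9j\u00e0 vu".encode("utf-8").decode("latin-1") + " \u2026"

# The round trip from the report: code points below U+0100 are written as
# single bytes (which then decode as UTF-8), while anything higher is written
# as a literal backslash-u escape sequence.
round_trip = mojibake.encode("raw_unicode_escape").decode("utf-8")


def repair_latin1_runs(text: str) -> str:
    """Hypothetical helper: re-decode only runs of U+0080..U+00FF as UTF-8."""
    return re.sub(
        r"[\x80-\xff]+",
        lambda m: m.group().encode("latin-1").decode("utf-8"),
        text,
    )


print(repr(round_trip))
print(repr(repair_latin1_runs(mojibake)))
```

Under that assumption, the first line printed ends in a literal backslash followed by u2026, as in the report, while the helper leaves the ellipsis character intact. The helper raises UnicodeDecodeError if a run is not valid UTF-8, so it is only a workaround sketch and not a fix for the library.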
It is very common to have both non-ASCII UTF-8 characters and HTML entities in the same document. As the output is HTML anyway, leaving entities unprocessed could be a solution.
Thank you.