Open LudovicRousseau opened 1 year ago
This is most likely a bug in lxml; please report it to the lxml project.
If I use the program ./nikola/bin/rst2html.py (on macOS) to convert my .rst post, I have no problem.
I get:
[...]
<div class="document">
<!-- title: Unicode -->
<!-- slug: unicode -->
<!-- date: 2023-05-08 18:48:37 UTC+02:00 -->
<!-- tags: -->
<!-- category: -->
<!-- link: -->
<!-- description: -->
<!-- type: text -->
<p>😺</p>
</div>
</body>
</html>
I have no idea how lxml is used by Nikola. Can you provide sample code using lxml that should fail, so I can report the issue to lxml?
Sure, here’s some sample code:
import lxml.html
html = """<!DOCTYPE html>
<head><meta charset="utf-8"></head>
<body>
<h1>Hello, world!</h1>
<div>
<p>\U0001f63a</p>
</div>
</body>
</html>"""
parser = lxml.html.HTMLParser(remove_blank_text=True)
doc = lxml.html.document_fromstring(html, parser)
data = lxml.html.tostring(doc, encoding='utf8', method='html', pretty_print=True, doctype='<!DOCTYPE html>')
print(data)
Can you reproduce the issue using this code on macOS? For reference, I get the following output on Windows and Linux:
b'<!DOCTYPE html>\n<html>\n<head><meta charset="utf-8"></head>\n<body>\n<h1>Hello, world!</h1>\n<div>\n<p>\xf0\x9f\x98\xba</p>\n</div>\n</body>\n</html>\n'
Bingo! On macOS I get:
>>> print(data)
b'<!DOCTYPE html>\n<html><body><p>! D O C T Y P E h t m l > \n </p></body></html>\n'
I reported the lxml issue at https://bugs.launchpad.net/lxml/+bug/2019038
Thanks.
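To compare machines without reading the raw bytes by eye, the sample code above can be turned into a small self-checking script. This is only a sketch: the names `CAT`, `HTML`, `survives`, and `roundtrip` are made up for illustration, and the second run (feeding UTF-8 bytes instead of a str) is a guess at something worth trying, not a workaround confirmed anywhere in this thread.

```python
"""Diagnostic sketch: does lxml round-trip a non-BMP character here?"""

CAT = "\U0001f63a"  # the character from the report
HTML = (
    "<!DOCTYPE html>\n"
    '<html><head><meta charset="utf-8"></head>\n'
    f"<body><p>{CAT}</p></body></html>"
)


def survives(serialized: bytes) -> bool:
    """Return True if the UTF-8 bytes of the emoji are present in the output."""
    return CAT.encode("utf-8") in serialized


def roundtrip(source):
    """Parse source (str or bytes) with lxml and serialize it back to UTF-8.

    Returns None when lxml is not installed.
    """
    try:
        import lxml.html
    except ImportError:
        return None
    # For bytes input, name the encoding explicitly; for str input keep the default.
    encoding = "utf-8" if isinstance(source, bytes) else None
    parser = lxml.html.HTMLParser(remove_blank_text=True, encoding=encoding)
    doc = lxml.html.document_fromstring(source, parser)
    return lxml.html.tostring(doc, encoding="utf8", method="html")


if __name__ == "__main__":
    inputs = (("str input", HTML), ("UTF-8 bytes input", HTML.encode("utf-8")))
    for label, source in inputs:
        out = roundtrip(source)
        if out is None:
            print(f"{label}: lxml is not installed")
        else:
            status = "OK" if survives(out) else "BROKEN"
            print(f"{label}: {status}")
```

A "BROKEN" line for the str input would match the macOS output shown above; the Windows and Linux output shown above would count as "OK".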
Environment
Python Version: Python 3.11.3 (main, Apr 7 2023, 19:29:16) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin, installed from Homebrew
Nikola Version: Nikola 8.2.4
Operating System: macOS Monterey 12.6.5
Description:
If I use a Unicode character like a smiley or 😺 in a .rst page then the generated html is bogus.
Source unicode.rst page:
The generated html page contains:
And in the browser I see: "h t m l > " for the content of the post.
I have no problem with other Unicode characters, such as the accented letter "è".
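One difference between the two characters that may be relevant: "è" lies inside the Basic Multilingual Plane, while 😺 lies outside it and needs more bytes in every common encoding. The sketch below only prints that difference using the standard library. Whether it is what actually triggers the bug is an assumption, and `describe` is a hypothetical helper name.

```python
def describe(ch: str) -> dict:
    """Summarize how a single character is encoded."""
    code_point = ord(ch)
    return {
        "code_point": f"U+{code_point:04X}",
        # Characters above U+FFFF are outside the Basic Multilingual Plane.
        "in_bmp": code_point <= 0xFFFF,
        "utf8_bytes": len(ch.encode("utf-8")),
        # Each UTF-16 code unit is 2 bytes; non-BMP characters need a surrogate pair.
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
    }


for ch in ("\u00e8", "\U0001f63a"):  # "è" and the cat emoji from the report
    print(ch, describe(ch))
```

"è" is U+00E8 (2 bytes in UTF-8, one UTF-16 unit), while 😺 is U+1F63A (4 bytes in UTF-8, two UTF-16 units).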
Debian is OK
I then tried the same steps on Debian GNU/Linux 12 (the next Debian stable) and had no problem. On Debian I use:
In both cases I use a venv.
Debug
I tried to debug, but I am new to Nikola. I tried nikola rst2html.
On macOS I get:
Here again the result is correct if run on Debian.
Maybe the problem is in a dependency used by Nikola.