Closed felixfontein closed 9 years ago
Minimal example to reproduce (with lxml directly):
import lxml.html
data = """<!DOCTYPE html>
<html>
<head>
<title>1</title>
</head>
<body>
<video>
<source src="1.ogv" type="video/ogg">
</video>
</body>
</html>"""
parser = lxml.html.HTMLParser()
doc = lxml.html.document_fromstring(data, parser)
data = b'<!DOCTYPE html>\n' + lxml.html.tostring(doc, method='html')
result = data.decode('utf-8')
assert result == """<!DOCTYPE html>
<html>
<head>
<title>1</title>
</head>
<body>
<video>
<source src="1.ogv" type="video/ogg">
</source></video>
</body>
</html>"""
tidy5 fixes this.
The source element is also used with the audio and picture elements.
The problem seems to be that the HTML tag list of libxml2, which is used by lxml, is stuck to HTML 4.
Looks like lxml devs are aware of this:
maybe we could ask for help on their mailing list?
@Kwpolska looks like @felixfontein already asked :-)
https://mailman-mail5.webfaction.com/pipermail/lxml/2015-May/007524.html
Found a way!
>>> import html5lib
>>> lxml_etree_document = html5lib.parse("""<!DOCTYPE html>,
... <html>
... <head>
... <title>1</title>
... </head>
... <body>
... <video>
... <source src="1.ogv" type="video/ogg">
... </video>
... </body>
... </html>""", treebuilder="lxml")
>>> import html5lib.serializer
>>> print html5lib.serializer.serialize(lxml_etree_document,tree="lxml")
<!DOCTYPE html>,
<title>1</title>
<video>
<source type=video/ogg src=1.ogv>
</video>
I am not modern enough to know if that is still valid html5 ;-)
A little more research and it does a nice job:
>>> print html5lib.serializer.serialize(doc, tree="lxml", format="html", omit_optional_tags=False)
<!DOCTYPE html><html><head>
<title>1</title>
</head>
<body>
<video>
<source type=video/ogg src=1.ogv>
</video>
</body></html>
Clearly I had screwed up the 1st time I tried.
>>> print html5lib.serializer.serialize(doc, tree="lxml", format="html", omit_optional_tags=True, strip_whitespace=True)
<!DOCTYPE html><head> <title>1</title> </head> <body> <video> <source type=video/ogg src=1.ogv> </video>
it even supports alphabetizing attributes, which @Aeyoun wanted IIRC
@ralsina: looks good, at least the try with omit_optional_tags=False
. Interestingly, both W3C validators accept the shorter fragments without <body>
, </body>
and </html>
as valid. Apparently, the closing tags have already been optional for some time (http://stackoverflow.com/questions/3008593/html-include-or-exclude-optional-closing-tags). Also, here: http://www.w3schools.com/htmL/html5_syntax.asp it is written that <body>
and <html>
can be omitted as well. Interesting; I really didn't knew that.
In case anyone cares, the detailed rules for HTML5 in which precise cases something can be omitted are here: http://www.w3.org/TR/html5/syntax.html#optional-tags
You loose XML parsability by dropping those tags. (Only doctype and the title element are strictly required by the spec.)
You already lose XML parsability by using HTML5, since closing some tags (such as <source>
) is illegal. Anyway, I'd prefer to have as many closing tags as possible (i.e. allowed).
We can trivially make this configurable if you guys tell me this serializer is worth using (I defer because I am no expert in this)
<!DOCTYPE html><html><head>
<title>1</title>
</head>
<body>
<video>
<source type=video/ogg src=1.ogv>
</video>
</body></html>
Are there really no quotes around video/ogg
and 1.ogv
? That’s ugly.
Well, that's legal as long as the attribute value has no " ' ` = < > or space...
It is legal, but it just looks bad. Besides, I’m not sure if we should change the entire rendering pipeline just to address one minor tag…
I’ve bashed html5lib over the head with the troublesome areas I know about (was a browser tester for five years, after all) and can’t immediately find anything wrong with it. I’d say it’s better† than the current solution and recommend that we proceed with html5lib.
I suggest providing two presets in an option (HTML_OUTPUT = "minified" | "xml-like"
) to make everyone stylistically happy. The two main choices boils down to minified versus XML(-like). Defaulting to "xml-like" will produce output similar to what Nikola does today.
Minified:
quote_attr_values=False
omit_optional_tags=True
minimize_boolean_attributes=True
(can break stringmatched JS/CSS, but nothing in Nikola uses that by default)strip_whitespace=True
alphabetical_attributes=True
escape_lt_in_attrs=True
(seen so many tagsoup readers matching against <
over the years so we should definitely always escape this, even though not required by spec)XML-like:
quote_attr_values=True
use_trailing_solidus=True
alphabetical_attributes=True
escape_lt_in_attrs=True
strip_whitespace=False
(because no understanding of xml:space nor CDATA)
† From glancing over their code, they seem to know what a bastard HTML (authors) can be and know how to deal with it. Solves problems I’ve worked around using the tidy5 filters (which is much slower than html5lib) and cough custom regex filters. Produces output that should yield better compression.
I like Aeyoun's proposal a lot. XML-like should be the default, since it is closest to what lxml outputs (if I'm not mistaken).
The main differences with "XML-like" from today’s output is attribute sorting and consistent trailing solidus. I also agree that XML-like should be default, even though I’m likely to opt for minified myself. (Got to save those bytes.)
If we refactor "write this tree to a file" into a separate function (which we should anyway, because it's duplicated) we can make this all work easy and improve the code quality and test coverage at the same time.
I have not tried it but we may want to keep using lxml's parser and just use html5lib's serializer.
If I were to make the final decision, I would not give users a choice (since less than 10 people are going to be aware of the option’s existence, let alone know the difference and actually care about that) and instead pick a good compromise of the options. I would personally opt for minified + quote_attr_values=True
, but that’s only my opinion.
Blah, html5lib crashes if the LXML tree contains a ProcessingInstruction node :-(
It looks like a bug (or rather a missing case) in their lxmltree.TreeWalker.getNodeDetails
I am so not going to shave that yak...
For an example failure, check the enter-html5lib branch with the demo site's charts
story:
$ cd cache/posts/
$ python
Python 2.7.8 (default, Oct 20 2014, 15:05:19)
[GCC 4.9.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml, html5lib
>>> doc=lxml.html.fromstring(open('charts.html').read())
>>> html5lib.serializer.serialize(doc,tree='lxml')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ralsina/.virtualenvs/nikola/local/lib/python2.7/site-packages/html5lib/serializer/__init__.py", line 16, in serialize
return s.render(walker(input), encoding)
File "/home/ralsina/.virtualenvs/nikola/local/lib/python2.7/site-packages/html5lib/serializer/htmlserializer.py", line 306, in render
return "".join(list(self.serialize(treewalker)))
File "/home/ralsina/.virtualenvs/nikola/local/lib/python2.7/site-packages/html5lib/serializer/htmlserializer.py", line 196, in serialize
for token in treewalker:
File "/home/ralsina/.virtualenvs/nikola/local/lib/python2.7/site-packages/html5lib/filters/optionaltags.py", line 17, in __iter__
for previous, token, next in self.slider():
File "/home/ralsina/.virtualenvs/nikola/local/lib/python2.7/site-packages/html5lib/filters/optionaltags.py", line 9, in slider
for token in self.source:
File "/home/ralsina/.virtualenvs/nikola/local/lib/python2.7/site-packages/html5lib/treewalkers/_base.py", line 144, in __iter__
details = self.getNodeDetails(currentNode)
File "/home/ralsina/.virtualenvs/nikola/local/lib/python2.7/site-packages/html5lib/treewalkers/lxmletree.py", line 150, in getNodeDetails
match = tag_regexp.match(ensure_str(node.tag))
File "/home/ralsina/.virtualenvs/nikola/local/lib/python2.7/site-packages/html5lib/treewalkers/lxmletree.py", line 18, in ensure_str
return s.decode("utf-8", "strict")
AttributeError: 'builtin_function_or_method' object has no attribute 'decode'
Added two new filters so anyone can quickly test this against their sites: html5lib_minify and html5lib_xmllike.
Since this doesn't get rid of LXML, A filter is a very good idea, I say that's it.
If you use a
<video>
element with a<source>
tag in a post, theme, ..., Nikola generates a</source>
for every<source ...>
, which is not valid HTML5.This happens in the last few lines of
Nikola.render_template()
innikola/nikola.py
:This is a problem of lxml.html (or maybe of something they use), so nothing we can easily fix.
I'll try to investigate more when I'm more motivated and have some more time... (For my blog, I can "fix" this by replacing "" with "" in all .html files... That's no good solution in general, though...)