Nikola makes HTML5 video source tags invalid

felixfontein commented 9 years ago

If you use a <video> element with a <source> tag in a post, theme, ..., Nikola generates a </source> for every <source ...>, which is not valid HTML5.

This happens in the last few lines of Nikola.render_template() in nikola/nikola.py:

parser = lxml.html.HTMLParser(remove_blank_text=True)
doc = lxml.html.document_fromstring(data, parser)
doc.rewrite_links(lambda dst: self.url_replacer(src, dst, context['lang']))
data = b'<!DOCTYPE html>\n' + lxml.html.tostring(doc, encoding='utf8', method='html', pretty_print=True)

This is a problem of lxml.html (or maybe of something they use), so nothing we can easily fix.

I'll try to investigate more when I'm more motivated and have some more time... (For my blog, I can "fix" this by replacing "" with "" in all .html files... That's no good solution in general, though...)

felixfontein commented 9 years ago

Minimal example to reproduce (with lxml directly):

import lxml.html

data = """<!DOCTYPE html>
<html>
  <head>
    <title>1</title>
  </head>
  <body>
    <video>
      <source src="1.ogv" type="video/ogg">
    </video>
  </body>
</html>"""

parser = lxml.html.HTMLParser()
doc = lxml.html.document_fromstring(data, parser)
data = b'<!DOCTYPE html>\n' + lxml.html.tostring(doc, method='html')
result = data.decode('utf-8')

assert result == """<!DOCTYPE html>
<html>
  <head>
    <title>1</title>
  </head>
  <body>
    <video>
      <source src="1.ogv" type="video/ogg">
    </source></video>
  </body>
</html>"""

da2x commented 9 years ago

tidy5 fixes this.

The source element is also used with the audio and picture elements.

felixfontein commented 9 years ago

The problem seems to be that the HTML tag list of libxml2, which is used by lxml, is stuck to HTML 4.

ralsina commented 9 years ago

Looks like lxml devs are aware of this:

https://github.com/lxml/lxml/blob/f8bdb7d48b5cf4403aa6bbbf10f5f8da6a4a510e/src/lxml/tests/test_incremental_xmlfile.py#L342

Kwpolska commented 9 years ago

maybe we could ask for help on their mailing list?

ralsina commented 9 years ago

@Kwpolska looks like @felixfontein already asked :-)

https://mailman-mail5.webfaction.com/pipermail/lxml/2015-May/007524.html

ralsina commented 9 years ago

Found a way!

>>> import html5lib
>>> lxml_etree_document = html5lib.parse("""<!DOCTYPE html>,
... <html>
...   <head>
...     <title>1</title>
...   </head>
...   <body>
...     <video>
...       <source src="1.ogv" type="video/ogg">
...     </video>
...   </body>
... </html>""", treebuilder="lxml")
>>> import html5lib.serializer
>>> print html5lib.serializer.serialize(lxml_etree_document,tree="lxml")
<!DOCTYPE html>,

    <title>1</title>

    <video>
      <source type=video/ogg src=1.ogv>
    </video>

I am not modern enough to know if that is still valid html5 ;-)

ralsina commented 9 years ago

A little more research and it does a nice job:

>>> print html5lib.serializer.serialize(doc, tree="lxml", format="html", omit_optional_tags=False)
<!DOCTYPE html><html><head>
    <title>1</title>
  </head>
  <body>
    <video>
      <source type=video/ogg src=1.ogv>
    </video>

</body></html>

Clearly I had screwed up the 1st time I tried.

>>> print html5lib.serializer.serialize(doc, tree="lxml", format="html", omit_optional_tags=True, strip_whitespace=True)
<!DOCTYPE html><head> <title>1</title> </head> <body> <video> <source type=video/ogg src=1.ogv> </video>

it even supports alphabetizing attributes, which @Aeyoun wanted IIRC

felixfontein commented 9 years ago

@ralsina: looks good, at least the try with omit_optional_tags=False. Interestingly, both W3C validators accept the shorter fragments without <body>, </body> and </html> as valid. Apparently, the closing tags have already been optional for some time (http://stackoverflow.com/questions/3008593/html-include-or-exclude-optional-closing-tags). Also, here: http://www.w3schools.com/htmL/html5_syntax.asp it is written that <body> and <html> can be omitted as well. Interesting; I really didn't knew that.

felixfontein commented 9 years ago

In case anyone cares, the detailed rules for HTML5 in which precise cases something can be omitted are here: http://www.w3.org/TR/html5/syntax.html#optional-tags

da2x commented 9 years ago

You loose XML parsability by dropping those tags. (Only doctype and the title element are strictly required by the spec.)

felixfontein commented 9 years ago

You already lose XML parsability by using HTML5, since closing some tags (such as <source>) is illegal. Anyway, I'd prefer to have as many closing tags as possible (i.e. allowed).

ralsina commented 9 years ago

We can trivially make this configurable if you guys tell me this serializer is worth using (I defer because I am no expert in this)

Kwpolska commented 9 years ago

<!DOCTYPE html><html><head>
    <title>1</title>
  </head>
  <body>
    <video>
      <source type=video/ogg src=1.ogv>
    </video>

</body></html>

Are there really no quotes around video/ogg and 1.ogv? That’s ugly.

ralsina commented 9 years ago

Well, that's legal as long as the attribute value has no " ' ` = < > or space...

Kwpolska commented 9 years ago

It is legal, but it just looks bad. Besides, I’m not sure if we should change the entire rendering pipeline just to address one minor tag…

da2x commented 9 years ago

I’ve bashed html5lib over the head with the troublesome areas I know about (was a browser tester for five years, after all) and can’t immediately find anything wrong with it. I’d say it’s better† than the current solution and recommend that we proceed with html5lib.

I suggest providing two presets in an option (HTML_OUTPUT = "minified" | "xml-like") to make everyone stylistically happy. The two main choices boils down to minified versus XML(-like). Defaulting to "xml-like" will produce output similar to what Nikola does today.

Minified:

quote_attr_values=False
omit_optional_tags=True
minimize_boolean_attributes=True (can break stringmatched JS/CSS, but nothing in Nikola uses that by default)
strip_whitespace=True
alphabetical_attributes=True
escape_lt_in_attrs=True (seen so many tagsoup readers matching against < over the years so we should definitely always escape this, even though not required by spec)

XML-like:

quote_attr_values=True
use_trailing_solidus=True
alphabetical_attributes=True
escape_lt_in_attrs=True
strip_whitespace=False (because no understanding of xml:space nor CDATA)

† From glancing over their code, they seem to know what a bastard HTML (authors) can be and know how to deal with it. Solves problems I’ve worked around using the tidy5 filters (which is much slower than html5lib) and cough custom regex filters. Produces output that should yield better compression.

felixfontein commented 9 years ago

I like Aeyoun's proposal a lot. XML-like should be the default, since it is closest to what lxml outputs (if I'm not mistaken).

da2x commented 9 years ago

The main differences with "XML-like" from today’s output is attribute sorting and consistent trailing solidus. I also agree that XML-like should be default, even though I’m likely to opt for minified myself. (Got to save those bytes.)

ralsina commented 9 years ago

If we refactor "write this tree to a file" into a separate function (which we should anyway, because it's duplicated) we can make this all work easy and improve the code quality and test coverage at the same time.

ralsina commented 9 years ago

I have not tried it but we may want to keep using lxml's parser and just use html5lib's serializer.

Kwpolska commented 9 years ago

If I were to make the final decision, I would not give users a choice (since less than 10 people are going to be aware of the option’s existence, let alone know the difference and actually care about that) and instead pick a good compromise of the options. I would personally opt for minified + quote_attr_values=True, but that’s only my opinion.

ralsina commented 9 years ago

Blah, html5lib crashes if the LXML tree contains a ProcessingInstruction node :-(

It looks like a bug (or rather a missing case) in their lxmltree.TreeWalker.getNodeDetails

I am so not going to shave that yak...

For an example failure, check the enter-html5lib branch with the demo site's charts story:

$ cd cache/posts/
$ python
Python 2.7.8 (default, Oct 20 2014, 15:05:19) 
[GCC 4.9.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml, html5lib
>>> doc=lxml.html.fromstring(open('charts.html').read())
>>> html5lib.serializer.serialize(doc,tree='lxml')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ralsina/.virtualenvs/nikola/local/lib/python2.7/site-packages/html5lib/serializer/__init__.py", line 16, in serialize
    return s.render(walker(input), encoding)
  File "/home/ralsina/.virtualenvs/nikola/local/lib/python2.7/site-packages/html5lib/serializer/htmlserializer.py", line 306, in render
    return "".join(list(self.serialize(treewalker)))
  File "/home/ralsina/.virtualenvs/nikola/local/lib/python2.7/site-packages/html5lib/serializer/htmlserializer.py", line 196, in serialize
    for token in treewalker:
  File "/home/ralsina/.virtualenvs/nikola/local/lib/python2.7/site-packages/html5lib/filters/optionaltags.py", line 17, in __iter__
    for previous, token, next in self.slider():
  File "/home/ralsina/.virtualenvs/nikola/local/lib/python2.7/site-packages/html5lib/filters/optionaltags.py", line 9, in slider
    for token in self.source:
  File "/home/ralsina/.virtualenvs/nikola/local/lib/python2.7/site-packages/html5lib/treewalkers/_base.py", line 144, in __iter__
    details = self.getNodeDetails(currentNode)
  File "/home/ralsina/.virtualenvs/nikola/local/lib/python2.7/site-packages/html5lib/treewalkers/lxmletree.py", line 150, in getNodeDetails
    match = tag_regexp.match(ensure_str(node.tag))
  File "/home/ralsina/.virtualenvs/nikola/local/lib/python2.7/site-packages/html5lib/treewalkers/lxmletree.py", line 18, in ensure_str
    return s.decode("utf-8", "strict")
AttributeError: 'builtin_function_or_method' object has no attribute 'decode'

da2x commented 9 years ago

Added two new filters so anyone can quickly test this against their sites: html5lib_minify and html5lib_xmllike.

ralsina commented 9 years ago

Since this doesn't get rid of LXML, A filter is a very good idea, I say that's it.

getnikola / nikola

Nikola makes HTML5 video source tags invalid #1768