Closed O-Zone closed 12 years ago
Definitely a problem. Let me assemble a test case, unless you have one.
The parser and tokenizer don't exactly work independently in the HTML5 algorithm. The fallbacks are scattered between them.
The parse tree it creates is correct as far as I can tell:
<html>
<head>
<body>
"
"
<script>
"
var x = "<a href='myPage.htm'>myLink<\/a>"
"
"
"
I assume you need a token stream more than a DOM.
I intend to create such a thing, though it's hard with the various crazy documents out there.
Yes - what I need is really a token stream. I thought the cause was that the separation of the tokenizer and the parser was not that strict. Great to hear that you are planning on making a standalone tokenizer! :-)
Besides this relatively isolated problem, I have found no flaws in the tokenizer. Really nice work! :-)
Thanks!
I'm O-Zone's colleague and I wound up working on the same piece of code and just stumbled upon this issue :)
The problem was a side effect of using the tokenizer on its own, because some of the tokenizer.content_model housekeeping is actually performed by the parser. I ended up copying the logic that transitions to SCRIPT_CDATA and PCDATA into the module in our code that uses the tokenizer, thus solving the problem.
It's kind of a gotcha for people who expect to be able to use the tokenizer on its own, but it doesn't really strike me as a bad design -- unless someone can come up with a clean way of moving that transitioning code into the tokenizer (I doubt it).
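To make the workaround above concrete, here is a minimal sketch of the idea in Python. It does not use the library's real API: ToyTokenizer, its content_model and raw_end_tag attributes, the PCDATA/CDATA flags, and tokenize_standalone are all hypothetical names invented for illustration. The only point it shows is that the code consuming the tokens, not the tokenizer, flips the content model when it sees a script (or style) start tag, which is the housekeeping the parser would normally do.

```python
import re

PCDATA, CDATA = "PCDATA", "CDATA"      # hypothetical content-model flags
RAW_TEXT_ELEMENTS = {"script", "style"}


class ToyTokenizer:
    """Hypothetical stand-in for a standalone tokenizer (not the real library)."""

    TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.content_model = PCDATA    # the caller may change this between tokens
        self.raw_end_tag = None        # element whose end tag terminates CDATA mode

    def __iter__(self):
        while self.pos < len(self.text):
            if self.content_model == CDATA:
                # Raw text: everything up to the matching end tag is one token.
                end = re.compile(r"</%s\s*>" % re.escape(self.raw_end_tag), re.I)
                m = end.search(self.text, self.pos)
                stop = m.start() if m else len(self.text)
                data, self.pos = self.text[self.pos:stop], stop
                self.content_model = PCDATA
                if data:
                    yield ("Characters", data)
                continue
            m = self.TAG.match(self.text, self.pos)
            if m:
                self.pos = m.end()
                yield ("EndTag" if m.group(1) else "StartTag", m.group(2).lower())
            else:
                # Plain text up to the next "<" (or the end of the input).
                nxt = self.text.find("<", self.pos + 1)
                stop = nxt if nxt != -1 else len(self.text)
                data, self.pos = self.text[self.pos:stop], stop
                yield ("Characters", data)


def tokenize_standalone(text):
    """Drive the tokenizer and do the content-model switch the parser would do."""
    tokenizer = ToyTokenizer(text)
    for kind, value in tokenizer:
        if kind == "StartTag" and value in RAW_TEXT_ELEMENTS:
            tokenizer.content_model = CDATA
            tokenizer.raw_end_tag = value
        yield (kind, value)


snippet = "<script>var x = \"<a href='myPage.htm'>myLink<\\/a>\"</script>"
for token in tokenize_standalone(snippet):
    print(token)
```

Iterating over ToyTokenizer(snippet) directly, without the driver, splits the script body at the tags inside the string literal, which mirrors the behaviour reported in this issue. With the driver in place, the script body arrives as a single Characters token.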
Yeah. The HTML5 algorithm doesn't leave the tokenizer separate, sadly. I'm still thinking of ways to minimally involve the parser to do the right state transitions.
I'm going to close and open an issue to track separating the tokenizer.
I'm using the Tokenizer to chop up HTML snippets, and it is working like a dream. Unfortunately, it splits up a script block if the block contains tags, even when those tags are inside string literals. I don't know if this is by design or by accident, but I would expect it to handle the whole script block as one Character token.
Here is an example of the problem:
When tokenized, this snippet will result in something in the (simplified) neighbourhood of:
I would have expected 3-5 to be all in one Character node. The end a-tag also results in a ParseError, since it is escaped and thus does not get recognized. So my code ends up deeming this snippet malformed, as it lacks the end a-tag.
Is this a bug, or is it by design because it should be handled elsewhere (the Parser?). I'm not using the parser because of speed and a desire to keep everything streaming nicely, without having to wait for the whole tree to be built and managed.
My guess is that the same thing would happen with poorly named selectors in a style tag too, but I haven't tried that out yet.