Tokenizer splits a script block if it contains tags, even if they are inside literals.

O-Zone commented 13 years ago

I'm using the Tokenizer to chop up HTML snippets, and it is working like a dream. Unfortunately, it has one odd habit: it splits up a script block if the block contains tags, even when they are inside string literals. I don't know if this is by design or by accident, but I would expect it to handle the whole script block as one Characters token.

Here is an example of the problem:

<body>
  <script>
    var x = "<a href='myPage.htm'>myLink<\/a>"
  </script>
</body>

When tokenized, this snippet will result in something in the (simplified) neighbourhood of:

StartTag: body
StartTag: script
Characters: 'var x = "'
StartTag: a
Characters: 'myLink<\/a>'
EndTag: script
EndTag: body

I would have expected tokens 3-5 in that list to be a single Characters token. The end a-tag also results in a ParseError, since it is escaped and thus is not recognized. So my code ends up deeming this snippet malformed because it lacks the end a-tag.

Is this a bug, or is it by design because it should be handled elsewhere (the Parser)? I'm not using the parser because of speed and a desire to keep everything streaming nicely, without having to wait for the whole tree to be built and managed.

My guess is that the same thing would happen with poorly named selectors in a style tag, but I haven't tried that yet.

aredridel commented 13 years ago

Definitely a problem. Let me assemble a test case, unless you have one.

aredridel commented 13 years ago

The parser and tokenizer don't exactly work independently in the HTML5 algorithm. The fallbacks are scattered between them.

The parse tree it creates is correct as far as I can tell:

<html>
  <head>
  <body>
    "
  "
    <script>
      "
    var x = "<a href='myPage.htm'>myLink<\/a>"
  "
    "

"

aredridel commented 13 years ago

I assume you need a token stream more than a DOM.

I intend to create such a thing, though it's hard with the various crazy documents out there.

O-Zone commented 13 years ago

Yes, what I really need is a token stream. I suspected the cause was that the tokenizer and the parser are not strictly separated. Great to hear that you are planning to make a standalone tokenizer! :-)

Besides this relatively isolated problem, I have found no flaws in the tokenizer. Really nice work! :-)

aredridel commented 13 years ago

Thanks!

papandreou commented 12 years ago

I'm O-Zone's colleague and I wound up working on the same piece of code and just stumbled upon this issue :)

The problem was a side effect of using the tokenizer on its own, because some of the tokenizer.content_model housekeeping is actually performed by the parser. I ended up copying the logic that transitions to SCRIPT_CDATA and PCDATA into the module in our code that uses the tokenizer, thus solving the problem.
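
A minimal sketch of that housekeeping, for anyone who hits the same gotcha. It uses a toy regex tokenizer, not the html5 library's real API: `tokenize`, `trackContentModel`, and `RAW_TEXT_ELEMENTS` are hypothetical names, and the "switch" is reduced to swallowing everything up to the matching end tag as one Characters token.

```javascript
// Toy tokenizer for illustration only; NOT the html5 library's implementation.
// Elements whose content should be treated as raw character data (assumption:
// only script and style are handled in this sketch).
const RAW_TEXT_ELEMENTS = new Set(['script', 'style']);

function tokenize(html, { trackContentModel }) {
  const tokens = [];
  // Matches simple start and end tags; an escaped "<\/a>" does not match.
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
  let pos = 0;

  const emitCharacters = (text) => {
    if (text.length > 0) tokens.push({ type: 'Characters', data: text });
  };

  while (pos < html.length) {
    tagPattern.lastIndex = pos;
    const match = tagPattern.exec(html);
    if (!match) {
      emitCharacters(html.slice(pos));
      break;
    }
    emitCharacters(html.slice(pos, match.index));
    const name = match[2].toLowerCase();
    tokens.push({ type: match[1] ? 'EndTag' : 'StartTag', name });
    pos = tagPattern.lastIndex;

    // The housekeeping step: after a raw-text start tag, stop recognizing
    // tags and emit everything up to the matching end tag as one token.
    if (trackContentModel && !match[1] && RAW_TEXT_ELEMENTS.has(name)) {
      const closer = new RegExp('</' + name + '[\\s/>]', 'i');
      const rest = html.slice(pos);
      const end = rest.search(closer);
      const rawLength = end === -1 ? rest.length : end;
      emitCharacters(rest.slice(0, rawLength));
      pos += rawLength;
    }
  }
  return tokens;
}

// The snippet from the original report ("\\/" is a literal backslash-slash).
const snippet = [
  '<body>',
  '  <script>',
  '    var x = "<a href=\'myPage.htm\'>myLink<\\/a>"',
  '  </script>',
  '</body>',
].join('\n');

const summarize = (tokens) =>
  tokens.map((t) => t.type + ': ' + (t.name || JSON.stringify(t.data)));

console.log(summarize(tokenize(snippet, { trackContentModel: false })));
console.log(summarize(tokenize(snippet, { trackContentModel: true })));
```

With `trackContentModel` set to `false`, the first listing contains a StartTag for `a`, as in the original report. With it set to `true`, the script body comes out as a single Characters token.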

It's kind of a gotcha for people who expect to be able to use the tokenizer on its own, but it doesn't really strike me as a bad design -- unless someone can come up with a clean way of moving that transitioning code into the tokenizer (I doubt it).

aredridel commented 12 years ago

Yeah. The HTML5 algorithm doesn't leave the tokenizer separate, sadly. I'm still thinking of ways to minimally involve the parser to do the right state transitions.
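
One way to picture minimal parser involvement: in the current HTML Standard, the tree builder switches the tokenizer state after a handful of start tags. A rough sketch of that lookup follows. The state names are taken from the HTML Standard, while `nextTokenizerState` and the table name are hypothetical and not part of this library. The sketch ignores foreign content (SVG/MathML) and the scripting flag (which affects noscript), so it is an approximation rather than a full solution.

```javascript
// Start tags after which the tree builder changes the tokenizer state,
// simplified from the HTML Standard. Names below are illustrative only.
const STATE_AFTER_START_TAG = {
  title: 'RCDATA',
  textarea: 'RCDATA',
  style: 'RAWTEXT',
  xmp: 'RAWTEXT',
  iframe: 'RAWTEXT',
  noembed: 'RAWTEXT',
  noframes: 'RAWTEXT',
  script: 'script data',
  plaintext: 'PLAINTEXT',
};

// Returns the tokenizer state to use after the given start tag;
// 'data' means ordinary tokenizing where tags are recognized.
function nextTokenizerState(startTagName) {
  const name = startTagName.toLowerCase();
  return Object.prototype.hasOwnProperty.call(STATE_AFTER_START_TAG, name)
    ? STATE_AFTER_START_TAG[name]
    : 'data';
}
```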

aredridel commented 12 years ago

I'm going to close this and open a new issue to track separating the tokenizer.

aredridel / html5

Tokenizer splits a script block if it contains tags, even if they are inside literals. #17