Drain after DOCTYPE results in bad tokenization

aredridel / html5

Event-driven HTML5 Parser in Javascript

http://dinhe.net/~aredridel/projects/js/html5/

MIT License


Drain after DOCTYPE results in bad tokenization #75

Closed by dgreensp 11 years ago

dgreensp commented 11 years ago

If I just tokenize the string "<!DOCTYPE html>", the doctype token comes out with the name "htmltml" instead of "html". The problem seems to lie in how the buffer is drained and then undone.

Is there some rule that's typically followed in the parser to avoid cases like this, but which is violated for doctypes? Knowing that would give me more confidence in the implementation. I actually haven't been playing with the tokenizer for very long; this is one of the first inputs I tried.

Thanks!

aredridel commented 11 years ago

That one is, I believe, one of the failing test cases -- and you may have nailed the reason for it better than I have.

I'll look into it. The drains in the middle of a couple of states may well not be handled properly -- in most states the buffer can simply be undone and that works great, but a couple of the tokenizer states have to handle a drain manually.

dgreensp commented 11 years ago

Adding a buffer.commit() to doctype_name_state seems to help.
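
To illustrate why a commit would help, here is a minimal toy model of the drain-and-undo pattern. It is a sketch under assumptions, not the library's actual code: `ToyBuffer`, `tokenizeDoctypeName`, and the two-state loop are all hypothetical names made up for this example. The idea is that `undo()` rewinds the read position to the last commit, but it cannot roll back characters already appended to the token, so anything read since the last commit is appended a second time after a drain.

```javascript
// Toy model only: every name here is hypothetical, not the html5 library's API.
const DRAIN = new Error('drain');

class ToyBuffer {
  constructor() {
    this.data = '';
    this.pos = 0;        // next character to read
    this.committed = 0;  // position that undo() rewinds to
  }
  append(chunk) { this.data += chunk; }
  char() {
    if (this.pos >= this.data.length) throw DRAIN; // ran out of input
    return this.data[this.pos++];
  }
  commit() { this.committed = this.pos; }
  undo() { this.pos = this.committed; }
}

// Reads a doctype name terminated by '>' from input arriving in chunks.
// commitInNameState toggles the fix discussed above.
function tokenizeDoctypeName(chunks, commitInNameState) {
  const buffer = new ToyBuffer();
  const token = { type: 'Doctype', name: '' };
  let state = 'beforeName';

  for (const chunk of chunks) {
    buffer.append(chunk);
    try {
      while (state !== 'done') {
        const c = buffer.char();
        if (state === 'beforeName') {
          token.name = c;
          buffer.commit();          // this state commits what it consumed
          state = 'name';
        } else if (c === '>') {
          state = 'done';
        } else {
          token.name += c;          // side effect that undo() cannot roll back
          if (commitInNameState) buffer.commit();
        }
      }
    } catch (e) {
      if (e !== DRAIN) throw e;
      buffer.undo();                // rewind to the last commit, wait for more input
    }
  }
  return token.name;
}

console.log(tokenizeDoctypeName(['html', '>'], false)); // "htmltml"
console.log(tokenizeDoctypeName(['html', '>'], true));  // "html"
```

Without the commit, the drain after "html" rewinds to just past the "h", so "tml" is read and appended again once ">" arrives. With a commit after each appended character, the rewind is a no-op and the name stays "html".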

aredridel commented 11 years ago

That looks like the right answer. I'm trying it out now.