Closed by dgreensp 11 years ago
That one is, I believe, one of the failing test cases -- and you may have nailed the reason for it better than I have.
I'll look into it. The drains in the middle of a couple of states may well not be handled properly -- mostly, it's undoable and works great, but it does have to be handled manually in a couple of the tokenizer states.
Adding a buffer.commit() to doctype_name_state seems to help.
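For anyone following along, here is a minimal sketch of how a drain-and-undo buffer could produce a doubled name, and why a commit in the name state would help. Everything in it (ReplayBuffer, Drain, run, and the chunk boundary that triggers the drain) is a hypothetical model written for illustration, not the library's actual code, and the real trigger for the drain in the tokenizer may be different.

```javascript
// Hypothetical, simplified model of a drain/undo buffer. Not the library's API.
class Drain extends Error {}

class ReplayBuffer {
  constructor() { this.data = ''; this.pos = 0; this.committed = 0; }
  append(chunk) { this.data += chunk; }
  // Return the next character, or throw Drain when the input has run out.
  char() {
    if (this.pos >= this.data.length) throw new Drain();
    return this.data[this.pos++];
  }
  // Mark everything read so far as consumed for good.
  commit() { this.committed = this.pos; }
  // Rewind the read position to the last commit.
  undo() { this.pos = this.committed; }
}

// Reads a doctype name terminated by '>' from a sequence of chunks.
// commitInNameState toggles the extra commit discussed in this thread.
function run(chunks, commitInNameState) {
  const buffer = new ReplayBuffer();
  const token = { name: '' };
  let state = 'beforeName';
  for (const chunk of chunks) {
    buffer.append(chunk);
    try {
      for (;;) {
        const c = buffer.char();            // may throw Drain mid-name
        if (state === 'beforeName') {
          token.name = c;
          buffer.commit();                  // first character is committed
          state = 'name';
        } else if (c === '>') {
          buffer.commit();
          return token.name;
        } else {
          token.name += c;                  // side effect that undo() cannot revert
          if (commitInNameState) buffer.commit();
        }
      }
    } catch (e) {
      if (!(e instanceof Drain)) throw e;
      buffer.undo();                        // rewind and retry with the next chunk
    }
  }
  return token.name;
}

console.log(run(['html', '>'], false));     // "htmltml"
console.log(run(['html', '>'], true));      // "html"
```

In this model, undo() rewinds the read position but not the characters already appended to token.name, so any characters read after the last commit get appended a second time on retry. Committing after each append keeps the buffer position and the token in step.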
That looks like the right answer. I'm trying it out now.
If I just tokenize the string "<!DOCTYPE html>", the doctype token comes out with a name of "htmltml". It seems to be a problem with the system of draining and undoing the buffer.

Is there some rule that's typically followed in the parser to avoid cases like this, but which is violated for doctypes? Knowing that would give me more confidence in the implementation. I actually haven't been playing with the tokenizer for very long; this is one of the first inputs I tried.
Thanks!