aredridel / html5

Event-driven HTML5 Parser in Javascript
http://dinhe.net/~aredridel/projects/js/html5/
MIT License
Improve handler #111

Open eGavr opened 10 years ago

eGavr commented 10 years ago

Besides, what about this situation:

<tag></tag>

and

<tag/>

?

It seems the contentHandler handles both in exactly the same way! Yes, they are identical for a browser, but from a parsing point of view they are not identical, are they?

aredridel commented 10 years ago

They are -- the HTML5 parser only concerns itself with parsing to construct a DOM.

eGavr commented 10 years ago

Are you going to fix this situation?

aredridel commented 10 years ago

Does it need to be fixed? What's the use-case?

eGavr commented 10 years ago

Yes! For example, when I want to transform my DOM tree back to html!

In the cases I've shown above, SAXParser parses both in the same way! That is a little unfair on the part of your SAXParser.

danyaPostfactum commented 10 years ago

SAXParser reports element start and element end events, not start tags and end tags. That's all.

Yes, they are identical for a browser, but from a parsing point of view they are not identical

If you really need low-level parsing info, you can use Tokenizer.

For example, when I want to transform my DOM tree back to html!

There is a limited set of VOID elements, so it is easy to serialize. http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#serialising-html-fragments

danyaPostfactum commented 10 years ago

Example of producing HTML from SAX events: https://gist.github.com/danyaPostfactum/ee94c3bf88b99fb94c4b

Example:

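// SAXParser comes from the html5 package; HtmlSerializer is the file from the gist above.
var SAXParser = require('html5').SAXParser;
var HtmlSerializer = require('./HtmlSerializer').HtmlSerializer;

// The serialized HTML is written to out.html.
var outStream = require('fs').createWriteStream("out.html");

var parser = new SAXParser();
var serializer = new HtmlSerializer(outStream);

// The same serializer object receives both content and lexical events.
parser.contentHandler = parser.lexicalHandler = serializer;

parser.parse('...');

eGavr commented 10 years ago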

But how can I tell whether a tag is self-closing?

danyaPostfactum commented 10 years ago

Just check whether its name matches one of the void elements: area, base, basefont, bgsound, br, col, embed, frame, hr, img, input, keygen, link, menuitem, meta, param, source, track or wbr.
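The name check described above can be sketched as follows. This is only an illustration: `VOID_ELEMENTS`, `isVoidElement` and `serializeElement` are hypothetical names, not part of the html5 package or of the gist, and the list is copied from the comment above.

```javascript
// Void element names, copied from the list in the comment above.
var VOID_ELEMENTS = [
    'area', 'base', 'basefont', 'bgsound', 'br', 'col', 'embed', 'frame',
    'hr', 'img', 'input', 'keygen', 'link', 'menuitem', 'meta', 'param',
    'source', 'track', 'wbr'
];

// Hypothetical helper: true if the element name is in the void list.
function isVoidElement(name) {
    return VOID_ELEMENTS.indexOf(String(name).toLowerCase()) !== -1;
}

// Hypothetical helper: serialize one element without attributes.
// Void elements get a start tag only; everything else gets a
// start tag, its already-serialized content, and an end tag.
function serializeElement(name, innerHtml) {
    if (isVoidElement(name)) {
        return '<' + name + '>';
    }
    return '<' + name + '>' + (innerHtml || '') + '</' + name + '>';
}
```

With this sketch, `serializeElement('br')` produces a lone start tag, while a non-void name such as `tag` always gets an explicit end tag, whether the input was written as `<tag></tag>` or `<tag/>`.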

eGavr commented 10 years ago

But what if someone misbehaves and wants to parse invalid input?

<bra/>, for example?

eGavr commented 10 years ago

Thank you for the list of self-closing tags!

danyaPostfactum commented 10 years ago

<bra/>, for example?

According to the spec, it will be interpreted as <bra>. You can check this in your browser.

Thank you for the list of self-closing tags!

See http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#serialising-html-fragments

eGavr commented 10 years ago

But I can try to parse this input:

<br></br>

For a browser this is <br>, but what will I receive after serialization?

In both cases I will receive the same output, but the two inputs were not the same.

Maybe a parameter could be added to one of your contentHandler's methods that is true when the tag is self-closing?

danyaPostfactum commented 10 years ago

The parser will ignore the </br> tag.

In both cases I will receive the same output, but the two inputs were not the same.

Yes, invalid markup will be repaired, as I already mentioned. Even valid input markup may not match the serialized output. Could you explain how you want to use the parser? You probably need another tool.

eGavr commented 10 years ago

For example, I want to check the validity of input or, as in my case, I want to compare two HTML documents! For me it is necessary to check the HTML documents as they are!

danyaPostfactum commented 10 years ago

For example, I want to check the validity of input

This parser is used in http://ace.c9.io/build/kitchen-sink.html (select HTML mode) for syntax checking.

parser.errorHandler = {
    error: function(message, location, code) {
        // Parse error
    }
};
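For validity checking, the handler above could simply record every error it is given. The following is a minimal sketch, assuming only the `error(message, location, code)` signature shown above; `createErrorCollector` and `isValid` are hypothetical names, and the shape of `location` is not specified in this thread, so it is stored as received.

```javascript
// Hypothetical helper: builds an errorHandler object that records
// every parse error it is given instead of discarding it.
function createErrorCollector() {
    var errors = [];
    return {
        errors: errors,
        // Same signature as the errorHandler.error callback above.
        error: function (message, location, code) {
            errors.push({ message: message, location: location, code: code });
        },
        // The input is treated as valid if no error was reported.
        isValid: function () {
            return errors.length === 0;
        }
    };
}

// Usage sketch (assumes `parser` is a SAXParser instance and `input`
// is the HTML string to check):
//   var collector = createErrorCollector();
//   parser.errorHandler = collector;
//   parser.parse(input);
//   if (!collector.isValid()) { console.log(collector.errors); }
```

Keeping the errors in a list, rather than stopping at the first one, lets the caller decide afterwards which of them matter.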

For me it is necessary to check the HTMLs as they are!

Not sure what you mean. I guess you will have to write your own parser.