aredridel / html5

Event-driven HTML5 Parser in Javascript
http://dinhe.net/~aredridel/projects/js/html5/
MIT License

Lack of documentation. #110

Open ruipgil opened 10 years ago

ruipgil commented 10 years ago

The documentation is lacking. Even the example in the README is outdated or not explained correctly.

aredridel commented 10 years ago

Yes indeed! Needs some TLC.

Any aspects you want to see first?

ruipgil commented 10 years ago

Since the project is widely used with JSDOM, an annotated JSDOM example should always be kept up to date. You could also use tests as examples, to make sure everything works: more of a usage test than a unit test. With tests like that, you would only need to point people to the source code of the example.

eGavr commented 10 years ago

Can you give a real working example of using your tool in Node.js without jQuery?

aredridel commented 10 years ago

The v1.0.1 README now has an example.

eGavr commented 10 years ago

Thank you, but it seems that this is not a simple example. Why is it so difficult? That is a lot of code for such a simple task...

Can I do something like this:

var parser = require('parse5');
var html = '<p>blah</p>';

console.log(parser.parse(html));

and have console.log print the full DOM tree?

danyaPostfactum commented 10 years ago

HTML5 does not contain a DOM implementation, so you have to provide one. If you just need a DOM tree:

var HTML5 = require('html5');
var jsdom = require('jsdom');

var DOMImplementation = jsdom.level(3).DOMImplementation;
var parser = new HTML5.DOMParser(new DOMImplementation());

var document = parser.parse('<p>I am a very small HTML document</p>');

console.log(document.getElementsByTagName("p")[0].textContent);

Also, take a look at SAXParser:

var HTML5 = require('html5');

var parser = new HTML5.SAXParser();

parser.contentHandler = {
    startDocument: function() {},
    endDocument: function() {},
    startElement: function(uri, localName, qName, atts) {
        console.log('Start of <' + localName + '> element');
    },
    endElement: function(uri, localName, qName) {
        console.log('End of <' + localName + '> element');
    },
    characters: function(ch, start, length) {
        console.log('Characters: ' + ch);
    }
};

parser.parse('<p>I am a very small HTML document</p>');

Output:

Start of <html> element
Start of <head> element
End of <head> element
Start of <body> element
Start of <p> element
Characters: I am a very small HTML document
End of <p> element
End of <body> element
End of <html> element

eGavr commented 10 years ago

Great! I think SAXParser is what I need!

BUT!

<p>I am a very small HTML document</p>

Where are the html, head, etc. elements in the input?

Can I get information about exactly what was in the input?

danyaPostfactum commented 10 years ago

> Where are the html, head, etc. elements in the input?

The parser creates all these elements according to the HTML spec (browsers do the same). You can use the fragment parsing algorithm instead:

parser.parseFragment('<p>I am a very small HTML document</p>', 'body');

Fragment parsing was broken. I have just fixed it, so you need to pull the latest change (I am still not sure I fixed the bug properly).

> Can I get information about exactly what was in the input?

No, you receive repaired, well-formed output. This parser may create, forbid, or reparent elements as required by the HTML5 parsing specification.
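If only the elements from your own markup matter, one possible workaround is to wrap the contentHandler so that the wrapper elements are skipped. The following is a minimal sketch, assuming the handler signature shown earlier in this thread. `skipImplied` is a hypothetical helper, not part of html5, and the namespace value and the calls at the bottom only simulate the parser's events by hand. Note that it cannot tell an implied html element from one that was really present in the input.

```javascript
// Hypothetical helper (not part of html5): wraps a contentHandler and drops
// start/end events for the wrapper elements the parser adds per the spec.
function skipImplied(handler) {
    var implied = ['html', 'head', 'body'];
    function isImplied(localName) {
        return implied.indexOf(localName) !== -1;
    }
    return {
        startDocument: function() { handler.startDocument(); },
        endDocument: function() { handler.endDocument(); },
        startElement: function(uri, localName, qName, atts) {
            if (!isImplied(localName)) handler.startElement(uri, localName, qName, atts);
        },
        endElement: function(uri, localName, qName) {
            if (!isImplied(localName)) handler.endElement(uri, localName, qName);
        },
        characters: function(ch, start, length) {
            handler.characters(ch, start, length);
        }
    };
}

var seen = [];
var filtered = skipImplied({
    startDocument: function() {},
    endDocument: function() {},
    startElement: function(uri, localName) { seen.push('start ' + localName); },
    endElement: function(uri, localName) { seen.push('end ' + localName); },
    characters: function(ch) { seen.push('text ' + ch); }
});

// Simulated event sequence, copied from the output earlier in the thread.
// With the real parser you would assign `filtered` to parser.contentHandler.
var NS = 'http://www.w3.org/1999/xhtml'; // assumed value, for illustration only
var text = 'I am a very small HTML document';
filtered.startDocument();
filtered.startElement(NS, 'html', 'html', null);
filtered.startElement(NS, 'head', 'head', null);
filtered.endElement(NS, 'head', 'head');
filtered.startElement(NS, 'body', 'body', null);
filtered.startElement(NS, 'p', 'p', null);
filtered.characters(text, 0, text.length);
filtered.endElement(NS, 'p', 'p');
filtered.endElement(NS, 'body', 'body');
filtered.endElement(NS, 'html', 'html');
filtered.endDocument();

console.log(seen);
```

Only the events for the p element and its text reach the inner handler.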

eGavr commented 10 years ago

var HTML5 = require('html5');

var parser = new HTML5.SAXParser();

parser.contentHandler = {
    startDocument: function() {console.log('!!!!')},
    endDocument: function() {console.log('????')},
    startElement: function(uri, localName, qName, atts) {
        console.log("qNAme == ", qName)
        console.log(atts)
        console.log('Start of <' + localName + '> element');
    },
    endElement: function(uri, localName, qName) {
        console.log('End of <' + localName + '> element');
    },
    characters: function(ch, start, length) {
        console.log('Characters: ' + ch);
    }
};

parser.parseFragment('<p>I am a very small HTML document</p>', 'body');

This code doesn't work! I'm sorry ) Probably there is a silly mistake I haven't noticed! Can you help me?

eGavr commented 10 years ago

Can you give more information about contentHandler? So far I know about these methods: startDocument, endDocument, startElement, endElement, characters.

Is there anything else?

danyaPostfactum commented 10 years ago

This code works, but you need to pull this fix, which is not yet available via npm:

https://github.com/aredridel/html5/commit/4ff67be755fed8696c262f9c262da13c62d04c4d

> Is there anything else?

No. There is also a lexicalHandler, which can handle comments, doctypes, and CDATA sections, but it is not implemented yet (although it would be very easy to do).

eGavr commented 10 years ago

Are you going to do this? And can you give an approximate date for a release with these changes?

I mean, it would be great if you could combine contentHandler and lexicalHandler into one handler!

That way, everybody will be able to build a DOM tree from HTML code in whatever manner they want!

aredridel commented 10 years ago

Start an issue for 'em -- this one's about docs! -- and we'll go from there.

aredridel commented 10 years ago

And that fix is shipped in v1.0.3

danyaPostfactum commented 10 years ago

A lexical handler can now be defined:

parser.lexicalHandler = {
    comment: function(data) {
        console.log('Comment: ' + data);
    },
    startDTD: function(name, publicIdentifier, systemIdentifier) {
        console.log('Doctype: ' + name);
    },
    endDTD: function() {}
};

contentHandler is required, while lexicalHandler is optional.
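Since the two handlers are separate properties, a single object could plausibly serve as both. The following sketch illustrates that idea under the assumption that the parser simply calls the named methods on whatever object is assigned. `fakeParser` is a stand-in used only to keep the example self-contained, and the events are simulated by hand; with html5 you would assign the object to a real `HTML5.SAXParser` instance instead.

```javascript
var events = [];

// One object that carries both sets of callbacks.
var handler = {
    // contentHandler methods
    startDocument: function() {},
    endDocument: function() {},
    startElement: function(uri, localName, qName, atts) {
        events.push('<' + localName + '>');
    },
    endElement: function(uri, localName, qName) {
        events.push('</' + localName + '>');
    },
    characters: function(ch, start, length) {
        events.push(ch);
    },
    // lexicalHandler methods
    comment: function(data) {
        events.push('<!--' + data + '-->');
    },
    startDTD: function(name, publicIdentifier, systemIdentifier) {
        events.push('<!DOCTYPE ' + name + '>');
    },
    endDTD: function() {}
};

// Stand-in for `new HTML5.SAXParser()` (assumption: the real parser reads
// these two properties and calls the methods shown in this thread).
var fakeParser = {};
fakeParser.contentHandler = handler;
fakeParser.lexicalHandler = handler;

// Hand-simulated events for: <!DOCTYPE html><p>hi<!--note--></p>
fakeParser.contentHandler.startDocument();
fakeParser.lexicalHandler.startDTD('html', null, null);
fakeParser.lexicalHandler.endDTD();
fakeParser.contentHandler.startElement('', 'p', 'p', null);
fakeParser.contentHandler.characters('hi', 0, 2);
fakeParser.lexicalHandler.comment('note');
fakeParser.contentHandler.endElement('', 'p', 'p');
fakeParser.contentHandler.endDocument();

console.log(events.join(''));
```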

http://www.saxproject.org/apidoc/org/xml/sax/ContentHandler.html
http://www.saxproject.org/apidoc/org/xml/sax/ext/LexicalHandler.html

> everybody will be able to build a DOM tree from HTML code in whatever manner they want!

Right. With SAXParser they can.
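As a rough illustration of building your own tree from SAX events, here is a minimal sketch of a contentHandler that assembles plain nested objects. `createTreeBuilder` and the node shape are hypothetical and not part of html5. The `atts` argument is ignored because its structure is not described in this thread, and the events are fed in by hand in the same order as the output shown earlier, rather than by the real parser.

```javascript
// Hypothetical helper: returns a contentHandler that builds a plain-object tree.
function createTreeBuilder() {
    var root = { name: '#document', children: [] };
    var stack = [root]; // the last entry is the element currently open
    return {
        root: root,
        startDocument: function() {},
        endDocument: function() {},
        startElement: function(uri, localName, qName, atts) {
            var node = { name: localName, children: [] };
            stack[stack.length - 1].children.push(node);
            stack.push(node);
        },
        endElement: function(uri, localName, qName) {
            stack.pop();
        },
        characters: function(ch, start, length) {
            // Text is stored as a plain string child of the open element.
            stack[stack.length - 1].children.push(ch);
        }
    };
}

var builder = createTreeBuilder();

// Hand-simulated events; with html5 you would instead set
// parser.contentHandler = builder and call parser.parse(...).
var text = 'I am a very small HTML document';
builder.startDocument();
builder.startElement('', 'html', 'html', null);
builder.startElement('', 'head', 'head', null);
builder.endElement('', 'head', 'head');
builder.startElement('', 'body', 'body', null);
builder.startElement('', 'p', 'p', null);
builder.characters(text, 0, text.length);
builder.endElement('', 'p', 'p');
builder.endElement('', 'body', 'body');
builder.endElement('', 'html', 'html');
builder.endDocument();

var tree = builder.root;
console.log(JSON.stringify(tree, null, 2));
```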

eGavr commented 10 years ago

Is it in v1.0.3?

aredridel commented 10 years ago

Yep.