Open ruipgil opened 10 years ago
Yes indeed! Needs some TLC.
Any aspects you want to see first?
Since the project is widely used with JSDOM, an annotated (JSDOM) example should always be kept up to date. You could also use tests as examples, to make sure everything works. These would be more usage tests than unit tests, and with them you would only need to point people to the example's source code.
Can you give a real-world example of using your tool in nodejs without jQuery?
The v1.0.1 README now has an example.
Thank you, but it does not seem to be a simple example. Why is it so difficult? Lots of code for such a simple example...
Can I do something like this:
var parser = require('parse5');
var html = '<p>blah</p>';
console.log(parser.parse(html));
and after console.log receive the full DOM tree?
HTML5 does not include a DOM implementation, so you have to provide one. If you just need a DOM tree:
var HTML5 = require('html5');
var jsdom = require('jsdom');
var DOMImplementation = jsdom.level(3).DOMImplementation;
var parser = new HTML5.DOMParser(new DOMImplementation());
var document = parser.parse('<p>I am a very small HTML document</p>');
console.log(document.getElementsByTagName("p")[0].textContent);
Also, take a look at SAXParser:
var HTML5 = require('html5');
var parser = new HTML5.SAXParser();
parser.contentHandler = {
  startDocument: function() {},
  endDocument: function() {},
  startElement: function(uri, localName, qName, atts) {
    console.log('Start of <' + localName + '> element');
  },
  endElement: function(uri, localName, qName) {
    console.log('End of <' + localName + '> element');
  },
  characters: function(ch, start, length) {
    console.log('Characters: ' + ch);
  }
};
parser.parse('<p>I am a very small HTML document</p>');
Start of <html> element
Start of <head> element
End of <head> element
Start of <body> element
Start of <p> element
Characters: I am a very small HTML document
End of <p> element
End of <body> element
End of <html> element
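To check a run like the one above in a usage test rather than by eye, the handler can push events into an array instead of logging them. This is only a sketch: createRecorder is a hypothetical helper, not part of HTML5, it assumes nothing beyond the contentHandler callback names shown above, and the events are fed in by hand so the snippet runs without the library.

```javascript
// Hypothetical helper (not part of the html5 module): a contentHandler
// that records events in an array so a test can compare them.
function createRecorder() {
  var events = [];
  return {
    events: events,
    startDocument: function() {},
    endDocument: function() {},
    startElement: function(uri, localName, qName, atts) {
      events.push('start:' + localName);
    },
    endElement: function(uri, localName, qName) {
      events.push('end:' + localName);
    },
    characters: function(ch, start, length) {
      events.push('text:' + ch);
    }
  };
}

// Events are fed by hand here. With the real parser you would assign
// the recorder to parser.contentHandler and call parser.parse(html).
var recorder = createRecorder();
recorder.startDocument();
recorder.startElement('', 'p', 'p', null);
recorder.characters('blah', 0, 4);
recorder.endElement('', 'p', 'p');
recorder.endDocument();
console.log(recorder.events.join(', '));
// start:p, text:blah, end:p
```

A test can then compare recorder.events against the expected list, which keeps the example and the check in one place.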
Great! I think that SAXParser is just what I need!
BUT!
<p>I am a very small HTML document</p>
Where are the html, head, etc. elements in the input?
Can I receive the info exactly about the input?
Where are the html, head, etc. elements in the input?
The parser creates all these elements as the HTML spec requires (browsers do the same). You can use the fragment parsing algorithm instead:
parser.parseFragment('<p>I am a very small HTML document</p>', 'body');
Fragment parsing was broken. I have just fixed it, so you need to pull the latest change (I am still not sure I fixed the bug properly).
Can I receive the info exactly about the input?
No, you receive repaired, well-formed output. This parser may create, forbid, or reparent elements, among other repairs, according to the HTML5 parsing specification.
var HTML5 = require('html5');
var parser = new HTML5.SAXParser();
parser.contentHandler = {
  startDocument: function() { console.log('!!!!'); },
  endDocument: function() { console.log('????'); },
  startElement: function(uri, localName, qName, atts) {
    console.log("qNAme == ", qName);
    console.log(atts);
    console.log('Start of <' + localName + '> element');
  },
  endElement: function(uri, localName, qName) {
    console.log('End of <' + localName + '> element');
  },
  characters: function(ch, start, length) {
    console.log('Characters: ' + ch);
  }
};
parser.parseFragment('<p>I am a very small HTML document</p>', 'body');
This code doesn't work! I'm sorry ) Probably there is a silly mistake I haven't noticed! Can you help me?
Can you give more information about contentHandler?
So far I know these: startDocument, endDocument, startElement, endElement, characters!
Is there anything else?
This code works. You should pull this fix: https://github.com/aredridel/html5/commit/4ff67be755fed8696c262f9c262da13c62d04c4d It is not available via npm yet.
Is there anything else?
No. There is a lexicalHandler that can handle comments, doctypes, and CDATA sections, but this feature is not implemented yet (though it is very easy to do).
Are you going to do this?) And can you say an approximate date of the release with these changes?
I mean, it would be great if you could combine contentHandler and lexicalHandler into one Handler!
This way, everybody will be able to create the DOM tree of HTML code in whatever manner they want!
Start an issue for 'em -- this one's about docs! -- and we'll go from there.
And that fix is shipped in v1.0.3
A lexical handler can now be defined:
parser.lexicalHandler = {
  comment: function(data) {
    console.log('Comment: ' + data);
  },
  startDTD: function(name, publicIdentifier, systemIdentifier) {
    console.log('Doctype: ' + name);
  },
  endDTD: function() {}
};
contentHandler is required, while lexicalHandler is optional.
http://www.saxproject.org/apidoc/org/xml/sax/ContentHandler.html
http://www.saxproject.org/apidoc/org/xml/sax/ext/LexicalHandler.html
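The single combined handler asked for earlier can be approximated in user code on top of these two properties. A minimal sketch, assuming only the callback names that appear in this thread; pick and attachHandler are hypothetical helpers, not part of HTML5, and a plain object stands in for the parser so the snippet runs on its own.

```javascript
// Callback names taken from the examples in this thread.
var CONTENT_EVENTS = ['startDocument', 'endDocument', 'startElement', 'endElement', 'characters'];
var LEXICAL_EVENTS = ['comment', 'startDTD', 'endDTD'];

// Hypothetical helper: copy the named callbacks off one combined handler.
function pick(handler, names) {
  var out = {};
  names.forEach(function(name) {
    // Missing callbacks become no-ops so every name can be called safely.
    out[name] = typeof handler[name] === 'function'
      ? handler[name].bind(handler)
      : function() {};
  });
  return out;
}

// Hypothetical helper: split one handler object across the two properties.
function attachHandler(parser, handler) {
  parser.contentHandler = pick(handler, CONTENT_EVENTS);
  parser.lexicalHandler = pick(handler, LEXICAL_EVENTS);
  return parser;
}

// Usage with a plain object standing in for the parser:
var seen = [];
var standIn = attachHandler({}, {
  startElement: function(uri, localName) { seen.push('<' + localName + '>'); },
  comment: function(data) { seen.push('<!--' + data + '-->'); }
});
standIn.contentHandler.startElement('', 'p', 'p', null);
standIn.lexicalHandler.comment('hi');
standIn.contentHandler.endElement('', 'p', 'p'); // no-op, not defined above
console.log(seen.join(' '));
// <p> <!--hi-->
```

Filling in no-ops means a caller only has to define the events it cares about.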
everybody will be able to create the DOM tree of HTML code in whatever manner they want!
Right. With SAXParser they can.
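As a rough illustration of building your own tree from these events, a content handler can keep a stack of open elements and attach each new node to the one on top. createTreeBuilder and the node shape ({ name, atts, children }) are hypothetical, not part of HTML5; atts is stored untouched because its exact shape is not described in this thread, and the events are fed by hand so the sketch runs without the parser.

```javascript
// Hypothetical helper (not part of the html5 module): a contentHandler
// that builds a plain-object tree using a stack of open elements.
function createTreeBuilder() {
  var root = { name: '#document', children: [] };
  var stack = [root];
  return {
    root: root,
    startDocument: function() {},
    endDocument: function() {},
    startElement: function(uri, localName, qName, atts) {
      var node = { name: localName, atts: atts, children: [] };
      stack[stack.length - 1].children.push(node); // attach to current parent
      stack.push(node);                            // node is now the open element
    },
    endElement: function(uri, localName, qName) {
      stack.pop(); // close the current element
    },
    characters: function(ch, start, length) {
      stack[stack.length - 1].children.push({ text: ch });
    }
  };
}

// Events fed by hand; with the real parser, assign the builder to
// parser.contentHandler before parsing.
var builder = createTreeBuilder();
builder.startDocument();
builder.startElement('', 'p', 'p', null);
builder.characters('I am a very small HTML document', 0, 31);
builder.endElement('', 'p', 'p');
builder.endDocument();
console.log(builder.root.children[0].name);
// p
```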
Is it in v1.0.3?
Yep.
There's a lack of documentation. Even the example in the README is outdated or isn't explained correctly.