Closed Iiridayn closed 5 years ago
Got getInnerText working well enough for me - see https://html.spec.whatwg.org/multipage/dom.html#the-innertext-idl-attribute if you'd like the gory details of what I skipped.
// doesn't handle table rows or cells, or apply css - should be adequate though
function getInnerText(node) {
  if (node.type === 'br')
    return '\n';

  let text = '';
  node.children.forEach(child => {
    if (child.constructor.name === 'HtmlTag') {
      text += getInnerText(child);
    } else if (child.constructor.name === 'TextNode') {
      text += child.text;
    } else {
      console.warn('Unhandled type', child.constructor.name);
    }
  });

  if (node.type === 'p')
    return '\n' + text + '\n';
  else
    return text;
}
Already some updates as I've moved on to the next scraper: the dataset property now checks that the key is not a symbol, and apparently dom is sometimes an array and sometimes not, so I handled that.
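For anyone following along, here is a minimal sketch of what those two fixes might look like. It is not the code from my scraper: makeDataset, toAttrName, and toNodeList are hypothetical names, and it assumes the node's attributes are available as a plain object of name/value pairs, which may not match html-soup's actual node shape.

```javascript
// Hypothetical helpers; `attributes` is assumed to be a plain object mapping
// attribute names to values (an assumption, not html-soup's documented shape).

// 'userId' -> 'data-user-id', mirroring the browser's dataset naming rule.
function toAttrName(key) {
  return 'data-' + key.replace(/[A-Z]/g, letter => '-' + letter.toLowerCase());
}

function makeDataset(attributes) {
  return new Proxy({}, {
    get(_target, key) {
      // Symbol keys (used by console.log, iteration, string coercion) are not
      // data attributes, and calling key.replace on a symbol would throw.
      if (typeof key === 'symbol') return undefined;
      return attributes[toAttrName(key)];
    },
  });
}

// Accept either a single root node or an array of root nodes.
function toNodeList(dom) {
  return Array.isArray(dom) ? dom : [dom];
}
```

With this, makeDataset({ 'data-user-id': '42' }).userId reads the data-user-id attribute, and any code that walks the tree can call toNodeList(dom) first instead of checking the shape everywhere.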
I didn't intend for this library to be a drop-in DOM replacement; there are popular libraries for Node.js that do a much better job of that. That is why the DOM representation in html-soup is more of an abstract syntax tree and doesn't directly expose attributes or support computed properties like innerText and innerHTML. With that said, I think dataset seems like a useful property to support, so I'll try to add it. However, the WHATWG standard for innerText depends heavily on how the HTML is rendered, so I don't think it really makes sense to implement it for html-soup. (For example, your code is missing special cases for <ul> and <ol> elements, which have a newline between each list element in Chrome's innerText implementation.)
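To illustrate the list case, a rough sketch follows. It is only an approximation: it appends a newline after every <li>, so it leaves a trailing newline that Chrome would not, and it still ignores CSS. The HtmlTag and TextNode classes here are stand-ins defined so the snippet runs on its own; they are assumed to mirror the type, children, and text fields used in the function above, and getInnerTextWithLists is a hypothetical name.

```javascript
// Stand-in node classes so the sketch is self-contained. The real html-soup
// classes are assumed (not confirmed) to expose the same fields.
class TextNode {
  constructor(text) { this.text = text; }
}
class HtmlTag {
  constructor(type, children = []) {
    this.type = type;
    this.children = children;
  }
}

function getInnerTextWithLists(node) {
  if (node.type === 'br') return '\n';

  let text = '';
  for (const child of node.children) {
    if (child instanceof HtmlTag) text += getInnerTextWithLists(child);
    else if (child instanceof TextNode) text += child.text;
  }

  if (node.type === 'p') return '\n' + text + '\n';
  // Approximation: end each list item with a newline so items of a
  // <ul> or <ol> land on separate lines.
  if (node.type === 'li') return text + '\n';
  return text;
}
```

Getting closer to the spec would mean collapsing adjacent line breaks and trimming them at the start and end, which is where the rendering dependence starts to bite.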
Added in 03236c7df74d8fc749ba3e314fa6df755ddd7a65
Makes sense. Might you recommend a drop-in DOM replacement which doesn't use a virtual browser and has minimal dependencies? I admit I only spent about 90 minutes searching before I settled on using your library since it beat out the best alternatives I've been able to find so far.
Yeah - I looked into both. cheerio seems to be pretty popular, but mimics jQuery instead of querySelector - and while I've only touched on it peripherally and a decade ago, I recall the syntax being slightly different. jsdom on the other hand supports script execution... which is fantastic only for those who need the complexity. Your library on the other hand gets me 95% of the way there, and adding stuff like dataset was the work of only a couple hours. I suspect with jsdom I'd still be reading their documentation... :/.
I may have misremembered though. I might just be acting old and stubborn to resist adding jQuery as a dependency to my browser extension (which I'd think is kinda ridiculous). Still, with it being a 95% solution and me being too stubborn to risk needing a jQuery dependency, html-soup was exactly what I was looking for :).
I used something like this to get the attributes I needed (getInnerText redacted as it's currently returning wrong values):
And call it as:
Having support for this on the library level would be great. My use case: I'm sharing scraping code between the server (Node.js) and a browser extension. I've also wrapped htmlSoup.select to behave like document.querySelector and document.querySelectorAll, as my scraper code has been using those two functions.