Closed: wizzard0 closed this 2 years ago
I don't have use cases for this, and I guess yours is a crawler based on linkedom ... but if it's real DOM behavior you are after, with an engine that knows what is visible, what is not, and what content should be considered for innerText, I suggest you use a real browser and not this project. I have zero interest in bringing in here all the quirks that innerText, a thing added after Internet Explorer, has ... this is not why this project exists, but also: PR welcome if it doesn't destroy performance and it's not in the way of future features and improvements 👍
OK I accept the point that innerText is non-normative as per https://html.spec.whatwg.org/multipage/dom.html#the-innertext-idl-attribute
In the meantime, here's a workaround in case anybody stumbles on the same issue:
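// BLOCK_ELEMENTS was left undefined in the original snippet; this partial list is an assumption.
const BLOCK_ELEMENTS = new Set(['article', 'blockquote', 'div', 'h1', 'h2', 'h3', 'li', 'p', 'pre', 'section', 'tr']);
const walker = document.createTreeWalker(document);
const parts = [];
let node;
while ((node = walker.nextNode())) {
  if (node.nodeType === 3 && node.parentNode?.localName !== 'style') {
    // Keep every text node except the CSS inside <style>.
    parts.push(node.textContent);
  } else if (BLOCK_ELEMENTS.has(node.localName)) {
    // Mark the start of a block-level element with a newline.
    parts.push('\n');
  }
}
const text = parts.join(' ');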
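For anyone who wants to try the same idea outside a full DOM, here is a minimal sketch of the workaround as a standalone function. It is not linkedom's API and not a spec-compliant innerText: `extractText`, the `BLOCK_ELEMENTS` list, and the extra `script` entry in `SKIPPED` are all assumptions made for illustration. It only relies on `nodeType`, `localName`, `childNodes`, and `textContent`, so it should work on any DOM-like tree.

```javascript
// Assumed, partial list of block-level tags; the thread never defines BLOCK_ELEMENTS.
const BLOCK_ELEMENTS = new Set([
  'address', 'article', 'blockquote', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
  'li', 'ol', 'p', 'pre', 'section', 'table', 'tr', 'ul',
]);
// Tags whose subtree is skipped. 'style' comes from the thread; 'script' is an added assumption.
const SKIPPED = new Set(['style', 'script']);

const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: depth-first walk that collects text and marks block boundaries.
function extractText(root) {
  const parts = [];
  const visit = (node) => {
    if (node.nodeType === TEXT_NODE) {
      parts.push(node.textContent);
      return;
    }
    // Skip the whole subtree of <style>/<script>.
    if (node.nodeType === ELEMENT_NODE && SKIPPED.has(node.localName)) return;
    // A block-level element starts on a new line.
    if (BLOCK_ELEMENTS.has(node.localName)) parts.push('\n');
    for (const child of node.childNodes ?? []) visit(child);
  };
  visit(root);
  return parts
    .join(' ')
    .replace(/[ \t]+/g, ' ')   // collapse runs of spaces and tabs
    .replace(/ ?\n ?/g, '\n')  // drop spaces around line breaks
    .replace(/\n+/g, '\n')     // collapse empty lines
    .trim();
}
```

With a parsed page this would presumably be called as `extractText(document.documentElement)`. Unlike the browser's innerText, the sketch knows nothing about CSS visibility, so hidden elements still contribute their text.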
@wizzard0 if that's all it takes I might implement that myself but ...
In short, I am not sure you've solved the issue there, but if you did, I can easily bring that in or you can file a PR.
> In short, I am not sure you've solved the issue there
Yeah I'm also not sure. I've searched a bit for a spec that could be followed and haven't found one.
So I guess I need to collect more edge cases first anyway.
First, thanks for linkedom! It's really fast and robust compared to the alternatives, AND it has way more and way better examples, which is awesome.
Now, for the issue :) I'm trying to extract text from webpages to declutter them (like Reader mode), but observed that innerText also returns the contents of tags that browsers don't include in the parent node's innerText, <style> in particular.
Repro (compare to document.documentElement.innerText in the browser console on any webpage)
Note that if you query innerText on the style tag directly, you'll get the CSS in the browser too. But not if you query innerText of the enclosing element.