WebReflection / linkedom

A triple-linked lists based DOM implementation.
https://webreflection.medium.com/linkedom-a-jsdom-alternative-53dd8f699311
ISC License

bug: innerText returns content of child <style> tags #145

Closed wizzard0 closed 2 years ago

wizzard0 commented 2 years ago

First, thanks for linkedom! It's really fast and robust compared to the alternatives, AND it has way more and way better examples, which is awesome.

Now, for the issue :) I'm trying to extract text from webpages to declutter them (like Reader mode), but I observed that innerText also returns the contents of tags that browsers exclude from the parent node's innerText, <style> in particular.

Repro (compare to document.documentElement.innerText in browser console on any webpage)

Note that if you query innerText on the style tag directly, you'll get the CSS in the browser too. But not if you query innerText of the enclosing element.

let {parseHTML} = require('linkedom');
let str = "<html><p><style>p{margin:0;}</style>visible</p></html>";
let text = parseHTML(str).document.documentElement.innerText;
console.log(text); 
// expected: "visible"
// observed: "p{margin:0;}\nvisible"
let text2 = parseHTML(str).document.getElementsByTagName('style')[0].innerText;
console.log(text2); 
// observed: "p{margin:0;}" - this is OK!

WebReflection commented 2 years ago

I don't have use cases for this, and I guess yours is a crawler based on linkedom ... but if it's real DOM behavior you are after, with an engine that knows what is visible, what is not, and what content should be considered for innerText, I suggest you use a real browser and not this project. I have zero interest in bringing in here all the quirks that innerText, a thing added after Internet Explorer, has ... this is not why this project exists. That said: PR welcome if it doesn't destroy performance and it's not in the way of future features and improvements 👍

wizzard0 commented 2 years ago

OK I accept the point that innerText is non-normative as per https://html.spec.whatwg.org/multipage/dom.html#the-innertext-idl-attribute

In the meantime, here's a workaround if anybody stumbles on the same issue:

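  // BLOCK_ELEMENTS was not defined in the original snippet; this partial set is an assumption.
  const BLOCK_ELEMENTS = new Set(['p', 'div', 'br', 'li', 'tr', 'h1', 'h2', 'h3']);
  // Walk every node under `document` (no filter argument, so all node types are visited).
  let w = document.createTreeWalker(document);
  let t = [];
  while (true) {
    let n = w.nextNode(); if (!n) { break; }
    if (n.nodeType == 3 && n.parentNode?.localName != 'style') {
      // text node (nodeType 3) that is not a direct child of <style>: keep it
      t.push(n.textContent);
    } else if (BLOCK_ELEMENTS.has(n.localName)) {
      // block-level element: start a new line
      t.push('\n');
    }
  }
  let text = t.join(' ');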
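
For reuse, the same idea can be written as a small function. This is only a sketch under stated assumptions, not linkedom's behavior and not the spec's innerText algorithm: the name `visibleText` and the `SKIPPED` and `BLOCKS` lists are hypothetical, and it uses plain recursion over `childNodes` instead of a TreeWalker so it relies on nothing beyond `nodeType`, `localName`, `childNodes` and `textContent`. Like browsers, it still returns the CSS when called on the `<style>` element itself.

```javascript
// Assumption: elements whose text should not count as rendered text.
// Illustrative list only, not taken from linkedom or the HTML spec.
const SKIPPED = new Set(['style', 'script', 'template']);

// Assumption: a partial list of block-level elements that start a new line.
const BLOCKS = new Set(['p', 'div', 'li', 'br', 'tr', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']);

const TEXT_NODE = 3;

// Hypothetical helper: collect the text under `root`, ignoring SKIPPED subtrees.
function visibleText(root) {
  const parts = [];
  const visit = (n) => {
    if (n.nodeType === TEXT_NODE) {
      parts.push(n.textContent);
      return;
    }
    // Skip <style> and friends, unless the caller asked for that element directly.
    if (n !== root && SKIPPED.has(n.localName)) return;
    if (BLOCKS.has(n.localName)) parts.push('\n');
    for (const child of n.childNodes || []) visit(child);
  };
  visit(root);
  // Drop spaces and tabs around line breaks, then trim both ends.
  return parts.join('').replace(/[ \t]*\n[ \t]*/g, '\n').trim();
}
```

With linkedom this could be called as `visibleText(parseHTML(str).document.documentElement)`. It does not look at CSS at all, so elements hidden via `display: none` or the `hidden` attribute are still included.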

WebReflection commented 2 years ago

@wizzard0 if that's all it takes I might implement that myself but ...

In short, I am not sure you've solved the issue there, but if you did, I can easily bring that in or you can file a PR.

wizzard0 commented 2 years ago

> In short, I am not sure you've solved the issue there

Yeah I'm also not sure. I've searched a bit for a spec that could be followed and haven't found one.

So I guess I need to collect more edge cases first anyway.