WebReflection / linkedom

A triple-linked lists based DOM implementation.
https://webreflection.medium.com/linkedom-a-jsdom-alternative-53dd8f699311
ISC License

Parsing fails and there is raw html code in rendered html #200

Open Appress opened 1 year ago

Appress commented 1 year ago

Hi there, parsing fails for some pages (e.g. this article).

To reproduce, parse the article's HTML as shown here and open the generated HTML in a browser:

    import { DOMParser } from 'linkedom';
    // htmlFromTheArticle: the article's HTML source as a string
    const document = new DOMParser().parseFromString(htmlFromTheArticle, 'text/html');
    const html = document.body.innerHTML;

Instead of reproducing the original page, the output contains raw HTML code shown as text:

Διαβάστε το πλήρες κείμενο του σημειώματος του CEO της UBS στο&nbsp;<a href="https://www.newmoney.gr/roh/bloomberg/to-esoteriko-simioma-tou-ceo-tis-ubs-pros-tous-ergazomenous-meta-tin-exagora-tis-credit-suisse/" target="_blank" rel="noopener noreferrer">newmoney.gr</a><br> <br> <strong><a href="https://www.protothema.gr/oles-oi-eidiseis/" target="_blank" rel="noopener noreferrer">Ειδήσεις σήμερα:</a><br> <br> <a href="https://www.protothema.gr/greece/article/1351532/xanthi-ston-eisaggelea-simera-o-36hronos-pou-skotose-ton-45hrono-epeidi-ton-theorise-roufiano/" target="_blank" rel="noopener noreferrer">...
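A quick way to check many documents for this symptom without opening each one in a browser is to look for escaped tags in the serialized output. The sketch below assumes that the misparsed markup ends up in text nodes and is therefore serialized as `&lt;...&gt;` entities; `countEscapedTags` is a hypothetical helper name, not part of linkedom or htmlparser2.

```javascript
// Hypothetical helper (not a linkedom or htmlparser2 API).
// Counts tag-like sequences that were serialized as escaped text,
// e.g. `&lt;a href="..."&gt;`, which a browser would display as raw HTML.
function countEscapedTags(html) {
  const matches = html.match(/&lt;\/?[a-zA-Z][\s\S]*?&gt;/g);
  return matches ? matches.length : 0;
}

// Usage sketch: a non-zero count suggests markup was treated as text.
const healthy = '<p>ok <a href="x">link</a></p>';
const broken = '<p>ok&amp;nbsp;&lt;a href="x"&gt;link&lt;/a&gt;</p>';
console.log(countEscapedTags(healthy), countEscapedTags(broken));
```

A page that legitimately displays HTML source (a tutorial, for example) would also produce a non-zero count, so this is only a rough filter for comparing output between htmlparser2 versions.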

This has already happened with many HTML documents. The culprit is htmlparser2; if I downgrade to v6.1.0, it works properly.

I tried to debug it, and the problem originates in Tokenizer.ts. When I simply replace these lines

    if (this.isSpecial) {
        this.state = State.InSpecialTag;
        this.sequenceIndex = 0;
    } else {
        this.state = State.Text;
    }

with

    this.state = State.Text;

it works properly. I'm not sure what the proper fix is, one that would not affect the performance of htmlparser2, so I opened this issue instead.
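To illustrate why the choice between the two states matters, here is a toy tokenizer. It is not htmlparser2's code and does not reproduce its bug; it only mimics the two states named in the excerpt above (`Text` and `InSpecialTag`) under the assumption that a "special" tag such as `title`, `script`, or `style` makes the tokenizer treat everything up to the matching close tag as raw text. The names `toyTokenize`, `SPECIAL`, and `honorSpecial` are made up for this sketch.

```javascript
// Toy model only: NOT htmlparser2's implementation.
const SPECIAL = new Set(["script", "style", "title"]);

function toyTokenize(html, honorSpecial = true) {
  const tokens = [];
  let special = null; // name of the special tag we are inside, if any
  let i = 0;
  while (i < html.length) {
    if (special !== null) {
      // "InSpecialTag": everything up to the matching close tag is raw text.
      const close = html.toLowerCase().indexOf("</" + special, i);
      const end = close === -1 ? html.length : close;
      if (end > i) tokens.push({ type: "text", value: html.slice(i, end) });
      i = end;
      special = null;
    } else if (html[i] === "<" && html.indexOf(">", i) !== -1) {
      // A complete tag: emit it and decide which state comes next.
      const end = html.indexOf(">", i) + 1;
      const tag = html.slice(i, end);
      tokens.push({ type: "tag", value: tag });
      const open = /^<([a-zA-Z][a-zA-Z0-9]*)/.exec(tag);
      if (honorSpecial && open && SPECIAL.has(open[1].toLowerCase())) {
        special = open[1].toLowerCase();
      }
      i = end;
    } else {
      // "Text": plain text up to the next "<".
      const next = html.indexOf("<", i + 1);
      const end = next === -1 ? html.length : next;
      tokens.push({ type: "text", value: html.slice(i, end) });
      i = end;
    }
  }
  return tokens;
}

// If the close tag is never matched, the rest of the input becomes one text token.
console.log(toyTokenize("<title>a<p>x</p>"));
// With the special state disabled, the same input is tokenized as normal markup.
console.log(toyTokenize("<title>a<p>x</p>", false));
```

In this toy model, passing `honorSpecial = false` plays the role of always switching to `State.Text`: the swallowed markup reappears as tags, but the contents of `script` and `style` would then be tokenized as markup too, which is presumably why that change is not a proper fix.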

WebReflection commented 1 year ago

The culprit is htmlparser2, if I downgrade to v6.1.0

So ... this bug is in a library used by this repository? If that's the case, what are you expecting me to do here? 🤔