Open Appress opened 1 year ago
Hi there, parsing fails for some pages (e.g. this article).
To replicate, run the following and open the generated HTML in a browser:
const document = (new DOMParser).parseFromString(htmlFromTheArticle, 'text/html');
const html = document.body.innerHTML;
Instead of the original page, the output now includes raw HTML code:
Διαβάστε το πλήρες κείμενο του σημειώματος του CEO της UBS στο <a href="https://www.newmoney.gr/roh/bloomberg/to-esoteriko-simioma-tou-ceo-tis-ubs-pros-tous-ergazomenous-meta-tin-exagora-tis-credit-suisse/" target="_blank" rel="noopener noreferrer">newmoney.gr</a><br> <br> <strong><a href="https://www.protothema.gr/oles-oi-eidiseis/" target="_blank" rel="noopener noreferrer">Ειδήσεις σήμερα:</a><br> <br> <a href="https://www.protothema.gr/greece/article/1351532/xanthi-ston-eisaggelea-simera-o-36hronos-pou-skotose-ton-45hrono-epeidi-ton-theorise-roufiano/" target="_blank" rel="noopener noreferrer">...
This has already happened for many HTML documents. The culprit is htmlparser2: if I downgrade to v6.1.0, it works properly.
I tried to debug, and the problem originates in Tokenizer.ts. When I replace these lines:
if (this.isSpecial) {
    this.state = State.InSpecialTag;
    this.sequenceIndex = 0;
} else {
    this.state = State.Text;
}
with
this.state = State.Text;
it works properly. I'm not sure what the proper fix is, one that would not affect the performance of htmlparser2, so I opened this issue instead.
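For readers unfamiliar with what a "special tag" state does, here is a toy sketch of the idea. It is not htmlparser2's code. The `tokenize` function, the `honorSpecial` flag, and the `SPECIAL_TAGS` list are all hypothetical names made up for illustration, and the sample input is invented rather than taken from the failing article. It only shows how, once a tokenizer enters such a state, everything up to the matching closing tag can come out as plain text, which would look like raw HTML on the page.

```typescript
type Token = { kind: "open" | "close" | "text"; value: string };

// Assumption: an illustrative list, not necessarily what the real tokenizer uses.
const SPECIAL_TAGS = ["script", "style", "title"];

function tokenize(html: string, honorSpecial: boolean): Token[] {
  const tokens: Token[] = [];
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
  let pos = 0;
  let match: RegExpExecArray | null;

  while ((match = tagPattern.exec(html)) !== null) {
    // Emit any text between the previous tag and this one.
    if (match.index > pos) {
      tokens.push({ kind: "text", value: html.slice(pos, match.index) });
    }
    const isClose = match[1] === "/";
    const name = match[2].toLowerCase();
    tokens.push({ kind: isClose ? "close" : "open", value: name });
    pos = tagPattern.lastIndex;

    if (!isClose && honorSpecial && SPECIAL_TAGS.indexOf(name) !== -1) {
      // Toy "in special tag" state: treat everything up to the matching
      // closing tag (or the end of input) as raw text.
      const offset = html.slice(pos).search(new RegExp("</" + name, "i"));
      const stop = offset === -1 ? html.length : pos + offset;
      if (stop > pos) {
        tokens.push({ kind: "text", value: html.slice(pos, stop) });
      }
      pos = stop;
      tagPattern.lastIndex = stop;
    }
  }
  // Emit trailing text after the last tag, if any.
  if (pos < html.length) {
    tokens.push({ kind: "text", value: html.slice(pos) });
  }
  return tokens;
}

// Invented sample: a <title> that is never closed.
const sample = '<title>News<p>Hello <a href="x">link</a></p>';
const withSpecial = tokenize(sample, true);
const withoutSpecial = tokenize(sample, false);
console.log(withSpecial);
console.log(withoutSpecial);
```

With `honorSpecial` set to `true`, the markup after the unclosed `<title>` ends up in a single text token. With `false`, it is tokenized as ordinary tags. Whether the failing pages actually trigger this path in htmlparser2 is not confirmed here.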
> The culprit is htmlparser2, if I downgrade to v6.1.0

So this bug is in a library used by this repository? If that's the case, what are you expecting me to do here? 🤔