WebReflection / linkedom

A triple-linked lists based DOM implementation.
https://webreflection.medium.com/linkedom-a-jsdom-alternative-53dd8f699311
ISC License

[BUG] `document.head` is sometimes blank when meta tags occur before `<html>` #281

Closed Yash-Singh1 closed 4 months ago

Yash-Singh1 commented 4 months ago

First off, thanks for making this project! :)

I ran into this issue while migrating from jsdom over to linkedom.

Repro:

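const linkedom = require('linkedom');

// Thin wrapper so existing jsdom-style `new JSDOM(html)` call sites keep working
function JSDOM(html) { return linkedom.parseHTML(html); }

// CommonJS has no top-level await, so run the repro inside an async function
(async () => {
  const html = await fetch('https://xdaforums.com/t/double-tap-to-wake.3306272/').then(r => r.text());
  const {document} = new JSDOM(html);
  console.log(document.title); // ''
  // head isn't detected despite it existing in the original page
  console.log(document.head.innerHTML); // ''
  console.log(document.head.getElementsByTagName('title')); // NodeList (0) []
})();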

This problem seems to be related to this website having meta and link tags before the <html> tag opens (removing them manually fixes the issue):

Screenshot 2024-07-22 at 5 06 15 PM
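
Since removing the stray tags by hand fixes the issue, one possible workaround is to pre-process the markup before parsing. The snippet below is only a rough sketch: `hoistPreHtmlTags` is a hypothetical helper, not part of linkedom. It assumes the page has an explicit `<head>` tag and that no attribute value contains a `>` character. It moves any `<meta>`/`<link>` tags found before `<html>` to just inside `<head>`, using plain string handling.

```javascript
// Hypothetical helper (not a linkedom API): moves <meta>/<link> tags that
// appear before the opening <html> tag to just after the opening <head> tag.
function hoistPreHtmlTags(html) {
  const htmlOpen = html.search(/<html[\s>]/i);
  if (htmlOpen === -1) return html; // no <html> tag: leave the input untouched

  const before = html.slice(0, htmlOpen);
  const rest = html.slice(htmlOpen);

  // Collect the stray tags; anything else (e.g. the doctype) stays in place.
  const strayTag = /<(?:meta|link)\b[^>]*>/gi;
  const stray = before.match(strayTag) || [];
  if (stray.length === 0) return html;

  // Assumption: an explicit <head> exists; otherwise return the input as-is.
  const headOpen = /<head(?:\s[^>]*)?>/i;
  if (!headOpen.test(rest)) return html;

  const cleaned = before.replace(strayTag, '');
  return cleaned + rest.replace(headOpen, (m) => m + stray.join(''));
}
```

The result could then be passed to `linkedom.parseHTML(hoistPreHtmlTags(html))`. I have not verified that this is enough for every page with this layout.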

I'm not sure whether the spec allows this, but Chrome seems to handle the meta/link tags, so I think it makes sense to handle them in linkedom as well:

Screenshot 2024-07-22 at 5 07 25 PM

WebReflection commented 4 months ago

This is easily out of the project's goal/scope ... the moment I sanitize the entire Internet is the moment this project makes no sense anymore, and no performance will ever exist either ... you have playwright or any other crawler to sanitize the Web via real browsers, and I don't think this should ever affect LinkeDOM logic.