TDs are ignored if not inside a TABLE element

bjesus commented 2 months ago

Hey, I'm working on a scraper that uses cascadia.

I recently noticed that cascadia ignores <td>s if they appear without a wrapping <table> element:

$ echo '<p>test</p>' | cascadia -i -o -c 'p'
1 elements for 'p':
<p>test</p>

$ echo '<td>test</td>' | cascadia -i -o -c 'td'
0 elements for 'td':

$ echo '<table><td>test</td></table>' | cascadia -i -o -c 'td'
1 elements for 'td':
<td>test</td>

Why is that? Is there a workaround? And are there other elements that get ignored in certain situations? Naming the element foo works completely fine, so it doesn't even have to be a real HTML element, but for some reason <td>s fail.

andybalholm commented 2 months ago

The HTML spec only allows <td> elements inside tables. So a spec-compliant HTML parser ignores <td> tags if they aren't inside a table. Try opening an HTML file like that in your browser, and inspecting it with the developer tools, and you'll probably get the same results.

bjesus commented 2 months ago

I understand. Is this part of net/html or cascadia? Is there a way around it, perhaps by parsing my DOM as if it were plain XML, so that elements aren't dropped for not being HTML-compliant?

andybalholm commented 2 months ago

It's part of net/html. Depending on what you're trying to do, ParseFragment might be what you need. Give it a table or tr node as context for what you want to parse.

bjesus commented 2 months ago

Got it! I'll figure out some workaround then. Thank you very much for the help, and for awesome cascadia!

andybalholm / cascadia

TDs are ignored if not inside a TABLE element #64