aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml
https://aantron.github.io/lambdasoup
MIT License

Weird behavior of CSS child combinator with self-closing tags #55

Closed mooreryan closed 1 year ago

mooreryan commented 1 year ago

Not sure if this is a bug or me doing something weird.

First, here is some Lambda Soup code. It reads the soup from standard input, then defines a small function that prints the name and the "which" attribute of each node matching a given selector. A driver at the bottom calls it with two different selectors.

  open Soup

  let soup = read_channel In_channel.stdin |> parse |> signals |> from_signals

  let f soup selector =
    soup |> select selector
    |> iter (fun node ->
           match element node with
           | None ->
               ()
           | Some node ->
               let which = node |> R.attribute "which" in
               let name = node |> name in
               print_endline (name ^ ": " ^ which) )

  let () =
    f soup "a b" ;
    print_endline "===========" ;
    f soup "a > b"

Given this xml file

<?xml version="1.0"?>
<a>
  <b which="first">
    <c />
    <c />
    <b which="first-of-first">
      <c />
      <c />
    </b>
  </b>
  <b which="second">
    <c />
    <c />
  </b>
</a>

running that code would give this:

b: first
b: first-of-first
b: second
===========
b: first
b: second

Looks good: when using the CSS child combinator (>), I don't get the first-of-first b node, since it is nested inside the first b node rather than being a direct child of a.

Now, the weird thing is, if I change the b nodes to bb (or anything with more than one character), and then adjust the selector accordingly, I get this:

bb: first
bb: first-of-first
bb: second
===========
bb: first

Only the first bb node is printed and not the second one.

mooreryan commented 1 year ago

Just for reference, I tested it with Nokogiri (a Ruby XML/HTML parser):

require 'nokogiri'

data = File.open ARGV.first
xml_doc  = Nokogiri::XML data

xml_doc.css('a > bb').each do |x|
  puts x[:which]
end

and got the first and second nodes selected as expected.

aantron commented 1 year ago

This is almost certainly because you are having Lambda Soup read the input as HTML. Part of an HTML parser's job is error recovery, which the HTML spec itself defines.

The <a> tag can contain nested <b> tags, but a <bb> tag triggers error recovery and gets rotated outside the <a> tag, per the spec, which changes the structure of the loaded DOM.
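
One way to see what the HTML parser actually built is to print the parsed tree and compare it with the input. The following is only a sketch: `html_view` and `sample` are hypothetical names introduced here, and the input is a cut-down version of the file from this thread. It assumes the lambdasoup package is installed.

```ocaml
(* Requires the lambdasoup package. *)

(* Parse [text] with Lambda Soup's HTML parser and render the resulting
   tree with indentation, so any restructuring becomes visible. *)
let html_view (text : string) : string =
  Soup.parse text |> Soup.pretty_print

(* Cut-down, illustrative version of the input from this thread. *)
let sample = {|<a>
  <bb which="first"><c /></bb>
  <bb which="second"><c /></bb>
</a>|}

let () = print_string (html_view sample)
```

If the printed nesting differs from the nesting in the input, selectors such as "a > bb" are being matched against the recovered tree, not the one in the file.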

For this example, you might be able to replace <a> with some other tag name, but you probably need to parse the input as XML instead. See here.
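
A minimal sketch of parsing as XML, assuming the lambdasoup and markup packages are installed: the input goes through Markup.ml's XML parser and the resulting signal stream is handed to `Soup.from_signals`, so HTML error recovery never runs. The helper names `soup_of_xml` and `which_values` are hypothetical, and the sample is a shortened version of the file above.

```ocaml
(* Requires the lambdasoup and markup packages. *)
open Soup

(* Parse [text] as XML (not HTML) and build a Lambda Soup document from
   the resulting Markup.ml signal stream. *)
let soup_of_xml (text : string) : soup node =
  Markup.string text |> Markup.parse_xml |> Markup.signals |> from_signals

(* Collect the "which" attribute of every element matched by [selector],
   skipping elements that do not carry the attribute. *)
let which_values soup selector =
  soup |> select selector |> to_list |> List.filter_map (attribute "which")

(* Shortened, illustrative version of the input from this thread. *)
let xml = {|<a>
  <bb which="first">
    <c />
    <bb which="first-of-first"><c /></bb>
  </bb>
  <bb which="second"><c /></bb>
</a>|}

let () =
  let soup = soup_of_xml xml in
  which_values soup "a > bb" |> List.iter print_endline
```

With the tree built from XML signals, the nested bb stays inside the first bb, so "a > bb" should match only the two direct children of a.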

mooreryan commented 1 year ago

Ahhh okay I see the problem, makes sense. Thanks!!