aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml
https://aantron.github.io/lambdasoup
MIT License

How to use `$`? #48

Closed: vitalydolgov closed this 2 years ago

vitalydolgov commented 2 years ago

This is not really an issue, but rather a question about using Lambda Soup. For some reason I cannot use the selector operator `$` after taking a node by index (the second statement after the binding below). It works if I convert the node to a string and parse it again, or if I take the element of the node explicitly.

Is this intentional behavior? In the source code I see no restriction on the node type, so I'm a bit confused...

# #require "lambdasoup";;

# open Soup;;

# let s = "<p class=\"txtRed\">AA * A<span class=\"txtNormal\">B</span> * A<span class=\"txtNormal\">C</span></p>";;
val s : string = ...

# s |> parse $ "p" |> children |> R.nth 2 |> to_string;;
- : string = "<span class=\"txtNormal\">B</span>"

# s |> parse $ "p" |> children |> R.nth 2 $? "span";;
- : element node option = None

# s |> parse $ "p" |> children |> R.nth 2 |> R.element |> name;;
- : string = "span"

# s |> parse $ "p" |> children |> R.nth 2 |> to_string |> parse $ "span" |> to_string;;
- : string = "<span class=\"txtNormal\">B</span>"

aantron commented 2 years ago

In

# s |> parse $ "p" |> children |> R.nth 2 $? "span";;

`$?` selects from the descendants of the given node. In other words, it searches the DOM inside the span, which corresponds to the string B, and of course there are no elements at all to find there.

This might be confusing because the top-level node returned by `parse` is not the `<p>` element, but a "soup" (document) node that contains the `<p>` element as its child. It is done that way because, in general, the string you pass to `parse` may contain multiple elements, and indeed multiple nodes, since it might contain text at the top level.
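
To make the descendant rule concrete, here is a minimal sketch using the string from the question. The binding names (`second`, `inside_span`, `first_span`, `as_element`) are only illustrative, and it assumes `lambdasoup` is installed and OCaml 4.08 or later for `Option.is_some`.

```ocaml
(* Requires the lambdasoup library (in the toplevel: #require "lambdasoup";;). *)
open Soup

let s =
  "<p class=\"txtRed\">AA * A<span class=\"txtNormal\">B</span> * A<span class=\"txtNormal\">C</span></p>"

(* parse returns a document ("soup") node; <p> is a descendant of it. *)
let p = parse s $ "p"

(* Second child of <p> (R.nth is 1-based): the first <span>, as a general node. *)
let second = p |> children |> R.nth 2

(* Selecting from the span only looks at its descendants, here the text "B". *)
let inside_span = second $? "span"

(* Selecting from <p> does find the span, because it is a descendant of <p>. *)
let first_span = p $ "span"

(* If the node itself is what you want, narrow its type instead of selecting. *)
let as_element = R.element second

let () =
  Printf.printf "found inside span: %b\n" (Option.is_some inside_span);
  Printf.printf "first span text: %s\n" (R.leaf_text first_span);
  Printf.printf "second child is <%s>\n" (name as_element);
  Printf.printf "spans under <p>: %d\n" (p $$ "span" |> count)
```

So when you already hold the node you are after, `R.element` is probably the right tool, and `$`/`$?`/`$$` are for searching beneath a node.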

aantron commented 2 years ago

Likewise, when you convert your span DOM to a string and pass it back to parse, you get back a DOM consisting of a document whose child is the span element. I guess it's pretty annoying and non-algebraic that trying to round-trip an element through the parser doesn't give back an element, but a document containing that element.
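
If you do want to round-trip an element, one way to unwrap the document afterwards could look like the sketch below. `reparse_element` is a hypothetical helper, not part of Lambda Soup, and it assumes the serialized element parses back to exactly one top-level element, which may not hold for every tag.

```ocaml
(* Requires the lambdasoup library (in the toplevel: #require "lambdasoup";;). *)
open Soup

(* Hypothetical helper, not part of Lambda Soup: serialize an element, parse it
   again, and unwrap the resulting document node to get an element back.
   The R functions raise Failure if the first top-level node is missing or is
   not an element. *)
let reparse_element (e : element node) : element node =
  e |> to_string |> parse |> children |> R.first |> R.element

let () =
  let span = parse "<span class=\"txtNormal\">B</span>" $ "span" in
  let span' = reparse_element span in
  (* span' is an element node again, so name and further selection work on it. *)
  print_endline (name span');
  print_endline (to_string span')
```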

vitalydolgov commented 2 years ago

@aantron thank you for the quick answer, now I get it. That's not a problem, the library is very convenient to use 😊