aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml
https://aantron.github.io/lambdasoup
MIT License

cannot select id with ":" in name #26

Closed: sanette closed this issue 5 years ago

sanette commented 5 years ago

Hi, thanks for this great library. I noticed the following behaviour:

# open Soup;;
# let html = "<div id=\"section:2\">Hello</div>";;
# let soup = parse html;;
# let div_ok = soup $ "div[id=section:2]";;
val div_ok : Soup.element Soup.node = <abstr>
# let div_wrong = soup $ "div#section:2";;
Exception:
Failure "Soup.Selector.parse: unknown pseudo-class or pseudo-element ':2'".

aantron commented 5 years ago

The second syntax shouldn't work, as I understand it:

  1. ID selectors are # followed by a CSS identifier: https://www.w3.org/TR/selectors-3/#id-selectors, second paragraph.

  2. Here is the identifier syntax: https://www.w3.org/TR/CSS21/syndata.html#value-def-identifier. It doesn't allow a literal :. I don't remember offhand whether Lambda Soup accepts an escape sequence to work around this, but technically, you'd have to insert the escape sequence in both the id attribute and the selector. Of course, if this is non-compliant HTML from an external source, we may have to work around it somehow, so I suggest either trying an escape sequence or sticking with the attribute selector [id=...].
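
As a rough illustration of the rule above, here is a minimal sketch that checks whether an id can be written after # without escapes, and otherwise falls back to the unquoted attribute selector form shown earlier in the thread. The names is_css_identifier and selector_for_id are hypothetical helpers, not part of Lambda Soup. The check follows the unescaped part of the CSS 2.1 identifier rule only, and the fallback assumes the id contains no whitespace or ] characters.

```ocaml
(* Hypothetical helpers for illustration; not part of Lambda Soup. *)

(* Characters CSS 2.1 allows in an unescaped identifier: letters, digits,
   '-', '_', and code points U+00A0 and higher. Simplification: any byte
   >= 0x80 is accepted, on the assumption that the input is UTF-8. *)
let is_ident_char c =
  match c with
  | 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '-' | '_' -> true
  | c -> Char.code c >= 0x80

let is_digit c = c >= '0' && c <= '9'

(* True if [s] is a CSS identifier that needs no escape sequences.
   It must be non-empty, contain only identifier characters, and must not
   start with a digit, two hyphens, or a hyphen followed by a digit. *)
let is_css_identifier s =
  let n = String.length s in
  let rec all_ident i = i >= n || (is_ident_char s.[i] && all_ident (i + 1)) in
  if n = 0 || not (all_ident 0) then false
  else if is_digit s.[0] then false
  else not (s.[0] = '-' && (n = 1 || s.[1] = '-' || is_digit s.[1]))

(* Build a selector string for an element id. Uses "#id" when the id is a
   plain CSS identifier, and the unquoted attribute form "[id=...]"
   otherwise. Assumes the id contains no whitespace or ']'. *)
let selector_for_id ?(tag = "") id =
  if is_css_identifier id then tag ^ "#" ^ id
  else tag ^ "[id=" ^ id ^ "]"
```

With these definitions, selector_for_id ~tag:"div" "section:2" produces the string div[id=section:2], which is the selector that worked in the original report, while an id such as intro is rendered as #intro.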

sanette commented 5 years ago

Thanks for the fast answer! I found this id in the OCaml manual, which is generated by Hevea ;)

aantron commented 5 years ago

Argh! I suggest also opening an issue in the OCaml repo about the non-compliant IDs.

sanette commented 5 years ago

Right; I'll do this

sanette commented 5 years ago

Note that Chris00 has a different interpretation. See https://github.com/ocaml/ocaml.org/issues/1093