aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml
https://aantron.github.io/lambdasoup
MIT License
380 stars 31 forks source link

Selector quoted attribute values #6

Closed yannham closed 8 years ago

yannham commented 8 years ago

According to reference https://www.w3.org/TR/css3-selectors/#attribute-selectors and https://www.w3.org/TR/CSS21/syndata.html#strings, attributes values in selectors can be either identifier or quoted strings. While identifiers are syntactically heavily restricted and I don't see any problem with lambdasoup being more resilient, selectors like

soup $ "[attr='a value']" match markups of the form <markup attr="&#39;a value&#39;"/>

and similarly soup $ "[attr=\"a value\"]" match markups of the form <markup attr="&quot;a value&quot;"/>

I see a few problems with that :

This PR tries to solve the problem by 1) modifying parse_quoted_string to parse string delimited by simple or double quotes, with possibility of escaping the delimiter inside the string if preceded by '\' 2) in the function that parses attributes values (parse_string), parse the string as a quoted string using parse_quoted_string if it begins with a simple or a double quote, or parse it exactly as before if it doesn't. 3) Adding some tests to validate the behavior

aantron commented 8 years ago

Thanks, I agree that this should be fixed. Will take a look at the patches in detail in a bit.

aantron commented 8 years ago

Looks great! Thanks again.

Could you please amend the first commit to take care of the inline nit?

Would you prefer a speedy update in OPAM, or are you comfortable using a development version for a while?

aantron commented 8 years ago

Also, this intentionally does not try to correctly support newlines, as described in https://www.w3.org/TR/CSS21/syndata.html#strings, but that's fine. That can be added later if someone needs it. I just want to note it, in case someone looks in this discussion.

yannham commented 8 years ago

Thanks, lambda-soup is a cool library, appreciate your work on it too. I added a commit to take care of the dangling space. I don't need a speedy opam update, development version is fine. It indeed does not try to support newlines (neither the precise syntax specified in the W3C reference, which is more restrictive), because in pure OCaml using a plain newline would do just fine I think. I guess problems could appear if one try to read W3C-compliant selectors from an external source, but as you state, it can be quite easily added later if needed : the lesser, the safer ^^

aantron commented 8 years ago

Could you please squash the whitespace fix into 55c3b91972f36387ae142427903b5a095c9d0245, giving this PR a 2-commit history? It will then be as nice and tidy as when originally submitted (thank you for that!), and ready for merge.

yannham commented 8 years ago

Should be ok now

aantron commented 8 years ago

Much appreciated!