aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml
https://aantron.github.io/lambdasoup
MIT License

Selector list #15

Open Lyokovic opened 6 years ago

Lyokovic commented 6 years ago

Hi,

I started using Lambda Soup and found that it does not seem to support selector lists, like ".bg1, .bg3". I need to parse an HTML document containing various <div> elements with the classes bg2, bg1, bgbc and bg3, and I want to keep only the bg1 and bg3 ones while preserving their order.

Would it be easy to implement this feature?

aantron commented 6 years ago

Yes, it should be fairly straightforward. One would have to:

  1. Extend the grammar of selectors with one more level: https://github.com/aantron/lambda-soup/blob/8084d5b86ce8f1223271fc1e67398ac618dacbda/src/soup.ml#L489

    simple_selector covers things like .class-foo and [attribute-bar]; combinators are >, +, etc. So this grammar can represent things like .class-foo > [attribute-bar]. It needs one more level of list to represent comma-separated lists of these.

  2. This is the parser's top-level function. It should stop being the top-level function and instead become a parser for a single comma-delimited item. A new top-level function should then wrap it, reading the commas and calling the current parser for everything in between. https://github.com/aantron/lambda-soup/blob/8084d5b86ce8f1223271fc1e67398ac618dacbda/src/soup.ml#L896-L913

  3. This is the select code. Its logic needs to be wrapped in a new top-level loop that tries additional selectors from the new top-level list if the preceding ones didn't yield a match. https://github.com/aantron/lambda-soup/blob/8084d5b86ce8f1223271fc1e67398ac618dacbda/src/soup.ml#L611-L647
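
The three steps above could look roughly like the sketch below. It is only an illustration under stated assumptions: the types and the names parse_single, parse_list, matches_single and select_list are hypothetical and much simpler than the real definitions in soup.ml, the toy parser expects spaces around combinators, and the matcher only handles combinator-free selectors over a flat list of elements.

```ocaml
(* Hypothetical, simplified types; NOT the definitions used in soup.ml. *)
type simple_selector =
  | Class of string        (* .class-foo *)
  | Attribute of string    (* [attribute-bar] *)

type combinator = Descendant | Child | Adjacent   (* whitespace, >, + *)

(* One comma-free selector, e.g. ".class-foo > [attribute-bar]". *)
type complex_selector = simple_selector * (combinator * simple_selector) list

(* Step 1: the extra level, a comma-separated list of the above. *)
type selector_list = complex_selector list

let parse_simple token =
  let n = String.length token in
  if n >= 2 && token.[0] = '.' then
    Class (String.sub token 1 (n - 1))
  else if n >= 3 && token.[0] = '[' && token.[n - 1] = ']' then
    Attribute (String.sub token 1 (n - 2))
  else
    failwith ("unsupported simple selector: " ^ token)

(* Stand-in for the current top-level parser: reads one comma-free item.
   Assumption: tokens are separated by spaces, e.g. ".a > .b". *)
let parse_single (s : string) : complex_selector =
  let tokens =
    String.split_on_char ' ' s |> List.filter (fun t -> t <> "") in
  let rec steps = function
    | [] -> []
    | ">" :: t :: rest -> (Child, parse_simple t) :: steps rest
    | "+" :: t :: rest -> (Adjacent, parse_simple t) :: steps rest
    | t :: rest -> (Descendant, parse_simple t) :: steps rest
  in
  match tokens with
  | [] -> failwith "empty selector"
  | first :: rest -> (parse_simple first, steps rest)

(* Step 2: new top-level parser that reads commas and delegates the rest.
   Splitting on ',' is naive: it would break on commas inside quoted
   attribute values, which this toy grammar does not have. *)
let parse_list (s : string) : selector_list =
  String.split_on_char ',' s |> List.map parse_single

(* Toy element model, assumed for the demo only. *)
type element = { classes : string list; attributes : string list }

let matches_simple sel el =
  match sel with
  | Class c -> List.mem c el.classes
  | Attribute a -> List.mem a el.attributes

(* Stand-in for the existing matching logic; combinators are not handled. *)
let matches_single ((first, steps) : complex_selector) el =
  steps = [] && matches_simple first el

(* Step 3: for each element, try the selectors in turn until one matches.
   Filtering the elements (not the selectors) keeps document order. *)
let select_list (selectors : selector_list) (elements : element list) =
  List.filter
    (fun el -> List.exists (fun sel -> matches_single sel el) selectors)
    elements

(* The case from this issue: keep only bg1 and bg3, in their original order. *)
let divs =
  List.map (fun c -> { classes = [c]; attributes = [] })
    ["bg2"; "bg1"; "bgbc"; "bg3"]

let kept = select_list (parse_list ".bg1, .bg3") divs
```

Looping over elements on the outside and over the selector list on the inside is a design choice made here so that results come back in document order, not grouped by selector.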

Lyokovic commented 6 years ago

Thanks, I'll take a look ASAP.