aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml
https://aantron.github.io/lambdasoup
MIT License
380 stars 31 forks source link

Expose the matches_selector function in the interface #40

Closed dmbaturin closed 2 years ago

dmbaturin commented 3 years ago

There's a way to select elements from a tree, but there's no way to check if a node would match a given selector.

That can be convenient for filtering element lists returned by Soup.select. The :not() pseudo-class makes that much harder to do from code since you need intricate string manipulation, and it still doesn't cover all use cases easily, so I think exposing the already existing but internal matches_selector function is justified.

The case where I wanted that was the new ignore_selectors option for soupault's table of contents widget. The :not() pseudo-class wouldn't cut it because the widget needs to select all headings (h1, h2, ...) and the user potentially would have to add a :not part to each of them, while a separate ignore_selectors option composes with heading selection much more nicely.

I made a dirty hack to allow ignoring simple selectors (without combinators), but a full solution requires exposing the now-internal functionality of the Selector module.

dmbaturin commented 2 years ago

@aantron What do you think of this PR?

aantron commented 2 years ago

This looks fine, except that I would suggest to reorder the arguments so that they are root_node selector at_node.

dmbaturin commented 2 years ago

Done.

I agree it's a better API, I just followed your argument order in the initial version, under the assumption that it's what you want it to be.

aantron commented 2 years ago

Thanks! I don't remember why I chose that order for select. It's the opposite of the operators. Is it correct that you'd like a release with this function?

dmbaturin commented 2 years ago

Yes, if you can make a release with it, that would be great. I'll replace the hacky function in soupault with this one and also make a release then.

dmbaturin commented 2 years ago

Do you have an estimated timeframe for the new OPAM release?

(If you don't, that's time, I'm myself quite overloaded lately!)

aantron commented 2 years ago

I am coming back from overload, so I don't. It's not difficult to do, so I may do it at any moment :)

dmbaturin commented 2 years ago

I'm planning to make a new soupault release with some bug fixes, but if you plan to make a release say within a couple of week, I'll put that on hold to also replace the HTML.matches_selector with a non-hacky version.

dmbaturin commented 2 years ago

What's your plan regarding the OPAM release?

aantron commented 2 years ago

@dmbaturin I've just sent it to opam in https://github.com/ocaml/opam-repository/pull/19789.

dmbaturin commented 2 years ago

Thanks!