aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml
https://aantron.github.io/lambdasoup
MIT License
384 stars 31 forks source link

Add a context argument to Soup.parse #28

Closed dmbaturin closed 4 years ago

dmbaturin commented 4 years ago

Sometimes it's undesirable to infer the parsing context. One example is mentioned in https://github.com/aantron/markup.ml/issues/48 Some elements are allowed in both head and body, and it's not always possible to infer the context unambiguously.

A suggested solution is to add a context argument to Soup.parse. I'm ready to take it up, but we need to decide on syntax and behaviour first. Do we want to expose the polymorphic variant type there, or make it something less error-prone, like ?(body_context=true)?

aantron commented 4 years ago

Before continuing with this, is there a natural demand for it? That is, have you run into a need for this, that isn't covered by the band-aid fix in aantron/markup.ml#48? If not, my suggestion right now would be to fall back to Soup.from_signals paired with Markup.parse_html ~context, and we can expose ~context more directly in Lambda Soup only if resorting to this combination of functions becomes common. Another option is val parse_in_context.

If we do add this, I think we would want the fully general context parameter. In particular, there are more than <head> and <body> contexts. For a specific example, <noscript>foo</noscript> is:

and these are not yet all the different contexts!

dmbaturin commented 4 years ago

I agree, better not rush it until/unless it's common. I'm going to play with a user-configurable behaviour and from_signals to see how it feels.