fimad / scalpel

A high level web scraping library for Haskell.
Apache License 2.0

Allow direct manipulation of TagSpec object #83

Open tysonzero opened 5 years ago

tysonzero commented 5 years ago

Sometimes I would prefer to work with the node tree (and thus TagSpec) itself instead of the Scraper / SerialScraper interface.

It would be ideal for my use case if TagSpec and various functions for manipulating it (children, name, etc.) were exposed as a low-level API. The current high-level API would then be a layer on top of that and would stay the same as it is now, except perhaps for some extra functions for dropping into the low-level API when desired.

Of course TagSpec itself would have to be an abstract data type with a hidden constructor / fields rather than a tuple, to prevent various invariants from being violated. It would also probably be worth renaming the type to something like Html or Nodes or similar. Another thing to consider is whether it's worth having explicit types for when you know you have a single node vs. potentially zero or multiple nodes (Tree/Node vs. Forest/Nodes/Html), so that functions like name :: Node str -> str make more sense.
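To make the single-node vs. many-nodes distinction concrete, here is a minimal sketch. Every name in it (Raw, Node, Nodes, name, children, toList, single) is hypothetical and is not part of scalpel; Raw is only a stand-in for whatever internal representation the library actually uses.

```haskell
-- Hypothetical sketch; none of these names exist in scalpel.
-- In a real module the export list would omit the constructors, e.g.
--   module Html (Node, Nodes, name, children, toList, single) where

-- Stand-in for the hidden internal representation (an assumption).
data Raw str = Raw str [Raw str]

-- Exactly one node.
newtype Node str = Node (Raw str)

-- Zero or more sibling nodes.
newtype Nodes str = Nodes [Raw str]

-- Total because a Node always wraps exactly one node.
name :: Node str -> str
name (Node (Raw n _)) = n

-- The children of a single node form a (possibly empty) chunk.
children :: Node str -> Nodes str
children (Node (Raw _ cs)) = Nodes cs

-- View a chunk as a list of single nodes.
toList :: Nodes str -> [Node str]
toList (Nodes rs) = map Node rs

-- Succeeds only when the chunk holds exactly one node.
single :: Nodes str -> Maybe (Node str)
single (Nodes [r]) = Just (Node r)
single _ = Nothing
```

With two types, name can be total, and the "zero or many" case is pushed into single and toList rather than into every accessor.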

fimad commented 5 years ago

This has come up a few times. Historically, I've pushed back on exposing this type directly since I consider it an implementation detail, and I've already changed the type signature dramatically several times to get things to run faster.

However, I would be OK reworking things so that we are able to expose some more low level APIs while keeping the real internals hidden and out of the public API.

Do you have a proposal for what such an API would look like or an idea of what sort of operations you are looking for?

tysonzero commented 5 years ago

Essentially I would like 3 different types:

A list/vector of Node type that signifies a chunk of Html. The low level scrapeStringLike would want to output this type, and things like the children of an Element would have this type.

A Node type which signifies either an Element or a leaf node like Text or a Comment or even an unmatched opening/closing tag.

An Element type which signifies an actual DOM Element with a name and a list/map of attributes as well as a list/vector of children.
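As a rough illustration of the three types described above, something like the following would allow pattern matching and printing. All type, constructor, and field names here are made up for the sketch and do not exist in scalpel; attributes are kept as a plain association list to stay within base.

```haskell
-- Hypothetical types illustrating the request; not scalpel's API.

-- A chunk of HTML: zero or more sibling nodes.
type Nodes str = [Node str]

-- Either an element or some kind of leaf.
data Node str
  = NodeElement (Element str)
  | NodeText str
  | NodeComment str
  | NodeStray str -- an unmatched opening/closing tag, kept by name
  deriving (Eq, Show)

-- A DOM element with a name, attributes, and children.
data Element str = Element
  { elementName :: str
  , elementAttributes :: [(str, str)]
  , elementChildren :: Nodes str
  }
  deriving (Eq, Show)

-- Concatenate all text found anywhere under a chunk,
-- ignoring comments and stray tags.
innerText :: Monoid str => Nodes str -> str
innerText = foldMap go
  where
    go (NodeElement e) = innerText (elementChildren e)
    go (NodeText t) = t
    go _ = mempty
```

Because every type derives Show, a value can be dumped with print while debugging, which is the main motivation given above.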

I would really like to be able to pattern match on and inspect/print these types, it helps a lot with debugging and intuitiveness.

The exact details of this API and what's underneath it are not important, but basically anything that's easy/intuitive/debuggable with the above API should ideally not have to change too significantly to work.

Text.HTML.TagSoup.Tree is an example of an interface that would definitely work for this, as it meets all of the above requirements (technically one of the above types is a constructor, but that works too). We just also want the various combinators provided by scalpel for searching through these types quickly and easily. We also don't like the "preliminary" note at the top of that module.

fimad commented 5 years ago

I think that sounds doable. One area that I think still needs some consideration is the edge cases around how different types of malformed HTML are handled. We'd also need to ensure that we do not regress on performance; a lot of commits went into getting things running fast with the current data structures.

Another option to consider would be exposing a scraper that returns a list of TagSoup Tags. You could then combine this with Text.HTML.TagSoup.Tree to get a tree structure to work with:

tags :: StringLike str => Selector -> Scraper str [Tag str]
tags = foldSpec (\tag s -> tag : s) 

tree :: StringLike str => Selector -> Scraper str [TagTree str]
tree selector = tagTree <$> tags selector
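The second half of this idea, turning a flat tag list into a tree, can be tried with tagsoup alone. The sketch below assumes the tagsoup package is installed; parseTags stands in for the output of the proposed tags scraper (which does not exist yet), and elementNames and exampleTree are names made up for this example.

```haskell
import Text.HTML.TagSoup (parseTags)
import Text.HTML.TagSoup.Tree (TagTree (..), tagTree)

-- Names of all elements in document order (pre-order traversal).
elementNames :: [TagTree String] -> [String]
elementNames = concatMap go
  where
    go (TagBranch n _ cs) = n : elementNames cs
    go (TagLeaf _) = [] -- text, comments, and other leaves

-- The flat tag list here stands in for what a `tags` scraper would return.
exampleTree :: [TagTree String]
exampleTree = tagTree (parseTags "<ul><li>a</li><li>b</li></ul>")
```

Pattern matching on TagBranch and TagLeaf is what gives the inspectable tree asked for earlier in the thread.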

I suspect the first approach would take a nontrivial amount of work and I don't have much bandwidth myself to work on this right now. However, I'd be happy to take a patch if you want to take it on.