tysonzero opened this issue 5 years ago
For something analogous to `.:` in Aeson, are you envisioning a function with a type like `Scrapable a => Selector -> Scraper a` that can be used in place of existing primitives like `text` and `attr`?

This sounds useful to me; if you are interested in implementing this, feel free to fire up a PR.
Yes, that is basically what I was thinking.

After thinking about it some more, there is a key difference between Aeson and Scalpel that makes this change in primitives hard: in Aeson a `Parser` (as a side note, the name is rather misleading) has already had its input applied, so it is approximately `Maybe a`. In Scalpel a `Scraper` takes in input and is more like `Html -> Maybe a`, whereas string parsing libraries like parsec are more like `String -> Maybe (String, a)`, as they need to move along the string.
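To make the three shapes concrete, a minimal sketch (the type names are stand-ins, not real library types):

```haskell
data Html -- stand-in for a parsed document

type AesonStyle   a = Maybe a                      -- input already applied
type ScalpelStyle a = Html -> Maybe a              -- input passed in whole
type ParsecStyle  a = String -> Maybe (String, a)  -- input consumed stepwise
```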
String parsing libraries are more or less stuck with the `State`-like approach, but JSON/HTML/XML etc. libraries get to make a choice:
The advantage of the former (Aeson-style) approach is that it is significantly more powerful and lets you build basically everything on top of polymorphic primitives like `.:`, as long as you have a suitable instance that just returns the input as-is (`instance FromJSON Value`), as well as a `Monad` instance for the "Parser".
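For instance, with aeson's actual API, it is exactly the as-is `FromJSON Value` instance plus the `Monad` instance for `Parser` that lets deeper traversal be layered on `.:` alone; the field names below are made up for illustration:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson
import Data.Aeson.Types (Parser)

-- "payload" and "count" are hypothetical field names.
nested :: Object -> Parser Int
nested o = do
    inner <- o .: "payload"                  -- inner :: Value, returned as-is
    withObject "payload" (.: "count") inner  -- parsed further later
```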
The advantage of the latter (Scalpel-style) approach is that, because of its reduced power, you can potentially implement optimizations not otherwise possible, and allow for things like printing out the structure of the parser itself (e.g. in BNF form).
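As an illustration of that payoff, a first-order description (entirely hypothetical, not scalpel's actual internals) can be both interpreted and pretty-printed:

```haskell
-- A scraper description as plain data; selectors are strings for the sketch.
data Query
    = GetText String
    | GetAttr String String
    | Both Query Query

-- Because Query is first-order, its structure can be rendered (e.g. BNF-ish).
describe :: Query -> String
describe (GetText sel)   = "text(" ++ sel ++ ")"
describe (GetAttr a sel) = "attr(" ++ a ++ ", " ++ sel ++ ")"
describe (Both l r)      = describe l ++ " & " ++ describe r
```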
So with that said, I think a class would be nice either way, but `.:` equivalents are probably only worthwhile if Scalpel switched to the Aeson approach and allowed you to directly pass around the `Html` (`TagSpec`-ish) object and apply various parsers to it at any time. This would IMO be a very nice and intuitive interface. `SerialScraper` would probably still have a monadic interface since, unlike `Scraper`, it IS stateful, just like a Parsec parser.
One potential interface could be something like this:
```haskell
data Node str = ... -- a single node
data Html str = ... -- zero or more nodes

class StringLike str => FromNode str a where
    fromNode :: Node str -> Maybe a

class FromNode str a => FromHtml str a where
    fromHtml :: Html str -> Maybe a
    -- law: fromHtml =<< fromNode n = fromHtml n

-- always succeed
instance FromNode str (Node str)
instance FromNode str (Html str)
instance FromHtml str (Html str)

-- wouldn't always succeed, and would probably be better to intentionally leave off
-- instance FromHtml str (Node str)

prepare :: StringLike str => [Tag str] -> Html str
attr    :: StringLike str => String -> Node str -> Maybe str
text    :: StringLike str => Node str -> str
select  :: FromNode str a => Selector -> Html str -> [a]

inSerial :: StringLike str => Serial str a -> Html str -> Maybe a
stepNext :: FromNode str a => Serial str a
seekNext :: FromHtml str b => (Node str -> Maybe a) -> Serial str (a, b)
```
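A possible fleshing-out of that sketch. The `Serial` representation is my assumption (threading the remaining siblings, Parsec-style), and `commentTexts` shows `select` instantiated at `Node str` via the always-succeeding instance; the selector is made up:

```haskell
-- Assumed shape for Serial: step through the remaining nodes, Parsec-style.
newtype Serial str a = Serial ([Node str] -> Maybe ([Node str], a))

-- Usage sketch with the signatures above (selector via scalpel's @:/hasClass).
commentTexts :: StringLike str => Html str -> [str]
commentTexts html = map text (select ("div" @: [hasClass "comment"]) html)
```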
I might be misunderstanding `.:`, but it seems like the main benefit it provides is that it allows you to extract more complicated values than inner text or attributes without having to resort to explicitly chroot'ing into a sub-tree. For example:
```haskell
class Scrapable str a where
    scraper :: Scraper str a

extract :: (Scrapable str a, StringLike str) => Selector -> Scraper str a
extract selector = chroot selector scraper

extracts :: (Scrapable str a, StringLike str) => Selector -> Scraper str [a]
extracts selector = chroots selector scraper
```
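A hypothetical instance to show the intended usage; the `Comment` type and the selectors are invented for illustration:

```haskell
data Comment = Comment
    { commentAuthor :: String
    , commentBody   :: String
    }

instance Scrapable String Comment where
    scraper = Comment
        <$> text ("span" @: [hasClass "author"])
        <*> text ("div"  @: [hasClass "body"])

-- All comments on the page, via the extracts helper above.
allComments :: Scraper String [Comment]
allComments = extracts ("div" @: [hasClass "comment"])
```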
Internally, scalpel basically uses an interface like the one you propose; the context is just implicitly passed via a Monad in the public API. It seems like you could get something similar by partially applying the existing scraping functions and passing around a `Scraper str a -> Maybe a`.

Right now I think the parsing will probably be redone for each application, but you could probably restructure the internals such that the parsing is only performed once.
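A minimal sketch of that encoding using scalpel's existing `scrapeStringLike` entry point; `withPage` and the selectors are made up for illustration:

```haskell
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RankNTypes #-}
import Text.HTML.Scalpel

-- Partially apply the input once, then pass around the resulting runner,
-- which plays the role of a `Scraper str a -> Maybe a`.
withPage :: String -> ((forall a. Scraper String a -> Maybe a) -> r) -> r
withPage page k = k (scrapeStringLike page)

-- Both scrapers run against the same pre-applied context.
example :: String -> Maybe (String, [String])
example page = withPage page $ \run ->
    (,) <$> run (text "title") <*> run (texts "a")
```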
> I might be misunderstanding `.:` but it seems like the main benefit it provides would be that it allows you to extract more complicated values than inner text or attributes without having to resort to explicitly chroot'ing into a sub-tree.
That is essentially true. However, what makes this benefit so substantial in Aeson is that `.:` allows you to parse both fully defined end objects (like your `extract` above) and also a `Value`/`Object` etc. that can then be further parsed or passed around. So Aeson does not need an equivalent to both `extract` and `chroot`; it just needs `.:`.
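Concretely, with aeson's actual API the same `.:` serves both roles, because of the as-is `FromJSON Value` instance:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson
import Data.Aeson.Types (Parser)

-- "age" and "address" are made-up fields: one fully parsed end value,
-- one raw sub-document to be parsed or passed around later.
ageAndRest :: Object -> Parser (Int, Value)
ageAndRest o = (,) <$> o .: "age" <*> o .: "address"
```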
Currently, adding functions like the above to scalpel would not allow any existing functions to be removed, as they do not supersede any existing functions. So while they are nice convenience functions, they don't really simplify the interface or provide any composable benefits.
> Internally scalpel basically uses an interface like the one you propose, the context is just implicitly passed via a Monad in the public API. Seems like you could get something similar by partially applying the existing scraping functions and passing around a `Scraper str a -> Maybe a`.
To allow both possible APIs to be as clean and performant as possible, one option could be combining the `Html str -> Maybe a` approach with something like `ReaderT` over the top, for when you want to run a series of operations over the same context.
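A minimal sketch of that combination, assuming the `Html` type from the earlier proposal:

```haskell
import Control.Monad.Trans.Reader (ReaderT (..))

-- The ReaderT layer recovers the current monadic feel on top of the
-- direct `Html str -> Maybe a` style.
type ScraperM str a = ReaderT (Html str) Maybe a

runScraperM :: ScraperM str a -> Html str -> Maybe a
runScraperM = runReaderT
```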
This would be very useful for things like interacting with Servant. I also like the way Aeson uses this for things like polymorphic `.:`, so that could be worth looking into.

See this for the current class we are using for this purpose and this for the way we are integrating it with Servant.