Thanks for using the library and I'm glad you like the APIs :)
We can definitely make single level selectors happen. After some thinking, I'm leaning toward adding something like the following two methods:
-- | Constrains a selector to only match tags that are at the top level
-- of the current context.
top :: Selectable a => a -> Selector
-- | Shorthand for `a // top b`.
(///) :: (Selectable a, Selectable b) => a -> b -> Selector
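For concreteness, usage might look something like this (sketch only; neither combinator exists yet):
-- All <span>s that are immediate children of the matched <div>.
spans :: Scraper String [String]
spans = chroot ("div" @: ["id" @= "a"]) $ texts (top "span")
-- The same selection using the proposed shorthand:
spans' :: Scraper String [String]
spans' = texts (("div" @: ["id" @= "a"]) /// "span")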
As for full-on CSS selectors, if we were to add them I think it'd be best to preserve the syntax as opposed to trying to massage the expressions into valid Haskell. The only ways I know how to make that happen are to either use quasiquotes or parse strings at run-time. I'd lean toward the former for the type safety, but it would be a large departure from the library as it is today.
That sounds pretty good. Combining top and any would provide an easy way to walk the immediate children of a node as well.
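Something like this, say (sketch; top as proposed above, with anySelector standing in for the any-tag selector):
-- Visit each immediate child of the current context and grab its text.
childTexts :: Scraper String [String]
childTexts = chroots (top anySelector) (text anySelector)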
Every time I've reached for this library over the past year or so, I've been disappointed by the lack of this feature. Just now I was about to file a feature request, and someone's beaten me to the punch :smile:.
I look forward to seeing it implemented.
I like the interface for the most part, but I have a few suggestions to consider:
- (///) should be the arbitrary depth one, (//) should be the shallow one
- top is a special case of a depth :: Selectable a => Int -> a -> Selector (though it probably disallows negative numbers) which matches only if some element at the specified depth (0 for top level, etc.) matches the selector (sketched just below)

Thank you for developing such a useful tool!
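Spelling out that special case (hypothetical, using the signature suggested above):
-- Sketch only: 'top' as the zero case of the suggested 'depth' combinator.
top :: Selectable a => a -> Selector
top = depth 0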
Yeah, I don't think CSS selectors are really that important; I'd rather write Haskell anyway.
I'd agree that (//) being the shallow one and (///) being the deep one would make more sense, but is it worth making a breaking change over?
I also agree that top being a special case of depth would be nice, though I think depth specifying a maximum depth rather than a specific one would be more useful. I'm not basing that belief on anything other than a gut feeling though.
[I]s it worth making a breaking change over?

I think so (with the appropriate version bump, of course); no library depends on scalpel (okay... acme-everything does, but that doesn't count), and web scraping is notoriously fragile anyway. I don't imagine there are a lot of meticulously maintained applications depending on scalpel.
I agree that (///) makes more sense as the arbitrary depth operator but was hesitant to make a breaking change... but... since scalpel is small and the non-trivial fraction of users on this thread think it's a good idea, I'd be down :)
As for depth, I think it might be worthwhile to have a method for depth up to a limit and one for exact depth. I can't think of an elegant way to implement one given the other, so it seems like the library should provide both.
I can't think of an elegant way to implement one given the other [...]
Perhaps not elegant, but if we have a "consider only nodes this deep or deeper" and any sort of intersection...
The library should provide both in any case.
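For instance (entirely hypothetical names, none of which exist in the library):
-- Given 'atLeastDepth' ("this deep or deeper"), 'upToDepth' ("no deeper
-- than"), and an intersection on selectors, exact depth is their overlap:
exactDepth :: Selectable a => Int -> a -> Selector
exactDepth n s = atLeastDepth n s `selectorIntersection` upToDepth n s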
How would this be implemented? It seems like currently there's a list of elements that forms a 'fuzzy path'. Would there be a new kind of element introduced to express adjacency?
My latest thinking on this is to have a new scraper, depth, which would return the depth of the match. This would be similar to the already existing position function.
You could get single level selectors by doing:
chroots "div" @: ["id" @= "a"] $ chroots "span" $ do
guard =<< (1 ==) <$> depth
text anySelector
This isn't as concise as the originally proposed (///) but would be more flexible in that it would allow for conditions on arbitrary depths and would compose well with position.
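For example (sketch; position exists today, depth is the proposed scraper):
-- Pair each match's sibling position with its depth.
positionAndDepth :: Scraper String [(Int, Int)]
positionAndDepth = chroots "li" $ (,) <$> position <*> depth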
As far as how this would actually be implemented, the depth of the current node could be added to the SelectContext type, which holds ephemeral metadata for nodes that can change depending on the context.
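Roughly (field names illustrative; the actual internal record may differ):
-- Hypothetical shape of the per-node context record.
data SelectContext = SelectContext
    { ctxPosition :: Int  -- sibling index, backing the existing position scraper
    , ctxDepth    :: Int  -- new: depth of the node within the current context
    }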
I think having an API like
chroots "div" @: ["id" @= "a"] $ chroots "span" $ do
guard =<< (1 ==) <$> depth
text anySelector
could lead to inefficient code in the presence of lots of nodes/nesting. IIUC, what is going to happen is that the entire tree will be flattened and then you will filter it, so we are not exploiting the fact that depth monotonically increases to trim the deeper branches. Instead, we are processing the whole tree every time.
Is my understanding correct?
That's a good point. In the use case of filtering to a constant depth this would be less efficient than having a selector which has a chance to short circuit DFS paths.
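To illustrate the short-circuiting point on a plain rose tree (independent of scalpel's internals):
import Data.Tree (Tree(..))
-- Depth-limited traversal: subtrees below the cutoff are never visited,
-- unlike flattening the whole tree and filtering by depth afterwards.
nodesToDepth :: Int -> Tree a -> [a]
nodesToDepth cutoff (Node x ts)
    | cutoff <= 0 = [x]
    | otherwise   = x : concatMap (nodesToDepth (cutoff - 1)) ts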
I am also open to alternate APIs and/or supporting multiple APIs here. I think there is value in being able to read the current depth, but it may not be the best way to enforce depth.
atDepth, which confines matches based on their depth, has been added in version 0.6.0.
The selector to select <b> tags one level under <a> tags would be:
"a" // "b" `atDepth` 1
Any additional functionality can be addressed in future issues if they prove necessary.
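A minimal end-to-end sketch with the new combinator (HTML input made up for illustration):
import Text.HTML.Scalpel

main :: IO ()
main = print $ scrapeStringLike html scraper
  where
    html = "<a><b>direct</b><c><b>nested</b></c></a>" :: String
    -- Matches only <b> tags exactly one level below the matched <a>.
    scraper :: Scraper String [String]
    scraper = texts $ "a" // "b" `atDepth` 1
-- should print: Just ["direct"]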
I can't see any way to select tags that are immediate children of a parent rather than ones at an arbitrary nesting depth. Basically I have some HTML like this:
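<div id="a">
  <span>top-level</span>
  <span>also top-level</span>
  <div>
    <span>nested</span>
  </div>
</div>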
I'd like to chroot into the nested div and do stuff that takes into account the nested spans. I'd ALSO like to process all the spans at the top level of div "a" without touching those under the other div.
In my actual use case I was able to work around this because of what I was doing. In general, though, this is a feature I would expect from any scraping API. If you were super-motivated and just added CSS selectors, that would help a lot :p
I hope you can solve this; this is one of the most pleasant scraping APIs I've ever used, much more pleasant than handsomesoup.