fimad / scalpel

A high level web scraping library for Haskell.
Apache License 2.0

Single Level Selectors #21

Closed SavageMessiah closed 5 years ago

SavageMessiah commented 8 years ago

I can't see any way to select tags that are immediate children of a parent rather than ones at an arbitrary nesting. Basically I have some html like this:

<div id="a">
   <div>
      <span></span>
      <span></span>
   </div>
   <span></span>
</div>

I'd like to chroot into the nested div and do stuff that takes into account the nested spans. I'd ALSO like to process all the spans at the top level of div "a" without touching those under the other div.

In my actual use case I was able to work around this because of what I was doing. In general, though, this is a feature I would expect from any scraping API. If you were super-motivated and just added css selectors that would help a lot :p

I hope you can solve this. This is one of the most pleasant scraping APIs I've ever used, much more pleasant than HandsomeSoup.

fimad commented 8 years ago

Thanks for using the library and I'm glad you like the APIs :)

We can definitely make single level selectors happen. After some thinking, I'm leaning toward adding something like the following two methods:

-- | Constrains a selector to only match tags that are at the top level
-- of the current context.
top :: Selectable a => a -> Selector

-- | Short hand for `a // top b`.
(///) :: (Selectable a, Selectable b) => a -> b -> Selector 

As for full-on CSS selectors, if we were to add them I think it'd be best to preserve the syntax as opposed to trying to massage the expressions into valid Haskell. The only ways I know how to make that happen are to either use quasiquotes or parse strings at run-time. I'd lean toward the former for the type safety, but it would be a large departure from the library as it is today.

SavageMessiah commented 8 years ago

That sounds pretty good. Combining top and any would provide an easy way to walk the immediate children of a node as well.

rpglover64 commented 8 years ago

Every time I've reached for this library over the past year or so, I've been disappointed by the lack of this feature. Just now I was about to file a feature request, and someone's beaten me to the punch :smile:.

I look forward to seeing it implemented.

I like the interface for the most part, but I have a few suggestions to consider:

Thank you for developing such a useful tool!

SavageMessiah commented 8 years ago

Yeah, I don't think CSS selectors are really that important; I'd rather write Haskell anyway.

I'd agree that (//) being the shallow one and (///) being the deep one would make more sense but is it worth making a breaking change over?

I also agree that top being a special case of depth would be nice, though I think depth specifying a maximum depth rather than a specific one would be more useful. I'm not basing that belief on anything other than a gut feeling though.

rpglover64 commented 8 years ago

[I]s it worth making a breaking change over?

I think so (with the appropriate version bump, of course); no library depends on scalpel (okay... acme-everything does, but that doesn't count), and web scraping is notoriously fragile anyway. I don't imagine there's a lot of meticulously maintained applications depending on scalpel.

fimad commented 8 years ago

I agree that (///) makes more sense as the arbitrary depth operator but was hesitant to make a breaking change... but... since scalpel's small and if the non-trivial fraction of users on this thread think it's a good idea I'd be down :)

As for depth, I think it might be worthwhile to have one method for depth up to a limit and one for exact depth. I can't think of an elegant way to implement one given the other, so it seems like the library should provide both.

rpglover64 commented 8 years ago

I can't think of an elegant way to implement one given the other [...]

Perhaps not elegant, but if we have a "consider only nodes this deep or deeper" and any sort of intersection...

The library should provide both in any case.
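To make the intersection idea concrete, here is a minimal sketch that models depth constraints as plain predicates over a `Data.Tree` rose tree. It does not use scalpel; `DepthConstraint`, `upTo`, `atLeast`, `both`, `exactly` and `select` are hypothetical names invented for this illustration, and the root is assumed to sit at depth 0.

```haskell
import Data.Tree (Tree (..))

-- | Hypothetical model: a depth constraint is a predicate on a node's depth.
-- This is an assumption of the sketch, not how scalpel represents selectors.
type DepthConstraint = Int -> Bool

-- | Matches nodes at depth n or shallower.
upTo :: Int -> DepthConstraint
upTo n d = d <= n

-- | Matches nodes at depth n or deeper.
atLeast :: Int -> DepthConstraint
atLeast n d = d >= n

-- | Intersection of two constraints.
both :: DepthConstraint -> DepthConstraint -> DepthConstraint
both p q d = p d && q d

-- | Exact depth expressed through the two bounds and an intersection.
exactly :: Int -> DepthConstraint
exactly n = upTo n `both` atLeast n

-- | Labels of all nodes (pre-order, root at depth 0) whose depth satisfies
-- the constraint.
select :: DepthConstraint -> Tree a -> [a]
select p = go 0
  where
    go d (Node x children) =
      [x | p d] ++ concatMap (go (d + 1)) children

-- | A small example tree: a has children b and c, and b has child d.
example :: Tree String
example = Node "a" [Node "b" [Node "d" []], Node "c" []]
```

In GHCi, `select (exactly 1) example` picks out the immediate children of the root, while `select (upTo 1) example` also includes the root itself.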

sordina commented 6 years ago

How would this be implemented? It seems like currently there's a list of elements that forms a 'fuzzy path'. Would there be a new kind of element introduced to express adjacency?

fimad commented 6 years ago

My latest thinking on this is to have a new scraper, depth, which would return the depth of the match. This would be similar to the already existing position function.

You could get single level selectors by doing:

chroots "div" @: ["id" @= "a"] $ chroots "span" $ do
   guard =<< (1 ==) <$> depth 
   text anySelector

This isn't as concise as the originally proposed (///) but would be more flexible in that it would allow for conditions on arbitrary depths and would compose well with position.

As far as how this would actually be implemented, the depth of the current node could be added to the SelectContext type which holds ephemeral meta-data for nodes that can change depending on the context.

typesanitizer commented 6 years ago

I think having an API like

chroots "div" @: ["id" @= "a"] $ chroots "span" $ do
   guard =<< (1 ==) <$> depth 
   text anySelector

could lead to inefficient code in the presence of lots of nodes/nesting. IIUC, what is going to happen is that the entire tree will be flattened and then you will filter it, so we are not exploiting the fact that depth monotonically increases to trim the deeper branches. Instead, we are processing the whole tree every time.

Is my understanding correct?

fimad commented 6 years ago

That's a good point. In the use case of filtering to a constant depth this would be less efficient than having a selector which has a chance to short circuit DFS paths.

I am also open to alternate APIs and/or supporting multiple APIs here. I think there is value in being able to read the current depth, but it may not be the best way to enforce depth.
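The difference between the two strategies can be sketched on a plain rose tree. This is only an illustration of the idea discussed above, not scalpel's implementation: the `Html` type and every function name below are hypothetical, and nodes carry nothing but a tag name.

```haskell
import Data.Tree (Tree (..), flatten)

-- | Hypothetical stand-in for a parsed document: each node is a tag name.
type Html = Tree String

-- | Flatten-then-filter: walk every descendant, remember its depth, and only
-- afterwards keep the nodes with the wanted tag at the wanted depth.
flattenThenFilter :: Int -> String -> Html -> [Html]
flattenThenFilter n tag root =
  [node | (d, node) <- go 0 root, d == n, rootLabel node == tag]
  where
    go :: Int -> Html -> [(Int, Html)]
    go d node = (d, node) : concatMap (go (d + 1)) (subForest node)

-- | Pruned DFS: stop descending as soon as the target depth is reached.
prunedSearch :: Int -> String -> Html -> [Html]
prunedSearch n tag = go 0
  where
    go d node
      | d == n = [node | rootLabel node == tag]
      | otherwise = concatMap (go (d + 1)) (subForest node)

-- | Number of nodes the flattening strategy inspects: all of them.
visitedByFlatten :: Html -> Int
visitedByFlatten = length . flatten

-- | Number of nodes the pruned strategy inspects for target depth n.
visitedByPruned :: Int -> Html -> Int
visitedByPruned n = go 0
  where
    go d node
      | d >= n = 1
      | otherwise = 1 + sum (map (go (d + 1)) (subForest node))

-- | The shape of the markup from the original report.
example :: Html
example =
  Node "div" [Node "div" [Node "span" [], Node "span" []], Node "span" []]
```

Both strategies return the same matches. The pruned version never looks below the target depth, so the saving grows with the amount of nesting under the nodes being filtered.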

fimad commented 5 years ago

atDepth, which confines matches based on depth, has been added in version 0.6.0.

The selector to select <b> tags one level under <a> tags would be

"a" // "b" `atDepth` 1

Any additional functionality can be addressed in future issues if they prove necessary.
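Applied to the markup from the original report, usage might look like the sketch below. It assumes scalpel 0.6.0 or later is installed; the span text is placeholder content added only to tell the matches apart, and `html`, `topLevelSpans` and `allSpans` are names made up for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel

-- | The markup from the original report, with placeholder text in each span.
html :: String
html =
  "<div id=\"a\">\
  \<div><span>nested 1</span><span>nested 2</span></div>\
  \<span>top</span>\
  \</div>"

-- | Text of the spans that are immediate children of the div with id "a".
topLevelSpans :: Maybe [String]
topLevelSpans =
  scrapeStringLike html $
    texts (("div" @: ["id" @= "a"]) // ("span" `atDepth` 1))

-- | For contrast: text of the spans at any depth under the same div.
allSpans :: Maybe [String]
allSpans =
  scrapeStringLike html $
    texts (("div" @: ["id" @= "a"]) // "span")
```

The parentheses around `"span" `atDepth` 1` are redundant given the operators' fixities and are kept here only to make the grouping explicit.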