fimad / scalpel

A high level web scraping library for Haskell.
Apache License 2.0

[question] How to select siblings? #41

Closed: yamadapc closed this issue 5 years ago

yamadapc commented 8 years ago

In libraries like jQuery/cheerio, given an HTML document like:

<p class="something">Here</p>
<p>Other stuff that matters</p>

You can select "Other stuff that matters" with an adjacent sibling selector like: .something + p

This structure, while not my cup of tea, is used every now and then on websites such as http://hackage.haskell.org.

Is there a way to do this?

fimad commented 8 years ago

There currently isn't a way to do this. A lot of scalpel's internals assume that all the data you care about is contained within a single sub-tree of the HTML document.

I've opened #48 as a sort of meta-issue to solve the general problem of selecting multiple sub-trees. If you have any ideas for what a good API would look like, please post them there :)

fimad commented 5 years ago

This is now supported as of version 0.6.0. This specific issue has been added as a regression test:

    ,   scrapeTest
            "Issue #41 regression test"
            "<p class='something'>Here</p><p>Other stuff that matters</p>"
            (Just "Other stuff that matters")
            (inSerial $ do
              seekNext $ matches $ "p" @: [hasClass "something"]
              stepNext $ text "p")
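
For anyone wanting to try this outside the test suite, here is a minimal standalone sketch of the same serial scraper, assuming scalpel >= 0.6.0 is installed. The name `siblingAfterSomething` and the `doc` binding are made up for illustration, and `scrapeStringLike` is used in place of the test suite's `scrapeTest` helper:

```haskell
{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import Text.HTML.Scalpel

-- Hypothetical name, not part of scalpel: returns the text of the <p>
-- that immediately follows <p class="something">.
siblingAfterSomething :: Scraper String String
siblingAfterSomething = inSerial $ do
  -- Advance through the sibling nodes until one matches p.something.
  seekNext $ matches $ "p" @: [hasClass "something"]
  -- Move exactly one node forward and read it as a <p>.
  stepNext $ text "p"

main :: IO ()
main = print $ scrapeStringLike doc siblingAfterSomething
  where
    -- Same input as the regression test, which expects
    -- Just "Other stuff that matters".
    doc :: String
    doc = "<p class='something'>Here</p><p>Other stuff that matters</p>"
```

One caveat, as far as I understand the serial API: `stepNext` moves exactly one node, and text nodes appear to count as nodes. If the two paragraphs are separated by whitespace, as in the original question's markup, the second step may need to be `seekNext $ text "p"` instead.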