fimad / scalpel

A high level web scraping library for Haskell.
Apache License 2.0

Sequences #45

Closed xkollar closed 5 years ago

xkollar commented 8 years ago

Hi. I like your library :+1:. However, I do not see an obvious way to parse (or scrape)

<body>
  <h1>title1</h1>
  <h2>title2 1</h2>
  <p>text 1</p>
  <p>text 2</p>
  <h2>title2 2</h2>
  <p>text 3</p>
  <h2>title2 3</h2>
</body>

into something like

type Title = String
type Paragraph = String -- For simplicity
data Part = Part Title [Paragraph]

expected :: [Part]
expected =
    [ Part "title2 1" ["text 1", "text 2"]
    , Part "title2 2" ["text 3"]
    , Part "title2 3" []
    ]

If I am just missing something, would you consider adding this to the examples? Or maybe a slight change to the combinators? Or maybe introduce some kind of sequence operator?

Probably related to issue #41.

Thanks :-).

fimad commented 8 years ago

Yeah, unfortunately there isn't a great way to do this today :(

A lot of scalpel's internals currently rely on the assumption that an HTML document is a tree and scraping/parsing involves selecting the sub-tree that you care about and extracting data from that sub-tree.

I've opened up #48 as a sort of meta-issue to solve the general problem of selecting multiple sub-trees. If you have any ideas for what a good API would look like, please post them there :)
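
To make the limitation concrete, here is a minimal sketch (not part of the original thread) of what plain tree-based selection gives for the document in the question. The names `html` and `flat` are hypothetical. It assumes scalpel's `chroot`, `texts` and `scrapeStringLike`; it should yield the headings and the paragraphs as two independent lists, so the link between a heading and the paragraphs that follow it is lost.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel

-- The document from the original question (hypothetical name).
html :: String
html = unlines
    [ "<body>"
    , "  <h1>title1</h1>"
    , "  <h2>title2 1</h2>"
    , "  <p>text 1</p>"
    , "  <p>text 2</p>"
    , "  <h2>title2 2</h2>"
    , "  <p>text 3</p>"
    , "  <h2>title2 3</h2>"
    , "</body>"
    ]

-- Tree-based selection: every <h2> and every <p> under <body>.
-- Each selector is evaluated against the whole sub-tree, so the two
-- result lists carry no information about sibling order.
flat :: Scraper String ([String], [String])
flat = chroot "body" $ do
    titles     <- texts "h2"
    paragraphs <- texts "p"
    return (titles, paragraphs)

main :: IO ()
main = print (scrapeStringLike html flat)
```

Zipping the two lists back together cannot recover the grouping, because the number of paragraphs per heading varies.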

fimad commented 5 years ago

This is now supported as of version 0.6.0. This specific issue has been added as a regression test:


    ,   scrapeTest
            "Issue #45 regression test"
            (unlines [
              "<body>"
            , "  <h1>title1</h1>"
            , "  <h2>title2 1</h2>"
            , "  <p>text 1</p>"
            , "  <p>text 2</p>"
            , "  <h2>title2 2</h2>"
            , "  <p>text 3</p>"
            , "  <h2>title2 3</h2>"
            , "</body>"
            ])
            (Just [
              ("title2 1", ["text 1", "text 2"])
            , ("title2 2", ["text 3"])
            , ("title2 3", [])
            ])
            (chroot "body" $ inSerial $ many $ do
                title <- seekNext $ text "h2"
                ps <- untilNext (matches "h2") (many $ do
                  -- New lines between tags count as text nodes, skip over
                  -- these.
                  optional $ stepNext $ matches textSelector
                  stepNext $ text "p")
                return (title, ps))
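
For readers who want to try this outside the test suite, the following is a minimal standalone sketch that wraps the same scraper in a program and builds the `Part` values from the original question. The names `html`, `parts` and `main` are illustrative, and the `deriving` clause is added here only so the result can be compared. It assumes scalpel 0.6.0 or later and uses only the combinators that appear in the regression test above.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative (many, optional)
import Text.HTML.Scalpel

type Title = String
type Paragraph = String

data Part = Part Title [Paragraph]
    deriving (Eq, Show)

-- The document from the original question (hypothetical name).
html :: String
html = unlines
    [ "<body>"
    , "  <h1>title1</h1>"
    , "  <h2>title2 1</h2>"
    , "  <p>text 1</p>"
    , "  <p>text 2</p>"
    , "  <h2>title2 2</h2>"
    , "  <p>text 3</p>"
    , "  <h2>title2 3</h2>"
    , "</body>"
    ]

expected :: [Part]
expected =
    [ Part "title2 1" ["text 1", "text 2"]
    , Part "title2 2" ["text 3"]
    , Part "title2 3" []
    ]

-- Walk the children of <body> in order: seek the next <h2>, then
-- collect <p> siblings until the following <h2> (or the end).
parts :: Scraper String [Part]
parts = chroot "body" $ inSerial $ many $ do
    title <- seekNext $ text "h2"
    ps <- untilNext (matches "h2") $ many $ do
        -- Whitespace between tags shows up as text nodes; skip one if present.
        _ <- optional $ stepNext $ matches textSelector
        stepNext $ text "p"
    return (Part title ps)

main :: IO ()
main = print (scrapeStringLike html parts == Just expected)
```

The only change from the regression test is that each `(title, ps)` tuple becomes a `Part`, so the comparison with `expected` should hold whenever the test itself passes.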