Closed xkollar closed 5 years ago
Yeah, unfortunately there isn't a great way to do this today :(
A lot of scalpel's internals currently rely on the assumption that an HTML document is a tree and scraping/parsing involves selecting the sub-tree that you care about and extracting data from that sub-tree.
I've opened up #48 as a sort of meta-issue to solve the general problem of selecting multiple sub-trees. If you have any ideas for what a good API would look like, please post them there :)
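To make the sub-tree model concrete, here is a minimal sketch of the kind of scraping it handles well. The input is hypothetical (it is not from this issue): each section is wrapped in its own `<div>`, so every result corresponds to exactly one sub-tree that `chroots` can select. The names `nested` and `sections` are made up for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel

-- Hypothetical input: every heading and its paragraphs share a parent <div>.
nested :: String
nested =
  "<body>\
  \<div><h2>a</h2><p>1</p><p>2</p></div>\
  \<div><h2>b</h2><p>3</p></div>\
  \</body>"

-- Select each <div> sub-tree, then extract data from inside that sub-tree.
sections :: Scraper String [(String, [String])]
sections = chroots "div" $ do
  title <- text "h2"   -- first <h2> inside the current <div>
  ps    <- texts "p"   -- every <p> inside the current <div>
  return (title, ps)

main :: IO ()
main = print (scrapeStringLike nested sections)
```

When the headings and paragraphs are flat siblings with no wrapping element, there is no single sub-tree per section to select, which is the limitation described above.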
This is now supported in version 0.6.0. This specific issue has been added as a regression test:
```haskell
    ,   scrapeTest
            "Issue #45 regression test"
            (unlines [
              "<body>"
            , " <h1>title1</h1>"
            , " <h2>title2 1</h2>"
            , " <p>text 1</p>"
            , " <p>text 2</p>"
            , " <h2>title2 2</h2>"
            , " <p>text 3</p>"
            , " <h2>title2 3</h2>"
            , "</body>"
            ])
            (Just [
              ("title2 1", ["text 1", "text 2"])
            , ("title2 2", ["text 3"])
            , ("title2 3", [])
            ])
            (chroot "body" $ inSerial $ many $ do
                title <- seekNext $ text "h2"
                ps <- untilNext (matches "h2") (many $ do
                    -- New lines between tags count as text nodes, skip over
                    -- these.
                    optional $ stepNext $ matches textSelector
                    stepNext $ text "p")
                return (title, ps))
```
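The test case above is a fragment of scalpel's test suite, so it does not run on its own. Below is a rough standalone sketch of the same scraper, assuming scalpel 0.6.0 or later and using `scrapeStringLike` in place of the test-suite helper `scrapeTest`. The names `html` and `sections` are made up for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative (many, optional)
import Text.HTML.Scalpel

-- Same markup as the regression test: headings and paragraphs are siblings.
html :: String
html = unlines
  [ "<body>"
  , " <h1>title1</h1>"
  , " <h2>title2 1</h2>"
  , " <p>text 1</p>"
  , " <p>text 2</p>"
  , " <h2>title2 2</h2>"
  , " <p>text 3</p>"
  , " <h2>title2 3</h2>"
  , "</body>"
  ]

-- Walk the children of <body> in order, grouping each <h2> with the <p>
-- siblings that follow it, up to (but not including) the next <h2>.
sections :: Scraper String [(String, [String])]
sections = chroot "body" $ inSerial $ many $ do
  title <- seekNext $ text "h2"
  ps <- untilNext (matches "h2") $ many $ do
    -- Whitespace between tags shows up as text nodes; skip one if present.
    _ <- optional $ stepNext $ matches textSelector
    stepNext $ text "p"
  return (title, ps)

main :: IO ()
main = print (scrapeStringLike html sections)
```

If the serial combinators behave as in the regression test, this should print the same three groups the test expects, with an empty paragraph list for the last heading.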
Hi. I like your library :+1:. However, I do not see any clear/obvious way to parse (/scrape)
into something like
If I am just missing something, would you consider adding this to the examples? Or maybe a slight change in the combinators? Or maybe introducing some sequence operator?
Probably related to issue #41.
Thanks :-)