fimad / scalpel

A high level web scraping library for Haskell.
Apache License 2.0

Optional Scrapers? #39

Closed xanderdunn closed 8 years ago

xanderdunn commented 8 years ago

Example: Scraping Tweets. Some tweets have location information, and some don't. Some tweets have an extra "card url", and some don't.

If I define a scraper like this:

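import qualified Data.Text as T
import Text.HTML.Scalpel

-- Fields, in order: id, author, location, retweets, likes, body, card URL.
type ScrapeReturn = (T.Text, T.Text, T.Text, T.Text, T.Text, T.Text, T.Text)

tweetScraper :: Scraper T.Text [ScrapeReturn]
tweetScraper = tweets
   where
       -- Run `infos` once per tweet container.
       tweets :: Scraper T.Text [ScrapeReturn]
       tweets = chroots ("div" @: [hasClass "js-stream-tweet"]) infos

       infos :: Scraper T.Text ScrapeReturn
       infos = do
           author <- attr "data-screen-name" Any
           -- Named tweetId rather than id to avoid shadowing Prelude's id.
           tweetId <- attr "data-tweet-id" Any
           body <- text $ "div" @: [hasClass "js-tweet-text-container"]
           counters <- texts $ "span" @: [hasClass "ProfileTweet-actionCountForPresentation"]
           -- head and (!!) are partial: this assumes at least three counters matched.
           let retweets = head counters
           let likes = counters !! 2
           -- These two fail, and so fail the whole tweet, when the element is absent.
           location <- text $ "span" @: [hasClass "Tweet-geo"]
           cardUrl <- attr "data-card-url" ("div" @: [hasClass "js-macaw-cards-iframe-container"])
           return (tweetId, author, location, retweets, likes, T.strip body, cardUrl)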

then it will only return scraped values for those tweets that have both a location and a card_url. That is, nothing at all will be returned for the vast majority of tweets, because most tweets are missing either a location or a card_url.

Is it possible to define a Scraper as optional, rather than a necessary match that causes the Scraper to return nothing when it isn't matched?

Or, is there an "and" operator, as opposed to the <|> operator? Then I could do something like scrape all the locations AND all the card urls AND all the rest of the infos.

Or, this would also be easily achievable with a Scraper that returned a fixed value, something like Scraper "", which returns the empty String. Then I could use the OR operator: location <|> Scraper "".

xanderdunn commented 8 years ago

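Indeed, it was very simple! I simply needed to use the Alternative instance of Scraper and fall back to a scraper that always succeeds with an empty value when something like location can't be found: location <|> pure T.empty. (The Alternative/MonadPlus empty is the scraper that always fails, so location <|> empty on its own would still drop the tweet.)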
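
For anyone landing here later, here is a minimal sketch of the two usual ways to make a field optional. It assumes a recent scalpel release with OverloadedStrings enabled; the markup, the class names (tweet, body, geo), and the names sampleHtml, tweetsWithDefault, and tweetsWithMaybe are made up for illustration and are not Twitter's real markup. The first scraper falls back to a default value with <|> pure "", the second uses optional from Control.Applicative to get a Maybe instead.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Applicative (optional, (<|>))
import qualified Data.Text as T
import Text.HTML.Scalpel

-- Hypothetical, simplified markup: the second tweet has no location.
sampleHtml :: T.Text
sampleHtml = T.concat
  [ "<div class=\"tweet\"><p class=\"body\">hello</p>"
  , "<span class=\"geo\">Paris</span></div>"
  , "<div class=\"tweet\"><p class=\"body\">world</p></div>"
  ]

-- Option 1: fall back to a scraper that always succeeds with a default.
tweetsWithDefault :: Scraper T.Text [(T.Text, T.Text)]
tweetsWithDefault = chroots ("div" @: [hasClass "tweet"]) $ do
  body     <- text ("p" @: [hasClass "body"])
  location <- text ("span" @: [hasClass "geo"]) <|> pure ""
  return (body, location)

-- Option 2: keep the absence visible in the type by wrapping in Maybe.
tweetsWithMaybe :: Scraper T.Text [(T.Text, Maybe T.Text)]
tweetsWithMaybe = chroots ("div" @: [hasClass "tweet"]) $ do
  body     <- text ("p" @: [hasClass "body"])
  location <- optional (text ("span" @: [hasClass "geo"]))
  return (body, location)

main :: IO ()
main = do
  print (scrapeStringLike sampleHtml tweetsWithDefault)
  print (scrapeStringLike sampleHtml tweetsWithMaybe)
```

With either version the tweet without a location should still appear in the result list. The Maybe variant is probably the safer choice when an empty string could also be a legitimate scraped value.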

sullyj3 commented 2 years ago

Thanks for reporting your findings! I was struggling with the same thing, so this was helpful.