fimad / scalpel

A high level web scraping library for Haskell.
Apache License 2.0
323 stars 43 forks source link

Replace Maybe with a descriptive error type (Either) #98

Closed KaneTW closed 11 months ago

KaneTW commented 1 year ago

When parsing fails for whatever reason, scrape always returns a Maybe due to the MaybeT transformer. This loses any error information for why parsing actually failed.

I propose replacing MaybeT with ExceptT.

fimad commented 1 year ago

That would require a breaking API change, which is doable but I want to make sure we have all the aspects of it thought through:

KaneTW commented 1 year ago

I started working on it locally, but it seems complicated. Would need some architectural changes I don't have the overview for.

The way Parsec reports errors seems pretty appropriate, but I'd have to look up how exactly it works under the hood

benjaminweb commented 1 year ago

Chiming in here, @KaneTW. Context: Scraping product from product page. Two outcomes: a. We can scrape a Product from the page. b. Or we cannot (because the page has changed and our scraper no longer work).

We want to know once the page has changed so we can adopt our scraper accordingly!

Approach. We collect everything for a Product but don’t construct it within the scraper. That way we can let return the scraper always something. Let’s call it ProductDto. This ProductDto carries only Maybe values.

Now, in case our scraper fails to even return a Just ProductDto, it will return a Nothing. Getting that, we could translate it to a null ProductDto, holding only values of Nothing. That’d be a wrapper around our scraper.

So, now that we always get a ProductDto, we always have a construction kit for constructing a Product. If the ProductDto is valid as to we can construct a Productfrom it, we do just that. If values are invalid or we don’t have everything to construct a Product, we translate the specific condition to something like ScrapeError with variants explaining what the condition is. If we have multiple conditions, we might return multiple conditions to be comprehensive here.

Now, the function’s signature could look like:

createProduct :: ProductDto -> Either Product [ScrapeError]

@fimad what’s your take on this?

benjaminweb commented 11 months ago

related: https://github.com/fimad/scalpel/issues/8 since then we've got the Monad transformer: https://github.com/fimad/scalpel/pull/87

benjaminweb commented 11 months ago

I've defined ScraperT str (Except e) a and try to implement textComment''' that should return exceptions for unavailable fields. Therefore textComment''' should do a throwError "some sensible message" when the selector cannot find the specified target.

How can I check whether author <- text authorSel was not successful?

https://github.com/benjaminweb/scalpel-exceptt/blob/main/src/Lib.hs#L136-L141

fimad commented 11 months ago

That's a good point about this already being supported via monad transformers.

I took a look at this and you can do this by wrapping Either in the ScraperT class. When you unwrap the result, you'll need to check for 3 cases, (1) an explicit error, (2) a failed scraping without an error, and (3) a valid result:

type Error = String
type ScraperWithError a = ScraperT String (Either Error) a

scrapeStringOrError :: String -> ScraperWithError a -> Either Error a
scrapeStringOrError html scraper
        | Left error    <- result  = Left error
        | Right Nothing <- result  = Left "Unknown error"
        | Right (Just a) <- result = Right a
    where
    result = scrapeStringLikeT html scraper

To add explicit erroring you can use the <|> operator from Alternative to throw and error when something fails:

comment :: ScraperWithError Comment
comment = textComment <|> imageComment <|> throwError "Unknown comment type"

With this approach, when you throw an error it will stop all parsing. So if you have a nested throwError in a in the expression a <|> b. Even if b would be successful, the parsing will fail.

Another approach that would let you accumulate errors would be to use a MonadWriter and accumulate debugging information in a Monoid like a list rather than short circuiting on hard errors.

Below is a modified example from the docs that uses this kind of error checking:

{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel
import Control.Applicative
import Control.Monad.Error.Class (throwError)

exampleHtml :: String
exampleHtml = "<html>\
\    <body>\
\        <div class='comments'>\
\            <div class='comment container'>\
\                <span class='comment author'>Sally</span>\
\                <div class='comment text'>Woo hoo!</div>\
\            </div>\
\            <div class='comment container'>\
\                <span class='comment author'>Bill</span>\
\                <img class='comment image' src='http://example.com/cat.gif' />\
\            </div>\
\            <div class='comment container'>\
\                <span class='comment author'>Susan</span>\
\                <div class='comment text'>WTF!?!</div>\
\            </div>\
\            <div class='comment container'>\
\                <span class='comment author'>Susan</span>\
\                <div class='comment video'>A video? That's new!</div>\
\            </div>\
\        </div>\
\    </body>\
\</html>"

type Error = String

type Author = String

data Comment
    = TextComment Author String
    | ImageComment Author URL
    deriving (Show, Eq)

type ScraperWithError a = ScraperT String (Either Error) a

scrapeStringOrError :: String -> ScraperWithError a -> Either Error a
scrapeStringOrError html scraper
        | Left error    <- result  = Left error
        | Right Nothing <- result  = Left "Unknown error"
        | Right (Just a) <- result = Right a
    where
    result = scrapeStringLikeT html scraper

main :: IO ()
main = print $ scrapeStringOrError exampleHtml comments
    where
    comments :: ScraperWithError [Comment]
    comments = chroots ("div" @: [hasClass "container"]) comment

    comment :: ScraperWithError Comment
    comment = textComment <|> imageComment <|> throwError "Unknown comment type"

    textComment :: ScraperWithError Comment
    textComment = do
        author      <- text $ "span" @: [hasClass "author"]
        commentText <- text $ "div"  @: [hasClass "text"]
        return $ TextComment author commentText

    imageComment :: ScraperWithError Comment
    imageComment = do
        author   <- text       $ "span" @: [hasClass "author"]
        imageURL <- attr "src" $ "img"  @: [hasClass "image"]
        return $ ImageComment author imageURL

This prints out:

Left "Unknown comment type"
fimad commented 11 months ago

Given that this doesn't require a ton of additional code on the users behalf, I plan to update the documentation with some of these examples but not change the API to add types like ScraperWithError.

I think it makes sense to keep the API simple and unopinionated about error handling and then the user can choose the types and approach that makes sense for them / works well with their code base.