Closed KaneTW closed 11 months ago
That would require a breaking API change, which is doable but I want to make sure we have all the aspects of it thought through:
I started working on it locally, but it seems complicated. Would need some architectural changes I don't have the overview for.
The way Parsec reports errors seems pretty appropriate, but I'd have to look up how exactly it works under the hood
Chiming in here, @KaneTW.
Context: Scraping product from product page.
Two outcomes:
a. We can scrape a Product
from the page.
b. Or we cannot (because the page has changed and our scraper no longer work).
We want to know once the page has changed so we can adopt our scraper accordingly!
Approach. We collect everything for a Product
but don’t construct it within the scraper.
That way we can let return the scraper always something. Let’s call it ProductDto
.
This ProductDto
carries only Maybe
values.
Now, in case our scraper fails to even return a Just ProductDto
, it will return a Nothing
.
Getting that, we could translate it to a null ProductDto
, holding only values of Nothing
.
That’d be a wrapper around our scraper.
So, now that we always get a ProductDto
, we always have a construction kit for constructing a Product
.
If the ProductDto
is valid as to we can construct a Product
from it, we do just that.
If values are invalid or we don’t have everything to construct a Product
, we translate the specific condition to something like ScrapeError
with variants explaining what the condition is. If we have multiple conditions, we might return multiple conditions to be comprehensive here.
Now, the function’s signature could look like:
createProduct :: ProductDto -> Either Product [ScrapeError]
@fimad what’s your take on this?
related: https://github.com/fimad/scalpel/issues/8 since then we've got the Monad transformer: https://github.com/fimad/scalpel/pull/87
I've defined ScraperT str (Except e) a
and try to implement textComment'''
that should return exceptions for unavailable fields.
Therefore textComment'''
should do a throwError "some sensible message"
when the selector cannot find the specified target.
How can I check whether author <- text authorSel
was not successful?
https://github.com/benjaminweb/scalpel-exceptt/blob/main/src/Lib.hs#L136-L141
That's a good point about this already being supported via monad transformers.
I took a look at this and you can do this by wrapping Either
in the ScraperT
class. When you unwrap the result, you'll need to check for 3 cases, (1) an explicit error, (2) a failed scraping without an error, and (3) a valid result:
type Error = String
type ScraperWithError a = ScraperT String (Either Error) a
scrapeStringOrError :: String -> ScraperWithError a -> Either Error a
scrapeStringOrError html scraper
| Left error <- result = Left error
| Right Nothing <- result = Left "Unknown error"
| Right (Just a) <- result = Right a
where
result = scrapeStringLikeT html scraper
To add explicit erroring you can use the <|>
operator from Alternative
to throw and error when something fails:
comment :: ScraperWithError Comment
comment = textComment <|> imageComment <|> throwError "Unknown comment type"
With this approach, when you throw an error it will stop all parsing. So if you have a nested throwError
in a
in the expression a <|> b
. Even if b
would be successful, the parsing will fail.
Another approach that would let you accumulate errors would be to use a MonadWriter
and accumulate debugging information in a Monoid
like a list rather than short circuiting on hard errors.
Below is a modified example from the docs that uses this kind of error checking:
{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel
import Control.Applicative
import Control.Monad.Error.Class (throwError)
exampleHtml :: String
exampleHtml = "<html>\
\ <body>\
\ <div class='comments'>\
\ <div class='comment container'>\
\ <span class='comment author'>Sally</span>\
\ <div class='comment text'>Woo hoo!</div>\
\ </div>\
\ <div class='comment container'>\
\ <span class='comment author'>Bill</span>\
\ <img class='comment image' src='http://example.com/cat.gif' />\
\ </div>\
\ <div class='comment container'>\
\ <span class='comment author'>Susan</span>\
\ <div class='comment text'>WTF!?!</div>\
\ </div>\
\ <div class='comment container'>\
\ <span class='comment author'>Susan</span>\
\ <div class='comment video'>A video? That's new!</div>\
\ </div>\
\ </div>\
\ </body>\
\</html>"
type Error = String
type Author = String
data Comment
= TextComment Author String
| ImageComment Author URL
deriving (Show, Eq)
type ScraperWithError a = ScraperT String (Either Error) a
scrapeStringOrError :: String -> ScraperWithError a -> Either Error a
scrapeStringOrError html scraper
| Left error <- result = Left error
| Right Nothing <- result = Left "Unknown error"
| Right (Just a) <- result = Right a
where
result = scrapeStringLikeT html scraper
main :: IO ()
main = print $ scrapeStringOrError exampleHtml comments
where
comments :: ScraperWithError [Comment]
comments = chroots ("div" @: [hasClass "container"]) comment
comment :: ScraperWithError Comment
comment = textComment <|> imageComment <|> throwError "Unknown comment type"
textComment :: ScraperWithError Comment
textComment = do
author <- text $ "span" @: [hasClass "author"]
commentText <- text $ "div" @: [hasClass "text"]
return $ TextComment author commentText
imageComment :: ScraperWithError Comment
imageComment = do
author <- text $ "span" @: [hasClass "author"]
imageURL <- attr "src" $ "img" @: [hasClass "image"]
return $ ImageComment author imageURL
This prints out:
Left "Unknown comment type"
Given that this doesn't require a ton of additional code on the users behalf, I plan to update the documentation with some of these examples but not change the API to add types like ScraperWithError
.
I think it makes sense to keep the API simple and unopinionated about error handling and then the user can choose the types and approach that makes sense for them / works well with their code base.
When parsing fails for whatever reason, scrape always returns a Maybe due to the MaybeT transformer. This loses any error information for why parsing actually failed.
I propose replacing MaybeT with ExceptT.