Closed acastello closed 4 years ago
Looks good just a few last nits. Thanks for working on this, and for adding examples and documentation!
I'm wondering if it would make sense to add a note / warning somewhere about the need for rate limiting and the potential for visiting a node multiple times. Perhaps on the
scrapeT
ordownloadUrlSpec
documentation.For example, given the following HTML and scraper, I think you'd end up fetching the link multiple times which is probably undesirable:
<div> <div> <a href="http://example.com">foobar</b> </div> </div>
linksInDivs :: ScraperT Text IO [Text] linksInDivs = chroots ("div" // "a") do url <- attr "href" scrapeT text =<< downloadUrlSpec url
Providing a rate limiting and caching framework is out of scope of this PR (and possibly
scalpel
in general), but I think it probably makes sense to give users a heads up that they would need to think about this if they are doing IO inside aScraperT
.
I think that's something that the user should consider themselves when implementing. I will add an example of a way to handle this using StateT
in the README and examples
Looks good!
Is there anything else you are planning to include in the PR? If not, I'll merge this and get a release out with the new APIs.
No, I believe I've made it as mtl
-like as possible and documented it properly. Also, I think it's completely backwards-compatible, versioning shouldn't be a hassle.
Ready to merge
I'll try to get a release out either tonight or this weekend. Still trying to decide if this can be released as 0.6.2
or needs a bump to 0.7.0
.
Considerably later than "tonight or this weekend" but this has now been released as 0.6.2
.
The current
Scraper
implementation doesn't allow for weaving in IO operations while scraping, which is anti-intuitive since more often than not you do want to interact withsrc
orhref
attributes in an immediate IO manner.mtl
can be used to add functionality to theScraper
monad, including general lifting and IO lifting in particular. Rewriting theScraper
andSerialScraper
data types intotransformers
along withGeneralizedNewtypeDeriving
also saves lots in writing instances and handling theMaybe
logic, usingMaybeT
instead.