fimad / scalpel

A high level web scraping library for Haskell.
Apache License 2.0
323 stars 43 forks source link

Make Scraper into a Monad transformer #87

Closed acastello closed 4 years ago

acastello commented 4 years ago

The current Scraper implementation doesn't allow for weaving in IO operations while scraping, which is anti-intuitive since more often than not you do want to interact with src or href attributes in an immediate IO manner. mtl can be used to add functionality to the Scraper monad, including general lifting and IO lifting in particular. Rewriting the Scraper and SerialScraper data types into transformers along with GeneralizedNewtypeDeriving also saves lots in writing instances and handling the Maybe logic, using MaybeT instead.

acastello commented 4 years ago

Looks good just a few last nits. Thanks for working on this, and for adding examples and documentation!

I'm wondering if it would make sense to add a note / warning somewhere about the need for rate limiting and the potential for visiting a node multiple times. Perhaps on the scrapeT or downloadUrlSpec documentation.

For example, given the following HTML and scraper, I think you'd end up fetching the link multiple times which is probably undesirable:

<div>
  <div>
    <a href="http://example.com">foobar</b>
  </div>
</div>
linksInDivs :: ScraperT Text IO [Text]
linksInDivs = chroots ("div" // "a") do
  url <- attr "href"
  scrapeT text =<< downloadUrlSpec url

Providing a rate limiting and caching framework is out of scope of this PR (and possibly scalpel in general), but I think it probably makes sense to give users a heads up that they would need to think about this if they are doing IO inside a ScraperT.

I think that's something that the user should consider themselves when implementing. I will add an example of a way to handle this using StateT in the README and examples

fimad commented 4 years ago

Looks good!

Is there anything else you are planning to include in the PR? If not, I'll merge this and get a release out with the new APIs.

acastello commented 4 years ago

No, I believe I've made it as mtl-like as possible and documented it properly. Also, I think it's completely backwards-compatible, versioning shouldn't be a hassle.

acastello commented 4 years ago

Ready to merge

fimad commented 4 years ago

I'll try to get a release out either tonight or this weekend. Still trying to decide if this can be released as 0.6.2 or needs a bump to 0.7.0.

fimad commented 4 years ago

Considerably later than "tonight or this weekend" but this has now been released as 0.6.2.