ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

Smart domain-based rate limiting #15

Open noamross opened 10 years ago

noamross commented 10 years ago

When provided a list of URLs, quickscrape could be smart about the order in which it scrapes them to speed up the overall process, hitting each domain no more frequently than the allowed rate. A couple of approaches:

IF (current_URL.domain == last_URL.domain AND current_time < last_time + min_wait)
  skipped_URLs.add(current_URL)
  current_URL = next_URL
IF (end of URLs reached)
  URLs = skipped_URLs

Or something like that.
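A minimal runnable sketch of that skip-and-requeue loop in plain JavaScript; scrapeInOrder, scrape and minWait are names made up for illustration, not quickscrape's API:

const url = require('url');

// Walk the URL list in order, deferring any URL whose domain was hit
// less than minWait ms ago, then retry the deferred URLs in further
// passes until none remain.
async function scrapeInOrder(urls, scrape, minWait) {
  const lastHit = {}; // domain -> timestamp of the last request
  let queue = urls.slice();
  while (queue.length > 0) {
    const skipped = [];
    for (const u of queue) {
      const domain = url.parse(u).hostname;
      if (lastHit[domain] && Date.now() < lastHit[domain] + minWait) {
        skipped.push(u); // too soon for this domain, try next pass
      } else {
        lastHit[domain] = Date.now();
        await scrape(u);
      }
    }
    if (skipped.length === queue.length) {
      // everything was deferred; wait out the rate-limit window
      await new Promise(resolve => setTimeout(resolve, minWait));
    }
    queue = skipped;
  }
}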

blahah commented 10 years ago

This is on the medium-term to-do list, and I've been thinking about it a bit. I'd like to have a set of browsing behaviours defined, along these lines:

This being node, it's actually easy to run scraping against all domains simultaneously. So it seems to me the easiest path to a per-domain rate-limit is to queue the input URLs into one queue per domain, and have a pool of workers, one per queue, each rate-limiting itself.
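A minimal sketch of that queue-per-domain layout in plain JavaScript; scrapeByDomain, scrape and minWait are illustrative names, not part of thresher:

const url = require('url');

// Split the input URLs into one FIFO queue per domain, then start one
// worker per queue. Each worker self-limits to one request per minWait
// milliseconds, so a single domain never sees bursts while different
// domains are still scraped concurrently.
function scrapeByDomain(urls, scrape, minWait) {
  const queues = {};
  for (const u of urls) {
    const domain = url.parse(u).hostname;
    (queues[domain] = queues[domain] || []).push(u);
  }
  const workers = Object.keys(queues).map(async (domain) => {
    for (const u of queues[domain]) {
      await scrape(u);
      await new Promise(resolve => setTimeout(resolve, minWait));
    }
  });
  return Promise.all(workers); // resolves once every queue has drained
}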

For smart browsing, I'd like to have a human browsing behaviour simulator with a minimum time between requests, stochastic bursts of requests simulating click-batches, and occasional long pauses of semi-random duration. This would operate on a per-queue basis.
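One way to sketch that delay model (every constant here is a placeholder guess, not a proposed default):

// Produce per-request delays mimicking a human: rapid clicks inside
// stochastic click-batches, occasional long pauses of semi-random
// duration, and ordinary browsing gaps otherwise.
function makeHumanDelay(minWait) {
  let burstLeft = 0; // requests remaining in the current click-batch
  return function nextDelay() {
    if (burstLeft > 0) {
      burstLeft--;
      return minWait + Math.random() * 300; // quick click within a batch
    }
    const r = Math.random();
    if (r < 0.15) {
      burstLeft = 2 + Math.floor(Math.random() * 5); // start a batch of 2-6
      return minWait + Math.random() * 300;
    }
    if (r < 0.2) {
      return 30000 + Math.random() * 120000; // long pause, 30-150 seconds
    }
    return minWait + 2000 + Math.random() * 6000; // normal gap between pages
  };
}

Each domain worker would then sleep for nextDelay() milliseconds between requests on its queue.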

noamross commented 10 years ago

Nice. I would suggest that rate limits/smart browsing behavior be set in the scraper files, as domains and scrapers usually match up. They could be overridden by a command line argument.
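For example, a scraper definition could carry a rate-limit section next to its element definitions, with a command line flag taking precedence. The rateLimit field and the --ratelimit flag below are hypothetical illustrations of this suggestion, not existing ScraperJSON or quickscrape features:

{
  "url": "plosone",
  "rateLimit": { "minWait": 2000, "behaviour": "human" },
  "elements": {
    "title": { "selector": "//h1" }
  }
}

quickscrape --scraper plosone.json --ratelimit 5000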

blahah commented 10 years ago

Well, the rate-limiting behaviour will be specific to quickscrape (or rather the backend library, thresher), whereas ScraperJSON is intended to be available for use in any scraping tool that supports it.

I'm open to suggestions on this, but my initial concept was that any ScraperJSON tool can choose its own scraping philosophy (headless or not, rate limiting, etc.), while the ScraperJSON definitions tell it which elements to extract. Under this model, I would expose the rate-limiting behaviour via options in thresher, but not make it part of ScraperJSON.
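As a purely illustrative sketch of that split (the option names and method below are assumptions, not thresher's documented API), the tool owns the politeness policy while the scraper definition stays portable:

const thresher = require('thresher');

// Hypothetical: rate-limiting policy lives in tool-side options...
const t = new thresher.Thresher({
  rateLimit: { minWait: 2000, behaviour: 'human' }
});

// ...while the ScraperJSON definition only says what to extract.
t.scrape(targetUrl, scraperDefinition);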