cldellow / datasette-scraper

Add website scraping abilities to Datasette
Apache License 2.0

plugins: max-pages, max-pages-per-domain #20

cldellow closed 1 year ago

cldellow commented 1 year ago
      // The maximum number of pages to crawl in this crawl
      // If absent, no limit
      // NB: seed URLs will always be crawled
      "max-pages": 5,

      // The maximum number of pages to crawl for any single domain
      // If absent, no limit
      // NB: seed URLs will always be crawled
      "max-pages-per-domain": 5,

Needs the before_fetch_url hook: https://github.com/cldellow/datasette-scraper#before_fetch_urlscraper-config-url-request_headers (and, as an optimization, the canonicalize_url hook: https://github.com/cldellow/datasette-scraper#canonicalize_urlconfig-from_url-to_url-to_url_depth)
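
A rough sketch of how the two limits could be checked before each fetch. This is not the plugin's actual code: PageLimiter and its allow() method are hypothetical names, and it assumes seed URLs count toward both totals even though they are never rejected (the issue does not say either way).

```python
from collections import Counter
from typing import Optional
from urllib.parse import urlsplit


class PageLimiter:
    """Hypothetical helper, not part of datasette-scraper."""

    def __init__(
        self,
        max_pages: Optional[int] = None,
        max_pages_per_domain: Optional[int] = None,
    ) -> None:
        # None mirrors "If absent, no limit" in the config above.
        self.max_pages = max_pages
        self.max_pages_per_domain = max_pages_per_domain
        self.total = 0
        self.per_domain: Counter = Counter()

    def allow(self, url: str, is_seed: bool = False) -> bool:
        """Return True if the URL may be fetched, and record it if so."""
        domain = urlsplit(url).hostname or ""
        if not is_seed:
            # Seed URLs skip both checks: they are always crawled.
            if self.max_pages is not None and self.total >= self.max_pages:
                return False
            if (
                self.max_pages_per_domain is not None
                and self.per_domain[domain] >= self.max_pages_per_domain
            ):
                return False
        # Assumption: seeds still count toward the totals.
        self.total += 1
        self.per_domain[domain] += 1
        return True


limiter = PageLimiter(max_pages=3, max_pages_per_domain=2)
for url, is_seed in [
    ("https://a.example/", True),
    ("https://a.example/1", False),
    ("https://a.example/2", False),  # rejected: per-domain limit reached
    ("https://b.example/1", False),
    ("https://b.example/2", False),  # rejected: overall limit reached
]:
    print(url, limiter.allow(url, is_seed=is_seed))
```

A check like this would presumably run from the before_fetch_url hook linked above, so that a rejected URL is never requested.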