cldellow / datasette-scraper

Add website scraping abilities to Datasette
Apache License 2.0

plugin: seed-sitemaps #12

Closed by cldellow 1 year ago

cldellow commented 1 year ago
      // A set of domains whose sitemaps will be discovered via
      // robots.txt, and used to discover URLs (optional)
      "seed-sitemaps": ["news.ycombinator.com"]

Needs the get_seed_urls and discover_urls plugin hooks: https://github.com/cldellow/datasette-scraper#get_seed_urlsscraper-config, https://github.com/cldellow/datasette-scraper#discover_urlsscraper-config-url-response

Returns an array of sitemap URLs, and also knows how to discover new URLs from those sitemaps.
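
A rough sketch of the two parsing steps this would need, assuming the robots.txt and sitemap bodies have already been fetched as text. The function names `sitemaps_from_robots` and `urls_from_sitemap` are hypothetical and not part of datasette-scraper; how their results get wired into the hooks above is left out.

```python
import xml.etree.ElementTree as ET


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Collect the URLs named by `Sitemap:` lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "https:" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def _local(tag: str) -> str:
    """Drop the XML namespace prefix, e.g. '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def urls_from_sitemap(xml_text: str) -> tuple[list[str], list[str]]:
    """Return (child_sitemaps, page_urls) found in a sitemap document.

    A <sitemapindex> lists further sitemaps to fetch; a <urlset> lists pages.
    """
    root = ET.fromstring(xml_text)
    locs = []
    # Only look at <loc> directly under each <url> or <sitemap> entry, so
    # nested extension elements (e.g. image entries) are not picked up.
    for entry in root:
        for child in entry:
            if _local(child.tag) == "loc" and child.text and child.text.strip():
                locs.append(child.text.strip())
    if _local(root.tag) == "sitemapindex":
        return locs, []
    return [], locs
```

In this sketch, the sitemap URLs from robots.txt would be the seeds, and each fetched sitemap would yield either more sitemaps to crawl or the page URLs themselves.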