cldellow / datasette-scraper

Add website scraping abilities to Datasette
Apache License 2.0

hook: discover_urls #30

Closed cldellow closed 1 year ago

cldellow commented 1 year ago

discover_urls(scraper, config, url, response)

Returns a list of URLs to crawl.

Each URL can be either a string, in which case it gets enqueued at the current page's depth + 1, or a tuple of URL and depth. The tuple form can be useful for paginated index pages, where you'd like to crawl to a max depth of, say, 2, but treat all the index pages as being at depth 1.
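As a rough illustration of the two return shapes, here is a minimal sketch of an implementation. It makes several assumptions that are not confirmed by this issue: that `response` is dict-like with the page body under a `"text"` key, that pagination links can be recognised by a `page=` query parameter, and that the function is registered with the plugin system elsewhere (registration is not shown). The `_LinkCollector` helper is hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def discover_urls(scraper, config, url, response):
    # Assumption: response is dict-like and the HTML body lives under "text".
    parser = _LinkCollector()
    parser.feed(response.get("text") or "")

    found = []
    for href in parser.hrefs:
        absolute = urljoin(url, href)
        if "page=" in absolute:
            # Assumed pagination marker: pin every index page to depth 1
            # by returning a (url, depth) tuple.
            found.append((absolute, 1))
        else:
            # Plain string: enqueued at the current page's depth + 1.
            found.append(absolute)
    return found
```

With this shape, detail pages linked from any index page land at depth 2 no matter how many pages of pagination were followed to reach them.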

Note

Sneaky plugins can abuse this hook to stash the response somewhere so that future runs can avoid hitting the origin server. If link discovery and extraction ever become a multiprocess thing, we'll add an explicit after_fetch_url hook.
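For what it's worth, the "sneaky" approach might look something like the sketch below. This is only an illustration under assumptions: that `response` is JSON-serializable, and that returning an empty list from the hook is acceptable when a plugin has no URLs to contribute. `open_stash`, `make_discover_urls`, and `lookup` are hypothetical names, not part of datasette-scraper.

```python
import json
import sqlite3


def open_stash(path):
    """Open (or create) a SQLite file used to stash responses between runs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stash ("
        "url TEXT PRIMARY KEY, response TEXT NOT NULL)"
    )
    return conn


def make_discover_urls(conn):
    """Build a discover_urls implementation that only stashes the response."""

    def discover_urls(scraper, config, url, response):
        # Assumption: response can be serialized with json.dumps.
        with conn:
            conn.execute(
                "INSERT OR REPLACE INTO stash (url, response) VALUES (?, ?)",
                (url, json.dumps(response)),
            )
        # This plugin discovers nothing itself; it just piggybacks on the hook.
        return []

    return discover_urls


def lookup(conn, url):
    """Return the stashed response for url, or None if it was never stored."""
    row = conn.execute(
        "SELECT response FROM stash WHERE url = ?", (url,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

A later run could call `lookup` before fetching and skip the origin server on a hit. How such a lookup would plug into the fetch step is outside what this hook covers.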