hook: discover_urls

`discover_urls(scraper, config, url, response)`

TODO: this probably wants access to a (lazily) parsed form of the response?

Returns a list of URLs to crawl.

Each entry can be either a string, in which case it gets enqueued at depth + 1, or a tuple of (URL, depth). The tuple form is useful for paginated index pages, where you'd like to crawl to a max depth of, say, 2, but treat all the index pages as being at depth 1.
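As a rough sketch of the two return forms, a plugin might look something like the following. The shape of `response` is not specified in this issue, so the code assumes it is a dict with the HTML body under a `"text"` key. The `page=N` pattern for index pages and the regex-based link extraction are also made up for illustration, and the plugin registration decorator is left out.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern for paginated index pages, e.g. /issues?page=3
INDEX_PAGE = re.compile(r"[?&]page=\d+")
# Naive link extraction, good enough for a sketch; a real plugin would parse the HTML
HREF = re.compile(r'href="([^"]+)"')


def discover_urls(scraper, config, url, response):
    # Assumption: response is a dict whose "text" key holds the HTML body.
    html = response["text"]
    found = []
    for href in HREF.findall(html):
        absolute = urljoin(url, href)
        if INDEX_PAGE.search(absolute):
            # Tuple form: pin every index page to depth 1
            found.append((absolute, 1))
        else:
            # String form: the crawler enqueues this at depth + 1
            found.append(absolute)
    return found
```

With a max depth of 2, pinning index pages to depth 1 means that items linked from any index page still land at depth 2, however many "next page" links were followed to reach them.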

Note

Sneaky plugins can abuse this hook to stash the response somewhere so that future runs can avoid hitting the origin server. If link discovery and extraction ever become a multiprocess thing, we'll add an explicit `after_fetch_url` hook.

cldellow / datasette-scraper

hook: discover_urls #30
