
cldellow / datasette-scraper

Add website scraping abilities to Datasette

Apache License 2.0


hook: before_fetch_url #28

Closed cldellow closed 1 year ago

cldellow commented 1 year ago

`before_fetch_url(scraper, config, url, request_headers)`

`request_headers` is a dict; you can modify it in place to control what gets sent in the request.

Returns:

- truthy to indicate this URL should not be crawled (for example, because the crawl's max page limit has been reached)
- falsy to express no opinion
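A rough sketch of what an implementation might look like, using only the signature above. The plugin registration decorator is left out, and the `user_agent` config key, the default agent string, and the `BLOCKED_SUFFIXES` list are all made up for illustration; they are not part of datasette-scraper.

```python
# Hypothetical list of file suffixes this example refuses to fetch.
BLOCKED_SUFFIXES = (".pdf", ".zip")


def before_fetch_url(scraper, config, url, request_headers):
    # Mutate the dict in place so the caller sees the change.
    # "user_agent" is an assumed config key, not a documented one.
    request_headers["User-Agent"] = config.get("user_agent", "example-bot/0.1")

    # Truthy: this URL should not be crawled.
    if url.lower().endswith(BLOCKED_SUFFIXES):
        return True

    # Falsy: no opinion, let other hooks decide.
    return None


# Calling the function directly, outside of any plugin machinery.
headers = {}
skip = before_fetch_url(
    None, {"user_agent": "my-bot/1.0"}, "https://example.com/a.pdf", headers
)
```

Here `skip` is truthy because the URL ends in `.pdf`, and `headers` now carries the `User-Agent` that would be sent with any request that does go ahead.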

Note: `before_fetch_url` vs `canonicalize_url`

You can also use the `canonicalize_url` hook to reject URLs before they enter the crawl queue.

A URL rejected by `canonicalize_url` will not result in an entry in either the `dss_crawl_queue` or the `dss_crawl_queue_history` table.

Which one you use is a matter of taste. In general, if you never want the URL, reject it at canonicalization time.