cldellow / datasette-scraper

Add website scraping abilities to Datasette
Apache License 2.0

hook: before_fetch_url #28

Closed cldellow closed 1 year ago

cldellow commented 1 year ago

before_fetch_url(scraper, config, url, request_headers)

request_headers is a dict; you can modify it in place to control which headers are sent in the request.
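As a rough illustration of modifying request_headers, here is a minimal sketch of a hook body using the signature above. It assumes config behaves like a dict and that a "user_agent" key exists in it; both are hypothetical and not confirmed by this issue. Plugin registration (decorators, entry points) is omitted, and the return value is left as None because the hook's return semantics are not spelled out here.

```python
def before_fetch_url(scraper, config, url, request_headers):
    """Hypothetical hook body: adjust headers before the request is sent."""
    # Assumption: config is dict-like and may carry a "user_agent" key.
    # Neither the key name nor the fallback value comes from datasette-scraper.
    request_headers["User-Agent"] = config.get("user_agent", "example-bot/0.1")

    # Headers can also be set conditionally, based on the URL being fetched.
    if url.startswith("https://example.com/"):
        request_headers["Accept-Language"] = "en"

    # No explicit return: this sketch only mutates request_headers in place.


# Standalone usage, calling the function directly rather than through the plugin system.
headers = {}
before_fetch_url(
    scraper=None,
    config={"user_agent": "my-crawler/1.0"},
    url="https://example.com/page",
    request_headers=headers,
)
print(headers)
```

Because the dict is mutated in place, the caller sees the added headers without the hook needing to return them.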

Returns:

Note: before_fetch_url vs canonicalize_url

You can also use the canonicalize_url hook to reject URLs prior to them entering the crawl queue.

A URL rejected by canonicalize_url never enters the crawl queue, so it produces no entry in either the dss_crawl_queue or the dss_crawl_queue_history table.

Which one you use is a matter of taste. In general, if you never want the URL, reject it at canonicalization time.