before_fetch_url(scraper, config, url, request_headers)
request_headers is a dict; you can modify it to control what gets sent in the request.
Returns:
Note: before_fetch_url vs canonicalize_url. You can also use the canonicalize_url hook to reject URLs before they enter the crawl queue. A URL rejected by canonicalize_url will not result in an entry in the dss_crawl_queue and dss_crawl_queue_history tables. Which one you use is a matter of taste. In general, if you never want the URL, reject it at canonicalization time.
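As a rough illustration of the request_headers parameter of before_fetch_url, the sketch below mutates the dict in place to add two headers. The header values are made up for the example, and the function is shown as a plain Python function: how it is registered as a hook, and what its return value means, are not covered here, so it simply returns None.

```python
def before_fetch_url(scraper, config, url, request_headers):
    """Sketch of a before_fetch_url hook that adjusts outgoing headers."""
    # request_headers is a plain dict, so in-place changes are what the
    # caller sees afterwards.
    # Both header values below are illustrative assumptions.
    request_headers["User-Agent"] = "example-crawler/0.1"
    request_headers["Accept-Language"] = "en"

    # Assumption: the meaning of the return value is not documented in
    # this section, so this sketch returns None.
    return None


# Example call with stand-in arguments (scraper and config are unused here).
headers = {"Accept": "text/html"}
before_fetch_url(None, {}, "https://example.com/", headers)
```

Because the dict is modified in place, any headers that were already present (such as Accept above) are left untouched unless the hook overwrites them.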