ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Automatically slow down for a domain on 429 Too Many Requests #112

Open ivan opened 6 years ago

ivan commented 6 years ago

Write a better default wait_time hook that does this; see https://github.com/ludios/grab-site/issues/59#issuecomment-343104125 for an example.

ivan commented 6 years ago

Possible implementation strategy:

Implement #59 so that the user can easily adjust delays on a per-domain basis.

For each 429 response, add (number of connections in use * 1 second) to the wait time for that domain in the delay_regexps file (an entry there looks like e.g. 2000 ^https?://foo\.tld/), then rewrite delay_regexps.
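
A rough sketch of the bookkeeping this strategy implies, in Python. It is not grab-site code. It assumes that each delay_regexps line is `<delay in milliseconds> <regexp>` (inferred from the `2000 ^https?://foo\.tld/` example above), that one pattern per hostname is enough, and that an existing entry is found by exact pattern equality. The function names `parse_delay_regexps`, `domain_pattern`, `record_429`, and `serialize` are hypothetical.

```python
import re
from urllib.parse import urlsplit


def parse_delay_regexps(text):
    """Parse '<delay_ms> <regexp>' lines into a list of [delay_ms, pattern].

    The file format is an assumption based on the example in this issue.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        delay_ms, pattern = line.split(None, 1)
        entries.append([int(delay_ms), pattern])
    return entries


def domain_pattern(url):
    """Build a per-domain pattern such as ^https?://foo\\.tld/ for a URL."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"URL has no hostname: {url!r}")
    return r"^https?://" + re.escape(host) + "/"


def record_429(entries, url, connections):
    """Add (connections * 1 second) to the delay for the URL's domain.

    If the domain has no entry yet, one is appended with just the penalty.
    """
    penalty_ms = connections * 1000
    pattern = domain_pattern(url)
    for entry in entries:
        if entry[1] == pattern:
            entry[0] += penalty_ms
            return entries
    entries.append([penalty_ms, pattern])
    return entries


def serialize(entries):
    """Render entries back into the text to write to delay_regexps."""
    return "".join(f"{delay_ms} {pattern}\n" for delay_ms, pattern in entries)
```

For example, with `2000 ^https?://foo\.tld/` already in the file and 3 connections in use, one 429 from `https://foo.tld/a` would raise that entry to 5000. A real hook would also need to decide when to lower the delay again, which this sketch leaves out.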