
buren commented 3 years ago

@chlorophyll-zz the only diff between master and this PR is now the sleep call that you added in the Spidr.every_page { ... } block.

There are two kinds of 429 Too Many Requests errors that can happen. One occurs when crawling for URLs to send to the Wayback Machine, which is the one you've addressed in the Spidr.every_page { ... } block. The other occurs when posting URLs to the Wayback Machine, and it is currently unhandled.

In my experience, 429 Too Many Requests errors are very rare when crawling for URLs to send. Posting URLs to the Wayback Machine is a different story: it has very aggressive rate limiting, so those errors are very easy to run into. That's why the concurrency in this gem has been dropped from five to one. Dropping the concurrency to one has fixed all those errors for me, though I haven't tested a ton. Perhaps a configurable sleep length could be implemented as well, but that would need to be added to the Archive class.
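To make the configurable sleep idea concrete, here is a rough sketch of what it could look like. It is not the gem's code: `ThrottledPoster`, `post_all`, and the `delay`, `max_retries`, and `sleeper` options are all hypothetical names, and the block that does the actual posting is left to the caller. It processes URLs one at a time, pauses after each one, and backs off when the block reports a 429.

```ruby
# Hypothetical helper, not part of wayback_archiver: calls the given block once
# per URL, one at a time, pausing after each URL and backing off on a 429.
class ThrottledPoster
  # delay:       seconds to pause after each URL (assumed default, tune as needed)
  # max_retries: how many times a 429 is retried before giving up on that URL
  # sleeper:     callable used to sleep; injectable so the pauses can be stubbed
  def initialize(delay: 5, max_retries: 3, sleeper: Kernel.method(:sleep))
    @delay = delay
    @max_retries = max_retries
    @sleeper = sleeper
  end

  # The block receives a URL and must return an HTTP status code
  # (Integer or String). Returns a hash of url => final status code.
  def post_all(urls)
    urls.each_with_object({}) do |url, results|
      attempts = 0
      loop do
        status = yield(url).to_i
        attempts += 1
        results[url] = status
        break unless status == 429 && attempts <= @max_retries

        # Exponential backoff: 2x, 4x, 8x, ... the base delay.
        @sleeper.call(@delay * (2**attempts))
      end
      @sleeper.call(@delay) # fixed pause after each URL
    end
  end
end
```

The block passed to `post_all` would wrap whatever request the Archive class makes and return its status code. Keeping the sleep behind an injectable `sleeper` is only a convenience so the pauses can be skipped in specs.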

Thanks for your PR and sorry for being so slow 😄

buren commented 3 years ago

v1.4.0 is available on RubyGems; here's the CHANGELOG.

buren / wayback_archiver

avoid 429, ignore robots #33
