buren / wayback_archiver

Ruby gem to send URLs to Wayback Machine
https://rubygems.org/gems/wayback_archiver
MIT License

avoid 429, ignore robots #33

bartman081523 closed 3 years ago

bartman081523 commented 4 years ago

avoid 429, ignore robots: lowered concurrency to 1 and sleep 5 seconds before the next request. The other patch ignores robots.txt, for the reasons given here: https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/

bartman081523 commented 3 years ago

41

buren commented 3 years ago

@chlorophyll-zz the only diff between master and this PR is now the sleep call that you added in the Spidr.every_page { ... } block.

There are two types of 429 Too Many Requests errors that can happen. One is when crawling for URLs to send to the Wayback Machine, which is the one you've addressed in the Spidr.every_page { ... } block; the other is when posting URLs to the Wayback Machine (currently unhandled).

In my experience, 429 Too Many Requests errors are very rare when crawling for URLs to send. Posting URLs to the Wayback Machine is different: it has very aggressive rate limiting, so those errors are very easy to run into. That's why the concurrency in this gem has been dropped from five to one. Dropping the concurrency to one has fixed all those errors for me, though I haven't tested extensively. Perhaps a configurable sleep length could be implemented, though that would need to be added to the Archive class.
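To make the configurable-sleep idea concrete, here is a minimal sketch of what such throttling could look like. It is not the gem's actual code: `ThrottledPoster`, its `post:`, `delay:`, `max_retries:` and `sleeper:` parameters, and the doubling back-off on 429 are all hypothetical names and choices made for illustration. The `post` callable is assumed to take a URL and return an integer HTTP status code.

```ruby
# Hypothetical sketch, not part of wayback_archiver's real API.
# Posts URLs one at a time (concurrency of one), sleeping a configurable
# number of seconds between requests and backing off when a 429 comes back.
class ThrottledPoster
  # post:        callable taking a URL and returning an Integer status code
  #              (assumed interface; in the gem this would wrap the HTTP request)
  # delay:       seconds to wait between consecutive requests
  # max_retries: how many times to retry a single URL after a 429
  # sleeper:     callable used to pause; injectable so tests need not wait
  def initialize(post:, delay: 5, max_retries: 3, sleeper: ->(seconds) { sleep(seconds) })
    @post = post
    @delay = delay
    @max_retries = max_retries
    @sleeper = sleeper
  end

  # Returns a Hash mapping each URL to the last status code seen for it.
  def post_all(urls)
    results = {}
    urls.each_with_index do |url, index|
      # Pause before every request except the first one.
      @sleeper.call(@delay) if index.positive?
      results[url] = post_with_retry(url)
    end
    results
  end

  private

  # Retries on 429 with a doubling wait (2x, 4x, 8x the base delay),
  # giving up and returning 429 once max_retries is exhausted.
  def post_with_retry(url)
    attempts = 0
    loop do
      status = @post.call(url)
      return status unless status == 429 && attempts < @max_retries

      attempts += 1
      @sleeper.call(@delay * (2**attempts))
    end
  end
end
```

Passing the sleep and the request in as callables keeps the sketch independent of any particular HTTP client and lets the delay be set to zero in tests. Whether a fixed delay, a back-off, or both is appropriate would depend on how the Wayback Machine's rate limiting actually behaves, which this sketch does not assume.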

Thanks for your PR and sorry for being so slow 😄

buren commented 3 years ago

v1.4.0 is available on RubyGems; here's the CHANGELOG.