bartman081523 closed this 3 years ago
@chlorophyll-zz the only diff between master and this PR is now the `sleep` call that you added in the `Spidr.every_page { ... }` block.
There are two types of `429 Too Many Requests` errors that can happen. One is when crawling for URLs to send to the Wayback Machine, which you've addressed in the `Spidr.every_page { ... }` block; the other is when posting URLs to the Wayback Machine (currently unhandled).
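For the posting side, one way to handle a `429` would be to retry with a growing pause. A minimal sketch, assuming a hypothetical `post_with_retry` helper and an injected `post` callable standing in for whatever HTTP call the gem actually makes (neither exists in the gem today):

```ruby
# Hypothetical sketch: retry posting a URL to the Wayback Machine when it
# answers 429 Too Many Requests. `post` is any callable that takes a URL
# and returns an HTTP status code; it stands in for the gem's real request.
def post_with_retry(url, max_retries: 3, backoff: 1, post:)
  attempts = 0
  loop do
    status = post.call(url)
    return status unless status == 429 # success or a non-rate-limit error

    attempts += 1
    raise "gave up after #{max_retries} retries: #{url}" if attempts > max_retries

    sleep(backoff * attempts) # back off a little longer on each retry
  end
end
```

Injecting the `post` callable keeps the retry logic testable without hitting the real archive endpoint.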
In my experience, `429 Too Many Requests` errors are very rare when crawling for URLs to send, but posting URLs to the Wayback Machine, which has very aggressive rate limiting, runs into them very easily. That's why the concurrency in this gem has been dropped from five to one. Dropping the concurrency to one has fixed all of those errors for me, though I haven't tested it extensively. Perhaps a configurable `sleep` length could be implemented, though that would need to be added to the `Archive` class.
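A configurable sleep on the `Archive` class might look something like the sketch below. This is purely illustrative: the class shape, the `request_delay` option name, and the `post_each` method are assumptions, not the gem's actual API.

```ruby
# Hypothetical sketch of a configurable per-request delay on an
# Archive-like class; the real gem's API may differ.
class Archive
  attr_reader :request_delay

  # request_delay: seconds to pause between posted URLs (assumed default: 5)
  def initialize(request_delay: 5)
    @request_delay = request_delay
  end

  # Posts each URL via the given block, pausing between requests so a
  # concurrency of one plus a delay keeps us under the rate limit.
  def post_each(urls)
    urls.each_with_index do |url, i|
      sleep(request_delay) if i.positive? # no pause before the first request
      yield url
    end
  end
end
```

Exposing the delay as a constructor option would let callers tune it without touching the crawl code.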
Thanks for your PR and sorry for being so slow 😄
v1.4.0 is available on RubyGems; here's the CHANGELOG.
avoid 429, ignore robots: lowered concurrency to 1, sleep 5 seconds before the next request. The other patch is to ignore robots.txt, per this: https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/