Overview

The tool works best when given small requests: fewer than 10 URLs and a snapshot limit under 500. Currently, asyncio's built-in Semaphore does an OK job of avoiding rate limiting when kept to these recommended parameters, but I wonder if there is a better or more dynamic way to deal with this issue? The problem does not appear to be with the CDX API itself. Rather, making numerous requests to web.archive.org when fetching snapshots triggers a temporary ban. All in all, I'm finding web.archive.org to be a bit unpredictable, and I cannot find consistent documentation for making requests to the site.

Possible solutions

Incorporating a library w/ exponential delays

There are Python libraries such as backoff and aiohttp_retry that provide wrappers for retrying requests that get rate limited. I've experimented with both, but wasn't able to get large requests (>50 URLs + >1000 limit) to work reliably.
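
For reference, here is a rough stdlib-only sketch of the exponential-delay idea those libraries wrap. It is not the API of backoff or aiohttp_retry. The `RateLimited` exception, the `make_request` callable, and the default `base`/`cap` values are all hypothetical placeholders, and none of this has been tested against web.archive.org.

```python
import asyncio
import random


class RateLimited(Exception):
    """Hypothetical error raised when the server signals a rate limit."""


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Upper bound (seconds) on the wait before retry `attempt` (0-based)."""
    return min(cap, base * 2 ** attempt)


async def retry_with_backoff(make_request, max_tries=5, base=1.0, cap=60.0):
    """Await make_request(), retrying on RateLimited with exponential delays."""
    for attempt in range(max_tries):
        try:
            return await make_request()
        except RateLimited:
            if attempt == max_tries - 1:
                raise  # out of retries, let the caller decide what to do
            # Full jitter: sleep a random amount up to the exponential bound.
            delay = random.uniform(0, backoff_delay(attempt, base, cap))
            await asyncio.sleep(delay)
```

With the defaults, the upper bound on the wait doubles on each retry (1s, 2s, 4s, 8s) and is capped at 60s. Whether any such schedule is enough to avoid the temporary ban on large requests is exactly the open question.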

Custom solution

There might be a way to determine the best parameters based on the size of the request. Such a solution might dynamically generate a semaphore value, incorporate some kind of jitter between calls, or pause the operation and prompt the user to wait 5 minutes before attempting to resume.
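
A minimal sketch of what the semaphore and jitter parts could look like. The heuristic in `pick_concurrency` is a guess: the `budget` of 5000 just comes from the 10 URLs x 500 limit that currently works, and `max_concurrency`, `max_jitter`, and the `fetch` callable are hypothetical names, not anything that exists in the tool today.

```python
import asyncio
import random


def pick_concurrency(n_urls, limit, max_concurrency=10, budget=5000):
    """Hypothetical heuristic: allow fewer concurrent requests as the job grows.

    `budget` is an untested guess at how many snapshot fetches
    (n_urls * limit) can safely run at full concurrency.
    """
    workload = max(1, n_urls * limit)
    return max(1, min(max_concurrency, max_concurrency * budget // workload))


async def fetch_all(urls, fetch, limit, max_jitter=0.5):
    """Run fetch(url) for every URL under a size-dependent semaphore."""
    semaphore = asyncio.Semaphore(pick_concurrency(len(urls), limit))

    async def worker(url):
        async with semaphore:
            # Random pause so requests do not hit the server in lockstep.
            await asyncio.sleep(random.uniform(0, max_jitter))
            return await fetch(url)

    return await asyncio.gather(*(worker(u) for u in urls))
```

Under these assumed defaults, 10 URLs with a limit of 500 keeps the full concurrency of 10, while 50 URLs with a limit of 1000 drops to a single request at a time. The actual numbers would need tuning against real behavior from web.archive.org.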

bellingcat / wayback-google-analytics

Better solutions to web.archive.org rate limiting #21
