This PR addresses issue #19 and attempts to mitigate some of the 443 errors that result from large queries. It adds a semaphore with a limit of 10 in main.py, which is inherited by the functions called thereafter, and adds a 5-second delay between each CDX API call when retrieving snapshots.
The program can now handle much larger requests, but very sizeable requests can still fail. As a result, a warning message now appears when a query contains more than 10 URLs or requests more than 500 archived snapshots per URL.
Changelog
API errors
Added an asyncio.sleep for each task passed into asyncio.gather() when processing URLs. This spaces out the requests to the CDX API and prevents the rate limiting that was so prevalent previously.
The semaphore is now shared globally. This stops the runaway number of concurrent calls that occurred with more than ~10 URLs.
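The two fixes above can be sketched roughly as follows. This is an illustrative stand-in, not the project's actual code: the names `fetch_snapshots`, `CDX_DELAY`, and `MAX_CONCURRENT` are hypothetical, and the real network call is elided.

```python
import asyncio

# Hypothetical constants mirroring the PR's settings.
CDX_DELAY = 5          # seconds between CDX API calls
MAX_CONCURRENT = 10    # semaphore limit shared by all tasks

# Shared globally so every task counts against the same limit.
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def fetch_snapshots(url: str, index: int) -> str:
    # Stagger each task so CDX requests are spaced out rather than bursting.
    await asyncio.sleep(index * CDX_DELAY)
    async with semaphore:            # at most MAX_CONCURRENT requests in flight
        await asyncio.sleep(0)       # placeholder for the real CDX API request
        return f"snapshots for {url}"

async def process_urls(urls: list[str]) -> list[str]:
    tasks = [fetch_snapshots(u, i) for i, u in enumerate(urls)]
    return await asyncio.gather(*tasks)
```

Because the semaphore lives at module level rather than being created per call, every coroutine draws from the same pool of 10 slots, which is what prevents the runaway call count when many URLs are queued at once.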
Readme
Added a "Limitations & Rate Limits" section warning against very large queries
Tweaked directions for running/installing from source
New Features
Added a --skip_current flag that skips requests to the current version of the website. This is useful when you're looking at a large number of dead pages or only want to view historical data.
Added a warning/confirmation message for large requests. If a user passes more than 10 URLs or a limit above 500, they see a warning encouraging them to break the query into smaller pieces.
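The large-request guard might look something like the sketch below. The thresholds (10 URLs / 500 snapshots) come from the PR, but the function name and prompt wording are hypothetical.

```python
# Illustrative sketch of the large-request warning; not the project's actual code.
MAX_URLS = 10     # recommended maximum number of URLs per query
MAX_LIMIT = 500   # recommended maximum snapshots per URL

def confirm_large_request(num_urls: int, limit: int, ask=input) -> bool:
    """Warn and ask for confirmation when a query exceeds the recommended limits."""
    if num_urls <= MAX_URLS and limit <= MAX_LIMIT:
        return True
    print(
        f"Warning: {num_urls} URLs with a limit of {limit} snapshots may be "
        "rate limited. Consider breaking your query into smaller pieces."
    )
    return ask("Continue anyway? [y/N] ").strip().lower() == "y"
```

Injecting the prompt function (`ask`) keeps the confirmation logic testable without patching stdin.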
Known issues
While the CDX API issue is largely resolved, rate limiting by web.archive.org itself can still occur. Most users are unlikely to encounter this when staying within the recommended limits.