This PR addresses issue #19 and attempts to mitigate some of the 443 errors that result from large queries. It adds a semaphore with a limit of 10 in main.py, which is inherited by the functions called thereafter, and adds a 5-second delay between each CDX API call when retrieving snapshots.
The program can now handle much larger requests, but very sizeable requests can still fail. As a result, a warning message now appears when a query contains more than 10 URLs or requests more than 500 archived snapshots per URL.
Changelog
API errors
Added an asyncio.sleep for each task passed into asyncio.gather() when processing URLs. This spaces out the requests to the CDX API and prevents the rate limiting that was so prevalent previously.
The semaphore is now shared globally. This stops the runaway number of concurrent calls that occurred with more than ~10 URLs.
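The two fixes above can be sketched roughly as follows. This is an illustrative stand-in, not the project's actual code: the names `fetch_snapshots`, `CDX_DELAY`, and `MAX_CONCURRENT` are hypothetical, and the real network call is elided.

```python
import asyncio

# Hypothetical constants mirroring the PR's settings.
CDX_DELAY = 5          # seconds between CDX API calls
MAX_CONCURRENT = 10    # semaphore limit shared by all tasks

# Shared globally so every task counts against the same limit.
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def fetch_snapshots(url: str, index: int) -> str:
    # Stagger each task so CDX requests are spaced out rather than bursting.
    await asyncio.sleep(index * CDX_DELAY)
    async with semaphore:            # at most MAX_CONCURRENT requests in flight
        await asyncio.sleep(0)       # placeholder for the real CDX API request
        return f"snapshots for {url}"

async def process_urls(urls: list[str]) -> list[str]:
    tasks = [fetch_snapshots(u, i) for i, u in enumerate(urls)]
    return await asyncio.gather(*tasks)
```

Because the semaphore lives at module level rather than being created per call, every coroutine draws from the same pool of 10 slots, which is what prevents the runaway call count when many URLs are queued at once.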
Readme
Added a "Limitations & Rate Limits" section warning against very large queries
Tweaked directions for running/installing from source
New Features
Added a --skip_current flag that skips requests to the current version of the website. This is useful when you're looking at a large number of dead pages or only want to view historical data.
Added a warning/confirmation message for large requests. If a user passes more than 10 URLs or a limit above 500, they see a warning encouraging them to break the query into smaller pieces.
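The large-request guard might look something like the sketch below. The thresholds (10 URLs / 500 snapshots) come from the PR, but the function name and prompt wording are hypothetical.

```python
# Illustrative sketch of the large-request warning; not the project's actual code.
MAX_URLS = 10     # recommended maximum number of URLs per query
MAX_LIMIT = 500   # recommended maximum snapshots per URL

def confirm_large_request(num_urls: int, limit: int, ask=input) -> bool:
    """Warn and ask for confirmation when a query exceeds the recommended limits."""
    if num_urls <= MAX_URLS and limit <= MAX_LIMIT:
        return True
    print(
        f"Warning: {num_urls} URLs with a limit of {limit} snapshots may be "
        "rate limited. Consider breaking your query into smaller pieces."
    )
    return ask("Continue anyway? [y/N] ").strip().lower() == "y"
```

Injecting the prompt function (`ask`) keeps the confirmation logic testable without patching stdin.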
Known issues
While the CDX API issue is largely resolved, rate limiting by web.archive.org itself can still occur. Most users are unlikely to encounter this when staying within the recommended limits.