bellingcat / wayback-google-analytics

A lightweight tool for scraping current and historic Google Analytics data
https://pypi.org/project/wayback-google-analytics/
MIT License
193 stars 23 forks

Better solutions to web.archive.org rate limiting #21

Open jclark1913 opened 1 year ago

jclark1913 commented 1 year ago

Overview

The tool works best with small requests: fewer than 10 URLs and a snapshot limit under 500. Within those recommended parameters, asyncio's built-in semaphore does an OK job of avoiding rate limiting, but I wonder if there is a better or more dynamic way to deal with this issue. The problem does not appear to be the CDX API itself; rather, making numerous requests to web.archive.org when fetching snapshots triggers a temporary ban. All in all, I'm finding web.archive.org a bit unpredictable, and I cannot find consistent documentation for making requests to the site.

Possible solutions

Incorporating a library with exponential delays

There are Python libraries such as backoff and aiohttp_retry that provide wrappers for handling rate limits. I've experimented with both, but wasn't able to get large requests (>50 URLs and a limit >1000) to work reliably.
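For reference, here is a minimal stdlib-only sketch of what exponential delay with jitter looks like. It is not how backoff or aiohttp_retry work internally, and not this tool's code. `RateLimited`, `backoff_delay`, `retry_with_backoff`, and the `fetch` callable are hypothetical names, and the `base` and `cap` values are arbitrary.

```python
import asyncio
import random


class RateLimited(Exception):
    """Hypothetical error a fetch function raises when it receives a 429."""


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt)]."""
    return rng() * min(cap, base * (2 ** attempt))


async def retry_with_backoff(fetch, max_attempts=5, base=1.0, cap=60.0,
                             sleep=asyncio.sleep):
    """Await the zero-argument coroutine function `fetch`, retrying on RateLimited.

    The delay ceiling doubles after every failed attempt; the last failure
    is re-raised to the caller.
    """
    for attempt in range(max_attempts):
        try:
            return await fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            await sleep(backoff_delay(attempt, base, cap))
```

Something like `await retry_with_backoff(lambda: get_snapshot(url))` would wrap a single call, where `get_snapshot` stands in for whatever coroutine actually performs the request.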

Custom solution

There might be a way to determine the best parameters based on the size of the request. Such a solution might dynamically generate a semaphore value or incorporate some kind of jitter between calls, or maybe pause the operation and prompt the user to wait 5 minutes before attempting to resume.
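A rough sketch of the "dynamic semaphore plus jitter" idea, assuming the request size can be estimated as URLs times snapshot limit. `pick_concurrency`, `fetch_all`, `budget`, `max_concurrency`, and `max_jitter` are hypothetical names and untested tuning values; the budget of 5000 simply mirrors the recommended 10 URLs x 500 snapshots.

```python
import asyncio
import random


def pick_concurrency(num_urls, snapshot_limit, budget=5000, max_concurrency=10):
    """Shrink the semaphore size as the estimated request count grows.

    `budget` and `max_concurrency` are made-up knobs, not measured values.
    """
    estimated_requests = max(1, num_urls * snapshot_limit)
    scaled = max_concurrency * budget // estimated_requests
    return max(1, min(max_concurrency, scaled))


async def fetch_all(urls, fetch, concurrency, max_jitter=0.5):
    """Run `fetch(url)` for every URL, at most `concurrency` at a time.

    Each call first sleeps for a random 0..max_jitter seconds so requests
    do not all fire at the same instant.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(url):
        async with semaphore:
            await asyncio.sleep(random.uniform(0, max_jitter))
            return await fetch(url)

    return await asyncio.gather(*(guarded(u) for u in urls))
```

With these placeholder numbers, a request of 10 URLs and a limit of 500 keeps the full concurrency of 10, while 50 URLs with a limit of 1000 drops to a single request at a time.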

msramalho commented 1 year ago

So I recently discovered that the CDX API has the following rate limit logic: requests are limited to an average of 60/min. Over that, we start getting 429s. If 429s are ignored for more than a minute, the IP gets blocked for 1 hour. Subsequent 429s over a given period will double that time on each occurrence. So ideally, if we can keep API requests under 60/minute, we will prevent this from happening.
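One way to stay under that limit might be to space requests out on the client side. The sketch below is only an illustration: `MinIntervalLimiter`, `wait`, and `penalize` are hypothetical names, the default of 50/minute is an arbitrary safety margin below 60, and the 60-second pause after a 429 is an assumption rather than documented behaviour.

```python
import asyncio
import time


class MinIntervalLimiter:
    """Space calls at least 60 / max_per_minute seconds apart."""

    def __init__(self, max_per_minute=50, clock=time.monotonic,
                 sleep=asyncio.sleep):
        self.interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_slot = None  # earliest time the next request may start

    async def wait(self):
        """Reserve the next free slot, then sleep until it arrives."""
        now = self._clock()
        slot = now if self._next_slot is None else max(now, self._next_slot)
        # No await happens before this assignment, so concurrent tasks on
        # one event loop each reserve a distinct slot.
        self._next_slot = slot + self.interval
        if slot > now:
            await self._sleep(slot - now)

    def penalize(self, seconds=60.0):
        """After a 429, push the next slot back (assumed 60 s cool-down)."""
        self._next_slot = max(self._clock(), self._next_slot or 0.0) + seconds
```

Each task would call `await limiter.wait()` right before its request and `limiter.penalize()` whenever it sees a 429, so that a 429 is never ignored while other requests keep firing.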