bellingcat / wayback-google-analytics

A lightweight tool for scraping current and historic Google Analytics data
https://pypi.org/project/wayback-google-analytics/
MIT License
187 stars 22 forks

Connection error if url list > ~10 #19

Closed jclark1913 closed 10 months ago

jclark1913 commented 10 months ago

Overview

The project uses aiohttp and asyncio to gather and run large numbers of tasks asynchronously. This works well for 10 or so URLs, where it is lightning fast, but larger URL lists can still get us rate limited, which surfaces as an aiohttp.client_exceptions.ClientConnectionError.

Solution

The easiest and probably best solution is to add a semaphore to limit concurrent requests. This already happens when processing individual URLs (when requesting snapshots from the CDX URL, for example), but there's no limit at a higher level. The final semaphore limit should probably be set somewhere between 10 and 20, and it will take a bit of tweaking to find the balance between performance and not getting rate limited.
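
A rough sketch of the higher-level semaphore idea, not the code from the actual fix. The names `process_url`, `bounded`, `process_all` and `MAX_CONCURRENT` are hypothetical, and `process_url` only sleeps as a stand-in for the real aiohttp requests so the example runs without network access:

```python
import asyncio

MAX_CONCURRENT = 10  # assumed starting value; the issue suggests 10 to 20


async def process_url(url):
    """Hypothetical stand-in for the per-URL work (real code would issue aiohttp requests)."""
    await asyncio.sleep(0.01)
    return url, "ok"


async def bounded(semaphore, url, stats):
    # At most `limit` coroutines can be inside this block at the same time.
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        try:
            return await process_url(url)
        finally:
            stats["active"] -= 1


async def process_all(urls, limit=MAX_CONCURRENT):
    # Create the semaphore inside the running event loop.
    semaphore = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}  # only used to observe peak concurrency
    tasks = [bounded(semaphore, url, stats) for url in urls]
    # gather returns results in the same order as the input URLs.
    results = await asyncio.gather(*tasks)
    return results, stats["peak"]


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(50)]
    results, peak = asyncio.run(process_all(urls))
    print(f"processed {len(results)} urls, peak concurrency {peak}")
```

All tasks are still scheduled up front, but only `limit` of them do work at any moment, so the limit can be tuned in one place while testing against rate limiting.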

jclark1913 commented 10 months ago

Resolved with PR #20