The project uses aiohttp and asyncio to gather and run large numbers of tasks asynchronously. This works well and is very fast for around 10 URLs, but with large numbers of URLs we can still get rate limited and see an aiohttp.client_exceptions.ClientConnectionError.
Solution
The easiest and probably best solution is to add a semaphore that limits concurrent requests. A limit like this already exists within the processing of a single URL (when requesting snapshots from the CDX URL, for example), but there is no limit at a higher level, across URLs. The final semaphore limit should probably be set somewhere between 10 and 20, and it will take a bit of tweaking to find the balance between performance and not getting rate limited.
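A minimal sketch of what the higher-level limit could look like, using asyncio.Semaphore around each per-URL task. The names process_url, process_all and MAX_CONCURRENT_URLS are hypothetical and not taken from the project. The stand-in worker sleeps instead of making a real aiohttp request, so the sketch runs without network access.

```python
import asyncio

# Assumed starting value; the right number likely sits between 10 and 20
# and needs tuning against the rate limits actually observed.
MAX_CONCURRENT_URLS = 10


async def process_url(url: str) -> str:
    # Hypothetical stand-in for the project's per-URL work, which would
    # issue its requests through aiohttp rather than sleeping.
    await asyncio.sleep(0.01)
    return url


async def process_all(urls, limit=MAX_CONCURRENT_URLS, worker=process_url):
    # One semaphore shared by every task caps how many URLs are in flight.
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        # At most `limit` workers are inside this block at any moment;
        # the rest wait here until a slot frees up.
        async with semaphore:
            return await worker(url)

    # gather returns results in the same order as `urls`.
    return await asyncio.gather(*(bounded(u) for u in urls))
```

It could be called as asyncio.run(process_all(list_of_urls)). Because the semaphore wraps the whole per-URL task, any existing lower-level limit (such as the one around CDX snapshot requests) would keep working unchanged inside the worker.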