edgi-govdata-archiving / wayback

A Python API to the Internet Archive Wayback Machine
https://wayback.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

ClosedConnectionError & rate limiting #82

Open jordannickerson opened 2 years ago

jordannickerson commented 2 years ago

I apologize for the slight abuse of the term "Issues", as I don't think the problem I'm encountering is really a bug in your project.

While using wayback, I've run into the connection being closed by the remote host. I've been doing a lot of searches and pulling a lot of mementos, so I suspect I'm hitting a rate limit, even though I've put a fairly large delay (around 5 seconds) between queries.

Is there a best practice for how much we should throttle usage, and is there anything we should do beyond looping over all our searches with a time.sleep call to avoid slamming the server?
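For concreteness, the kind of loop I'm describing looks roughly like this (a simplified sketch; `do_search` stands in for our actual WaybackClient.search() / get_memento() calls):

```python
import time

def run_queries(queries, do_search, delay=5.0):
    """Run each query with a fixed pause in between to avoid
    hammering the server."""
    results = []
    for query in queries:
        results.append(do_search(query))
        time.sleep(delay)  # fixed ~5 second pause between requests
    return results
```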

Mr0grog commented 2 years ago

No worries! TBH, I've lost track of the Wayback Machine's current rate limits, but I think earlier this year it was 10 requests/second for both CDX search (i.e. WaybackClient.search()) and mementos (WaybackClient.get_memento()).
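If that figure is right, enforcing a minimum spacing between calls (rather than sleeping a fixed amount regardless of how long each request took) keeps you at the limit without wasting time. A minimal sketch, assuming the 10 requests/second figure above (which may have changed):

```python
import time

class Throttle:
    """Enforce a minimum interval between successive calls."""

    def __init__(self, per_second=10):
        self.interval = 1.0 / per_second
        self.last = 0.0

    def wait(self):
        # Sleep just long enough that calls are at least
        # `interval` seconds apart.
        now = time.monotonic()
        remaining = self.interval - (now - self.last)
        if remaining > 0:
            time.sleep(remaining)
        self.last = time.monotonic()
```

Then call `throttle.wait()` immediately before each `client.search()` or `client.get_memento()` call.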

If you are using multiple threads, you can do some messy stuff to share connections across threads, which has helped us reduce connection errors with Wayback. These code samples show how:

https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/12e3c15177807a118e9bb344ed1daedb47a14a30/web_monitoring/cli/cli.py#L211-L252

https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/12e3c15177807a118e9bb344ed1daedb47a14a30/web_monitoring/cli/cli.py#L473-L485

https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/12e3c15177807a118e9bb344ed1daedb47a14a30/web_monitoring/cli/cli.py#L320-L323

That’s way over-complicated, and I hope to get that functionality built into this package as part of #58.
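In the meantime, a hand-rolled retry wrapper can absorb the occasional dropped connection. A sketch (the builtin `ConnectionError` here is a stand-in; the exact exception class you see from wayback may differ):

```python
import time

def call_with_backoff(func, *args, retries=4, base_delay=5.0, **kwargs):
    """Call func, retrying with exponential backoff when the
    connection is closed by the remote host."""
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except ConnectionError:
            if attempt == retries - 1:
                raise
            # Back off: 5 s, 10 s, 20 s, ... before retrying.
            time.sleep(base_delay * (2 ** attempt))
```

For example, something like `call_with_backoff(client.get_memento, record)` in your loop.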

You might also find some useful inspiration in other parts of that script, which we use to pull ~20 GB of data from Wayback every night. It’s really messy and a bit hard to follow, though. (It’s been through a lot of iterations with limited time to clean it up over the last few years, and it’s what this package was originally extracted from.)
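The thread-sharing trick in those samples boils down to giving each worker thread one lazily created client that it reuses for all of its requests, so connections get pooled instead of being reopened per request. A simplified sketch of that pattern (not the linked code itself):

```python
import threading

# Per-thread storage: each thread sees its own attributes on this object.
_local = threading.local()

def get_client(factory):
    """Return this thread's client, creating it on first use.

    `factory` is any zero-argument callable, e.g. WaybackClient.
    """
    if not hasattr(_local, 'client'):
        _local.client = factory()
    return _local.client
```

Inside each worker you'd then do something like `client = get_client(WaybackClient)` and reuse it for every request that thread makes.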

(Sorry about the slow feedback here, @jordannickerson. I’ve been semi-offline for the last couple weeks.)

Mr0grog commented 8 months ago

Quick update: I’m considering this a duplicate of #58, which I am pretty committed to actually solving this month.