edgi-govdata-archiving / wayback

A Python API to the Internet Archive Wayback Machine
https://wayback.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

What's the current rate limit for CDX search? #153

Closed: itsrun closed this issue 8 months ago

itsrun commented 8 months ago

Hi there, I'm currently sending a search request every 1.25 seconds, continuously, but I soon started receiving 429 errors. May I ask what the current recommended rate limit for the CDX search API is? Thanks!

Mr0grog commented 8 months ago

Just to be clear, this isn't an official package from the Internet Archive, so for most questions not specifically about this Python package, you should contact them directly.

BUT I do try to keep in close contact with the staff there, and the current limit for requests to web.archive.org/cdx/* is 60 requests/minute, averaged over a 5-minute window. Those limits are generally based on IP address, so if you are sharing an IP with someone else (e.g. if you are behind any kind of proxy or router, or working from a shared server), your requests will be grouped together for the purposes of rate limiting. The limits can also differ for particular IPs that have been granted a higher or lower allowance because of past abuse or other issues.
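For anyone making their own requests outside this package, the "60 requests/minute averaged over a 5-minute window" figure works out to at most 300 requests in any trailing 300 seconds. Below is a minimal client-side sketch of that budget. How the server actually computes its average is not stated in this thread, so the sliding-window interpretation is an assumption, and `SlidingWindowBudget` is a hypothetical helper, not part of this package.

```python
import time
from collections import deque


class SlidingWindowBudget:
    """Client-side cap on requests within a trailing time window.

    Assumption: 60 requests/minute over 5 minutes is treated as
    "no more than 300 requests in any trailing 300 seconds".
    """

    def __init__(self, max_requests=300, window_seconds=300.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._sent = deque()  # timestamps of requests still inside the window

    def _expire(self, now):
        # Drop timestamps that have aged out of the trailing window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()

    def seconds_until_allowed(self):
        """Return 0.0 if a request may be sent now, else how long to wait."""
        now = self.clock()
        self._expire(now)
        if len(self._sent) < self.max_requests:
            return 0.0
        # The window is full: wait until the oldest request ages out.
        return self.window_seconds - (now - self._sent[0])

    def record(self):
        """Note that a request was just sent."""
        self._sent.append(self.clock())
```

Typical use would be to sleep for `seconds_until_allowed()` before each request and call `record()` right after sending it. Because the limit is per IP, this only helps if every process sharing that IP draws from the same budget.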

If you are using this package, it does its best to stick to the limits for you automatically, but we fixed some significant rate limit issues in the latest release (v0.4.4), and the next release (v0.5.0, hopefully later this month 🤞) is a complete overhaul of rate limits, so make sure you're on the latest version!

Also keep in mind that rate limits in this library are expressed in calls per second, so to make a request every 1.25s, you should configure:

from wayback import WaybackClient, WaybackSession
client = WaybackClient(WaybackSession(search_calls_per_second=0.8))

And make sure to lower that value even further if you are using multiple clients on multiple threads, since they all count against the same limit. Also be careful not to create too many HTTP connections if you are multithreading! Managing connections will be easier in v0.5.0, but in the current release it is messy; see https://github.com/edgi-govdata-archiving/wayback/issues/106#issuecomment-1305986211.
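The arithmetic behind those two points is small enough to sketch. The helpers below are hypothetical, not part of this package: one converts "a request every N seconds" into a calls-per-second value, and the other splits a shared budget evenly across clients. The even split is an assumption; any division that keeps the total under the limit would do.

```python
def calls_per_second(interval_seconds):
    """Convert 'one request every N seconds' into a calls-per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return 1.0 / interval_seconds


def per_client_rate(total_calls_per_second, client_count):
    """Split one shared rate budget evenly across several clients."""
    if client_count < 1:
        raise ValueError("client_count must be at least 1")
    return total_calls_per_second / client_count


# One request every 1.25 s is 0.8 calls per second in total.
total = calls_per_second(1.25)
# With 4 clients on 4 threads, each one gets a quarter of that budget.
each = per_client_rate(total, 4)
```

The per-client result is what you would pass as `search_calls_per_second` when creating each session.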

Finally, once you receive a 429 response, make sure to stop all new requests immediately and do not start again for at least 60s. If you make new requests during that 60s window, your IP will get blocked for progressively longer time periods, from a few hours up to a few days.
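If you are issuing requests yourself across several threads, one way to honor that 60-second pause is a shared gate that every thread checks before sending. This is only a sketch: `CooldownGate` is a hypothetical helper, not something this package provides, and the 60-second default is simply the minimum mentioned above.

```python
import threading
import time


class CooldownGate:
    """Shared gate that holds back new requests for a while after a 429."""

    def __init__(self, cooldown_seconds=60.0, clock=time.monotonic, sleep=time.sleep):
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._resume_at = 0.0  # clock time before which no request may start

    def trip(self):
        """Call when any request gets a 429; pushes the resume time out."""
        with self._lock:
            self._resume_at = max(
                self._resume_at, self._clock() + self.cooldown_seconds
            )

    def remaining(self):
        """Seconds left before requests may resume (0.0 if none)."""
        with self._lock:
            return max(0.0, self._resume_at - self._clock())

    def wait(self):
        """Call before every request; sleeps until the cooldown has passed."""
        while True:
            delay = self.remaining()
            if delay <= 0:
                return
            self._sleep(delay)
```

Each worker would call `gate.wait()` before a request and `gate.trip()` as soon as it sees a 429, so a single 429 pauses every thread rather than just the one that received it.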

itsrun commented 8 months ago

Thanks for the clear explanation! I'm running the script (single-threaded) from a GCP VM, so I guess that's why it got rate limited so quickly.