edgi-govdata-archiving / wayback

A Python API to the Internet Archive Wayback Machine
https://wayback.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

search_calls_per_second needs to be dialed down #137

Closed: edsu closed this issue 8 months ago

edsu commented 10 months ago

I was running some fairly simple data retrieval in this Notebook (see the Wayback section) and I discovered that I got completely blocked from accessing web.archive.org! Luckily I remembered the Internet Archive's #wayback-researchers Slack channel, where I got this response.

Hi edsu - I found your /cdx requests from 4:29 UTC. Those requests are limited to an average of 60/min. Over that and we start sending 429s. If 429s are ignored for more than a minute, we block the IP at the firewall (no connection) for 1 hour, which is what happened to you. Subsequent 429s over a given period will double that time each occurrence. If you can keep your API requests < 60/minute, you will prevent this from happening.

I thought that the wayback module's defaults would have prevented me from going over 60 requests per minute (one per second), and I thought wayback's support for handling 429 responses would have backed off sufficiently fast. I suspect the goalposts on the server side have changed recently, because I had code that worked a month or so ago which stopped working (resulting in the block).

I was able to get around this by using a custom WaybackSession where I set search_calls_per_second to 0.5, but I suspect 1.0 would probably work better. Maybe the default could be moved down from 1.5 to 1.0?
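For reference, here's a minimal sketch of that workaround (WaybackSession, WaybackClient, and search_calls_per_second are the package's public API; the example URL and printed record fields are just illustrative):

from wayback import WaybackClient, WaybackSession

# Throttle CDX search calls to one request every two seconds (0.5/second),
# comfortably under the server's 60/minute limit.
session = WaybackSession(search_calls_per_second=0.5)
client = WaybackClient(session)

for record in client.search('https://epa.gov/'):
    print(record.timestamp, record.raw_url)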

Also, perhaps there should be some logic to make sure we wait a full minute when encountering a 429?

Mr0grog commented 10 months ago

Keeping this open to track the other adjustments mentioned here for handling things better in general (e.g. when we hit a 429), rather than just adjusting the limits (done in #140).

This is also tied in with a long-running goal I’ve had to revamp rate limiting, since part of the problem here is that we implement the throttle on calls to search() and get_memento() instead of in the actual send() method (or around calls to it), where it should be happening. (Elsewhere in EDGI code, we have a nice RateLimit object that works like a lock that I’d like to use here, so we can pass a rate limit into send() or otherwise share a single lock around the codebase as needed.)
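For illustration, a lock-like rate limit object might look something like this (a hypothetical sketch, not EDGI's actual RateLimit):

import threading
import time

class RateLimit:
    # A lock-like object enforcing a minimum interval between the operations
    # it guards; a single instance can be shared across threads and call sites.
    def __init__(self, calls_per_second):
        self._interval = 1.0 / calls_per_second
        self._lock = threading.Lock()
        self._next_time = 0.0

    def __enter__(self):
        with self._lock:
            wait = self._next_time - time.monotonic()
            if wait > 0:
                time.sleep(wait)
            self._next_time = time.monotonic() + self._interval

    def __exit__(self, *exc_info):
        return False

# Used inside send() (or around calls to it), so every request, including
# retries, passes through the shared limiter:
#     with self.rate_limit:
#         response = super().send(request, **kwargs)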

Mr0grog commented 10 months ago

Sketching out some more detailed thoughts here:

Mr0grog commented 9 months ago

@edsu while doing some final cleanup on the patch release for this (sorry for the delays; I’ve been a bit distracted lately), I was reading the docstring for RateLimitError, and it certainly sounds like I had intended to raise on any 429 response instead of only when we’ve exceeded the maximum number of retries, as we currently do. Thinking about it now, that feels like it might be more correct, too.

Raising instead of retrying seemed like too big a change for v0.4.4, so for that release I just made a longer retry delay (see #142). But I’m thinking about making responses with status code 429 always raise instead of ever retrying in v0.5.0.
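From the caller’s perspective, the always-raise behavior would look roughly like this (a sketch; the example URL and the recovery strategy are placeholders):

import time

from wayback import WaybackClient
from wayback.exceptions import RateLimitError

client = WaybackClient()
try:
    results = list(client.search('https://epa.gov/'))
except RateLimitError:
    # A 429 surfaces immediately instead of being retried; the caller
    # decides how to react, e.g. wait out the limit and try again.
    time.sleep(60)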

As a user of this tool, do you have any thoughts on that? Would you prefer the current behavior (retrying but with a very long delay) instead? (Another option might be to make this configurable, but I’d prefer to avoid that complexity and just choose one behavior or the other.)

edsu commented 9 months ago

Yes, I like raising an exception when a 429 is encountered. I think it will help wayback users avoid getting blocked?

Mr0grog commented 9 months ago

Great, that’s about what I was thinking as well. (I think the reason it was being handled here in the first place was that, at EDGI, we used to use this as part of a nightly bulk processing script that took a few hours to run, and having it automatically handle rate limits would have been useful there since you probably don’t want a job that big to stop in the middle.)

Mr0grog commented 9 months ago

Where we are on 429 responses:


In terms of refactoring rate limits:

I suspect (2) may present some pretty hard-to-address logistical hurdles, so either (1) or (3) would be better (at least until we get rid of requests…).

Another alternative I hadn’t thought through before is a bigger refactor — have separate sessions for endpoints/URL patterns that have different rate limits (so one WaybackClient might hold multiple WaybackSession objects). This could be a better way to organize things, but I am somewhat concerned that it separates things too much. I suspect we may want a single connection pool across all requests to the https://web.archive.org origin, even though we may have different rate limits per path. I’ve posted on the Internet Archive Slack to see if I can get some more official, up-to-date details on this stuff.

Also, on reflection, I think option (1) (WaybackSession automatically picks the right rate limit to apply inside send()) is the only currently practical option until we drop our dependency on requests (see #58). Rate limits need to apply to retries and therefore have to be applied inside send(), so (3) is out, and (2) is indeed impractical to implement correctly on top of requests.Session.
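A rough sketch of what option (1) might look like, reusing the hypothetical lock-like RateLimit from earlier (illustrative only, not the library’s actual code):

from urllib.parse import urlparse

import requests

class RateLimitedSession(requests.Session):
    def __init__(self, rate_limits):
        # rate_limits maps URL path prefixes to shared RateLimit instances,
        # e.g. {'/cdx': RateLimit(0.8), '/': RateLimit(8)}.
        super().__init__()
        self.rate_limits = rate_limits

    def send(self, request, **kwargs):
        # Throttling here, rather than in search()/get_memento(), means any
        # retry that goes back through send() is rate-limited too.
        path = urlparse(request.url).path
        matches = [p for p in self.rate_limits if path.startswith(p)]
        with self.rate_limits[max(matches, key=len)]:
            return super().send(request, **kwargs)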

(All that said, I plan to try and do #58 this month, too, which will make the above moot. BUT I expect that issue to be bigger and more complex than it sounds, and don’t want to force the changes here to wait for that to be done first.)

Mr0grog commented 9 months ago

OK, got some more details from the inside. Limits are calculated over a 5 minute period. Hard limits are:

- /cdx/: 1 request/second (60/minute)
- timemaps: about 1.66 requests/second (100/minute)
- everything else: 10 requests/second (600/minute)

But they’d like us to set the defaults at 80% of that in the general case. So we should use:

- /cdx/: RateLimit(0.8)
- timemaps: RateLimit(1.33)
- everything else: RateLimit(8)

429 responses should stop everything for 60 seconds. We now raise a RateLimitError exception here, so I’m not too worried about this, but I do wonder if we should have a way to lock the rate limit object for that long when raising RateLimitError, so all threads using the same rate limiters get stopped. (Alternatively, we leave it up to users to do any thread coordination if they get a RateLimitError.)
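Building on the hypothetical RateLimit sketch above, that lock-out could be a method that pushes back the next allowed call for every sharer:

def block_for(self, seconds):
    # Hypothetical addition to the RateLimit sketch above: delay the next
    # allowed acquisition so every thread sharing this limiter pauses.
    with self._lock:
        self._next_time = max(self._next_time, time.monotonic() + seconds)

# On a 429, before raising:
#     rate_limit.block_for(60)
#     raise RateLimitError(...)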

Mr0grog commented 9 months ago

This also leaves me wondering if I should change the API so that instead of naming each rate limit:

WaybackSession(
  search_calls_per_second=RateLimit(1),
  memento_calls_per_second=RateLimit(8)
)

We just have a dict of URL prefixes and rate limits:

WaybackSession(rate_limits={
  '/cdx/': RateLimit(1),
  '/web/': RateLimit(8),
  '/': RateLimit(10),
})

This is probably a little less clear for users (now you have to know what URLs the various services live at if you want to make adjustments or understand how the rate limits apply to calls to get_memento() or search()), but it is more direct and flexible.

Otherwise, I think I may at least deprecate the old argument names and change them to the less verbose:

WaybackSession(
  rate_search=RateLimit(1),
  rate_memento=RateLimit(8)
)
edsu commented 9 months ago

I like the route-specific names. It should make it easier to connect the limit with the endpoint that is being used?

Mr0grog commented 9 months ago

Thanks! 🙏 I wasn’t sure whether it was better to clarify what client method is getting what limits, or what actual endpoints are getting what limits. If you are more concerned about the latter, that is definitely a good reason to change it up.

Mr0grog commented 9 months ago

I suppose the other complicated thing here is that we need to preserve the defaults for different services, so whatever you pass in will be treated like:

{
    '/web/timemap': DEFAULT_RATE_LIMIT_TIMEMAP,  # shared instance of RateLimit(1.33)
    '/cdx': DEFAULT_RATE_LIMIT_CDX,              # shared instance of RateLimit(0.8)
    '/': DEFAULT_RATE_LIMIT,                     # shared instance of RateLimit(8)
} | rate_limits

So you get this behavior (mostly straightforward, but maybe a bit weird):

WaybackSession(rate_limits = { '/': 10 })
# Resulting limits:
# {
#     '/web/timemap': DEFAULT_RATE_LIMIT_TIMEMAP,
#     '/cdx': DEFAULT_RATE_LIMIT_CDX,
#     '/': 10,
# }

But you get this more complicated behavior if you make a typo:

WaybackSession(rate_limits = { '/cdx/': 10 })
# Resulting limits:
# {
#     '/web/timemap': DEFAULT_RATE_LIMIT_TIMEMAP,
#     '/cdx/': 10,
#     '/cdx': DEFAULT_RATE_LIMIT_CDX,
#     '/': DEFAULT_RATE_LIMIT,
# }
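For illustration, assuming lookups pick the most specific matching prefix, the resolution could work like this hypothetical helper:

def limit_for(rate_limits, path):
    # Choose the longest prefix that matches the request path; the '/'
    # entry guarantees there is always a fallback.
    matches = [prefix for prefix in rate_limits if path.startswith(prefix)]
    return rate_limits[max(matches, key=len)]

# With the typo example above, a request path under '/cdx/' matches both
# '/cdx/' and '/cdx', and the longer, typo'd key wins.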

(I guess we could also restrict which prefixes can be given limits and raise an error if you try to set a disallowed one, but that kills some of the flexibility here, and I feel like the nice names are less error-prone, more statically checkable, and more editor-assistable/tab-completable at that point.)

Mr0grog commented 8 months ago

Thinking about the above some more, I don’t think I’m going to make any changes to the WaybackSession arguments right now. Providing a dictionary with paths might seem more straightforward in the simplest cases, but it makes customizing some rates while leaving other defaults in place more complicated. More importantly, it’s harder to make non-breaking tweaks or changes in the library — the default paths cease to be adjustable, and I’m not confident we won’t want or need to tweak them (we could provide constants for the paths, but in my experience that is both error-prone and unnecessarily complicated for users).

There’s definitely room for rethinking this going forward, but I’m not comfortable including it in the other, more immediate work around rate limiting that this issue is focused on.