search_calls_per_second needs to be dialed down

edsu commented 10 months ago

I was running some fairly simple data retrieval in this Notebook(see the Wayback section) and I discovered that I got completely blocked from accessing web.archive.org! Luckily I remembered that there was Internet Archive's #wayback-researchers Slack channel, where I got this reponse.

Hi edsu - I found your /cdx requests from 4:29UTC. Those requests are limited to an average of 60/min. Over that and we start sending 429s. If 429s are ignored for more than a minute we block the IP at the firewall (no connection) for 1 hour, which is what happened to you. Subsequent 429s over a given period will double that time each occurrence. If you can keeping your api request < 60/minute you will prevent this from happening.

I thought that the openwayback module's defaults would have prevented me from going over the 60 requests per minute (one per second) and I thought wayback's support for handling 429 responses would have backed off sufficiently fast. I suspect that the goal posts on the server side have changed recently because I had code that worked a month or so ago, which stopped working (resulting in the block).

I was able to get around this by using a custom WaybackSession where I set the search_calls_per_second to 0.5, but I suspect 1.0 would probably work better. Maybe the default could me moved down from 1.5 to 1.0?

Also, perhaps there needs to be some logic to make sure to wait a minute when encountering a 429 as well?

Mr0grog commented 10 months ago

Keeping this open to track other adjustments mentioned here to handle things better in general here (e.g. when we hit a 429) rather than just adjusting the limits (done in #140).

This is also tied in with a long-running goal I’ve had to revamp rate limiting, since part of the problem here is that we implement the throttle on calls to search() and get_memento() instead of in the actual send() method (or around calls to it), where it should be happening. (Elsewhere in EDGI code, we have a nice RateLimit object that works like a lock that I’d like to use here, so we can pass a rate limit into send() or otherwise share a single lock around the codebase as needed.)

Mr0grog commented 10 months ago

Sketching out some more detailed thoughts here:

Should bring in the RateLimit class from web-monitoring-processing and have the session hold two instances of that (one for the CDX server and one for the Memento server, since they do, or at least did, have independent rate limits) instead of just integers.
How do the above actually get applied? Some options:
1. Used automatically in send() based on the URL requested. Pros: easy to implement (otherwise there is a layer of indirection between the client methods and send() — they call request(), which prepares a request to send() and I have to think about how to get through that); don’t have to worry about calling code making mistakes. Cons: maybe too magic, hard to override. Doesn’t give us a way to burst for requests that go together (e.g. following redirects from a memento request).
2. Have the client pass the rate limit to use into the session when making a request. This keeps the current model, where client methods are in charge of what types of requests get which limits, and means we don’t have to magically associate limits with URLs, just with use cases. But we need to figure out how to actually pass that info all the way down to send().
3. Have the client use the limits directly. This is kind of like today, but we just need to move where the limits get applied to be directly around calls to request() or send() rather than at the start of methods like get_memento() or search(). This requires those high-level methods to be more careful (and probably more error prone), but keeps the application of the limits in the place where we have more contextual information to do interesting things with (e.g. if we wanted to add the ability to burst briefly for requests that go together).
I suspect (2) may present pretty hard-to-address logistical hurdles, so either (1) or (3) are better (at least until we get rid of requests — see #58). I like that (1) just works, but (3) gives us more flexibility.
Consider adding some kind of burst-ability to the RateLimit object (e.g. for CDX, the real budget is 60/min, not 1/s, so you could do a few requests faster than 1/s as long as you don’t blow the 1-minute budget). I’m not sure this is that useful for CDX, but it would be super useful for mementos. We get around this right now by not implementing rate limits correctly (that is, implementing them at the call to get_memento() and not to the actual requests that method makes).
Add a more restrictive lock to WaybackSession to handle 429s. The idea here is that a 429 response would lock the session for a little while, and maybe subsequent 429s would lock the session for longer if they have a Retry-After header (IIRC, we do not currently obey Retry-After because we discovered they were frequently a lie and excessively long, but it’s possible that’s no longer true.)

Note some care needs to be taken here to respect timeouts. If the retry is going to wait for too long, we may just want to abandon early. Or make timeouts predictable by waiting until the timeout and then raising. Dunno. (We may also want to add another value to the tuple version of timeouts that controls the overall timeout, rather than just the connect and read timeouts.)

Mr0grog commented 9 months ago

@edsu while doing some final cleanup on the patch release for this (sorry for the delays; I’ve been a bit distracted lately), I was reading the docstring for RateLimitError and it certainly sounds like I had intended to raise on any 429 response instead of just when we’ve exceeded the maximum number of retries like we currently do. Thinking about it now, it feels like that might be more correct, too.

Raising instead of retrying seemed like a big change for v0.4.4, so for that release I just made a longer retry delay (see #142). But I’m thinking about making it always raise instead of ever retrying (responses with status code 429) in v0.5.0.

As a user of this tool, do you have any thoughts on that? Would you prefer the current behavior (retrying but with a very long delay) instead? (Another option might be to make this configurable, but I’d prefer to avoid that complexity and just choose one behavior or the other.)

edsu commented 9 months ago

Yes, I like raising an exception when a 429 is encountered. I think it will help wayback users avoid a situation where they are getting blocked?

Mr0grog commented 9 months ago

Great, that’s about what I was thinking as well. (I think the reason it was being handled here in the first place was that, at EDGI, we used to use this as part of a nightly bulk processing script that took a few hours to run, and having it automatically handle rate limits would have been useful there since you probably don’t want a job that big to stop in the middle.)

Mr0grog commented 9 months ago

Where we are on 429 responses:

v0.4.4 with an automatic one minute pause before retrying (in cases where these responses were already being retried).
v0.5.0 (the next release) will always raise and never retry when receving 429 responses. The code for this is already merged in.

In terms of refactoring rate limits:

I suspect (2) may present pretty hard-to-address logistical hurdles, so either (1) or (3) are better (at least until we get rid of requests…

Another alternative here I hadn’t thought through before is a bigger refactor — have separate sessions for endpoints/URL patterns that have different rate limits (so one WaybackClient might hold multiple WaybackSession objects). This could be a better way to organize things, but I am somewhat concerned that separates things too much. I suspect we may want a single connection pool across all requests to the https://web.archive.org origin, even though we may have different rate limits per path. I’ve posted on the Internet Archive Slack to see if I can get some more official, up-to-date details on this stuff.

Also, on reflection, I think option (1) (WabackSession automatically picks the right rate limit to apply inside send()) is the only currently practical option until we drop our dependency on requests (see #58). Rate limits need to apply to retries and therefore have to be applied inside send(), so (3) is out, and (2) is indeed impractical to correctly implement on top of requests.Session.

(All that said, I plan to try and do #58 this month, too, which will make the above moot. BUT I expect that issue to be bigger and more complex than it sounds, and don’t want to force the changes here to wait for that to be done first.)

Mr0grog commented 9 months ago

OK, got some more details from the inside. Limits are calculated over a 5 minute period. Hard limits are:

/cdx/*: 60/min
/web/timemap/*: 100/min (but this one has been more malleable than others in the past)
/web/NNNNNNNN*/*: 600/min

But they’d like us to set the defaults at 80% of that in the general case. So we should use:

/cdx/*: 48/min
/web/timemap/*: 80/min (but this one has been more malleable than others in the past)
/web/NNNNNNNN*/*: 480/min

429 responses should stop everything for 60 seconds. We now raise a RateLimitError exception here, so I’m not worried too much about this, but I do wonder if we should have a way to lock the rate limit object for that long when raising RateLimitError, so all threads using the same rate limiters get stopped. (Alternatively we leave it up to users to do any thread coordination if they get a RateLimitError.)

Mr0grog commented 9 months ago

This also leaves me wondering if I should change the API so that instead of naming each rate limit:

WaybackSession(
  search_calls_per_second=RateLimit(1),
  memento_calls_per_second=RateLimit(8)
)

We just have a dict of URL prefixes and rate limits:

WaybackSession(rate_limits={
  '/cdx/': RateLimit(1),
  '/web/': RateLimit(8),
  '/': RateLimit(10),
})

This is probably a little less clear for users (now you have to know what URLs the various services live at if you want to make adjustments or understand how the rate limits apply to calls to get_memento() or search()), but it is more direct and flexible.

Otherwise, I think I may at least deprecate the old argument names and change them to the less verbose:

WaybackSession(
  rate_search=RateLimit(1),
  rate_memento=RateLimit(8)
)

edsu commented 9 months ago

I like the route specific names. It should make it easier to connect the limit with the endpoint that is being used?

Mr0grog commented 9 months ago

Thanks! 🙏 I wasn’t sure whether it was better to clarify what client method is getting what limits, or what actual endpoints are getting what limits. If you are more concerned about the latter, that is definitely a good reason to change it up.

Mr0grog commented 9 months ago

I suppose the other complicated thing here is that we need to preserve the defaults for different services, so whatever you pass in will be treated like:

rate_limits | {
    '/web/timemap': DEFAULT_RATE_LIMIT_TIMEMAP,  # shared instance of RateLimit(1.33)
    '/cdx': DEFAULT_RATE_LIMIT_CDX,              # shared instance of RateLimit(0.8)
    '/': DEFAULT_RATE_LIMIT,                     # shared instance of RateLimit(8)
}

So you get this behavior (mostly straightforward, but maybe a bit weird):

WaybackSession(rate_limits = { '/': 10 })
# Resulting limits:
# {
#     '/web/timemap': DEFAULT_RATE_LIMIT_TIMEMAP,
#     '/cdx': DEFAULT_RATE_LIMIT_CDX,
#     '/': 10,
# }

But this more complicated behavior if you make a typo:

WaybackSession(rate_limits = { '/cdx/': 10 })
# Resulting limits:
# {
#     '/web/timemap': DEFAULT_RATE_LIMIT_TIMEMAP,
#     '/cdx/': 10,
#     '/cdx': DEFAULT_RATE_LIMIT_CDX,
#     '/': DEFAULT_RATE_LIMIT,
# }

(I guess we could also restrict which prefixes can be given limits and raise an error if you try to set a disallowed one, but that kills some of the flexibility here, and I feel like the nice names are less error-prone and more statically checkable and editor-assistable/tab-completable at that point.)

Mr0grog commented 8 months ago

Thinking about the above some more, I don’t think I’m going to make any changes to the WaybackSession arguments right now. Providing a dictionary with paths might seem more straightforward in the simplest cases, but it makes customizing some rates while leaving other defaults in place more complicated. More importantly, it’s harder to make non-breaking tweaks or changes in the library — the default paths cease to be tweakable and I’m not confident we won’t want/need to tweak them (we could do things like provide constants for the paths, but that is both error-prone and unnecessarily complicated for users in my experience).

There’s definitely room for rethinking this going forward but I’m not comfortable with including it in the other, more immediate work around rate limiting that this issue is focused on.

edgi-govdata-archiving / wayback

search_calls_per_second needs to be dialed down #137