geosolutions-it / geoserver

Official GeoServer repository
http://geoserver.org

Introduce global rate control in control flow #183

Open simboss opened 5 years ago

simboss commented 5 years ago

I would want the ability to set a global rate control that would apply by default to all users and IPs for which there is no specific setting.

When there is a specific setting, it should override the global one, so that we can give some users/IPs either better or worse throughput than normal users.

I would need an estimate first and a discussion.

aaime commented 5 years ago

Control-flow has no notion of "everything else": rules are applied if they match their filter, and skipped if they don't. The order in which the config file is written does not matter either; rules have a priority based on how restrictive they are, get sorted, and are applied in sequence from most strict to least (e.g., ows.global is normally applied last). We could likely change the sorting so that we can maybe start talking about "everything else", but it would invalidate existing configuration files.

What we could have without major changes is a rule for the non-authenticated user; writing one for "any other IP" might be a problem instead, see above.
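To make the matching/ordering point concrete, a minimal controlflow.properties sketch (numbers are illustrative only) could look like this:

```properties
# Per-user concurrency: at most 6 parallel requests for any single user
user=6
# Per-request concurrency: at most 8 WMS GetMap requests rendered in parallel
ows.wms.getmap=8
# Global concurrency: at most 100 OWS requests in flight overall;
# least restrictive matcher, so normally applied last
ows.global=100
```

Every rule whose filter matches the incoming request is applied; there is no rule that fires only when nothing else matched.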

Also, just to be sure we are talking about the same rate control notion, the control-flow rate limits can be configured in two ways: requests exceeding the limit can either be rejected outright, or delayed.
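For example, using the rate control syntax already supported by control-flow (illustrative values):

```properties
# Without a delay: requests past the 100th in the hour are rejected immediately
user.ows=100/h
# With a delay: excess GetMap requests are delayed rather than failed outright
user.ows.wms.getmap=100/h;30s
```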

simboss commented 5 years ago

Let me clarify this better. I would want to be able to specify rules like this:

global.ows[.<service>[.<request>[.<outputFormat>]]]=<requests>/<unit>[;<delay>s]

which would apply globally to GeoServer, not per user and not per IP. From an operations perspective it is way better to use rules like this, rather than limiting the number of parallel requests being executed, to protect GeoServer from spikes.

Ideally then I would like to be able also to have specific rules like:

user.ows[.<service>[.<request>[.<outputFormat>]]]=<requests>/<unit>[;<delay>s]
ip.ows[.<service>[.<request>[.<outputFormat>]]]=<requests>/<unit>[;<delay>s]

that override the global one.
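Putting the two together, a configuration would look something like this (hypothetical, since the global.* keys do not exist yet; numbers are illustrative):

```properties
# Proposed: instance-wide rate limit, the default for everybody
global.ows=1000/s
# Per-user / per-IP rate limits that take precedence for whoever they match
user.ows=50/s
ip.ows=100/s
```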

I don't get the second point in your comment. In my understanding I can use the delay parameter to decide if requests exceeding the limit must be dropped or delayed. I would stick to that.

aaime commented 5 years ago

The point is that in control-flow there are no overrides, nor "everything else" matchers; each rule is applied in turn.

Say I have a rule stating that a single user cannot do more than 10 requests a minute, and another stating that globally no more than 100 requests per minute can be done (for simplicity, both would reject the request past the limit).

Control-flow sorts the rules by how strict they are, so the user one is applied before the global one. The user has made no requests, so the user rule lets the request pass; then the global one is applied (see above, all matching rules are applied), and maybe globally 100 requests have been done already, so the request is rejected.
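Spelled out with the proposed syntax (the global.ows key is hypothetical, numbers are illustrative):

```properties
# Per-user rate: at most 10 requests per minute for any single user
user.ows=10/m
# Proposed global rate: at most 100 requests per minute for the whole instance
global.ows=100/m
```

Both rules match the incoming request, so both are applied in turn; the per-user rule letting a request through does not stop the global rule from rejecting it.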

About rate limiting being better than limiting parallel execution, I respectfully disagree. A server can serve many little requests (tiles over an empty area, for example) with very little effort, but a few complex requests running in parallel can crush it: it's not the rate that counts, it's what's inside the request, and the rate does not tell apart how hard a request is to serve. I see rate control as somewhat useful for creating different tiers of users based on how much they pay for a service (it needs to be coupled with a concurrency control on the same user), but what ensures that the server can actually stay up and running is the concurrency control.
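For the paying-tiers case, rate and concurrency would be combined on the same user, e.g. (illustrative values, using the existing syntax):

```properties
# Tiered throughput: a user may issue at most 600 requests per hour...
user.ows=600/h
# ...but never more than 4 of them executing at the same time, so a burst
# of heavy requests cannot monopolize the server
user=4
```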

About the second point, reject vs delay: yep, we are on the same page.

aaime commented 5 years ago

Oh, it just occurred to me that we likely need to estimate something else too: distributed counters. On a cluster like maps, the requests are spread among the nodes, hopefully using a "business" policy (the sensible one for OGC services, where some requests are cheap and others very expensive). Right now each GeoServer instance keeps its own counters, so one node might exceed the rate control while the others do not, making what you receive back a Russian-roulette case: one node tells you that you have exceeded the limits, another answers you and tells you that you still have N requests left for the current unit of time.

Generally speaking, in a cluster the "general" OGC concurrency checks still make sense, because they relate to the stability of the single server, but per-user, per-IP, and rate controls should use distributed/shared counters instead.
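To sketch the shared-counter idea (hypothetical names, not an existing control-flow API): the counter lookup could be abstracted so that a clustered deployment plugs in a shared store, while a single instance keeps in-memory counters.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical abstraction: counts requests per key (user, IP, ...) in the current time window. */
interface RequestCounter {
    /** Increments the counter for the given key and returns the updated value. */
    long increment(String key);
}

/** What each instance effectively does today: counters local to the single node. */
class LocalRequestCounter implements RequestCounter {
    private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();

    @Override
    public long increment(String key) {
        return counters.computeIfAbsent(key, k -> new AtomicLong()).incrementAndGet();
    }
}

// A clustered deployment would swap in an implementation backed by a shared store
// (a database, Hazelcast, Redis, ...) so that all nodes see the same count and the
// per-user/per-IP rate limits are enforced consistently across the cluster, while
// the per-instance concurrency checks keep using local counters as noted above.
```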

aaime commented 5 years ago

Thinking out loud... aren't we going in a direction that is maybe better managed by an external API gateway? Control-flow works well for what it was designed for, keeping a single instance from breaking on excessive load. It can be extended to do other stuff, and an external API gateway might have trouble telling apart the types of OGC requests (some are opinionated and handle REST requests only, and I'm guessing none would handle OGC requests without some extensive configuration), but I'd give it some consideration at least.
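To illustrate the "telling apart OGC requests" point: the two requests below hit the same path, and only the query string reveals that the first is a trivial capabilities document while the second is an expensive rendering job, so a generic path-based gateway rule would treat them the same unless it is taught to parse the OGC parameters.

```
http://host/geoserver/ows?service=WMS&version=1.3.0&request=GetCapabilities
http://host/geoserver/ows?service=WMS&version=1.3.0&request=GetMap&layers=...&bbox=...&width=2048&height=2048&format=image/png
```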

simboss commented 5 years ago

Don't go too far, this is just for protection of a single instance.

In most installations we don't control the clients, but the requests we get are somewhat predictable, so we know pretty well what throughput we can handle on a per-instance basis.

Next step is to think about how to differentiate requests where we hit the cache from requests where we don't. I will open a separate discussion.

For the moment I would stick to estimating the following:

- I would want to be able to specify rules like global.ows[.<service>[.<request>[.<outputFormat>]]]=<requests>/<unit>[;<delay>s] which apply globally to GeoServer.
- Specific rules like user.ows[.<service>[.<request>[.<outputFormat>]]]=<requests>/<unit>[;<delay>s] or ip.ows[.<service>[.<request>[.<outputFormat>]]]=<requests>/<unit>[;<delay>s] would apply with the current behavior you described above in your 10-vs-100 example.
- I can use the delay parameter to decide if requests exceeding the limit must be dropped or delayed; I would stick to that.

aaime commented 5 years ago

Well, that makes things easier... much easier actually. I think a rate control based on the OGC prefix (any of them; restricting to just global.ows and not allowing, for example, ows.wms.getmap would not actually make it easier) can be done in one day.
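So the estimate would cover rate keys on any OGC prefix, along these lines (hypothetical keys, since the feature does not exist yet; numbers are illustrative):

```properties
# Whole instance
global.ows=1000/s
# Per service
global.ows.wms=600/s
# Per request type, delaying excess requests instead of failing them
global.ows.wms.getmap=200/s;10s
```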