Open DaveCTurner opened 1 year ago
Pinging @elastic/es-distributed (Team:Distributed)
A couple of questions since this issue came up recently. Would it be enough to add a `shouldRateLimit` method to `ClusterStateTaskExecutor` and `ClusterStateUpdateTask` to identify what is user-facing or not, or is that too coarse? Do we think rate-limiting per task type (`ClusterStateTaskExecutor`/`ClusterStateUpdateTask`) and just failing the rest would do? Is anyone aware of potential traps or corner-cases that a simple approach like that might run into?
I think there are some cases where we want to rate-limit some tasks that relate to a particular executor and not others, e.g. rate-limiting mapping updates to specific indices, so no, I don't think we can just add a flag to the executor. We could perhaps make it task-specific by adding a flag to `ClusterStateTaskListener` though.

It might work to execute some tasks at a set rate (plus some burst capacity), but IME in these situations we really want to reject all the bad tasks until the onslaught stops.
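A minimal sketch of what a per-task flag might look like. The names here (`ClusterStateTaskListenerSketch`, `isUserFacing`, `PutMappingTaskSketch`) are invented for illustration and are not the actual Elasticsearch API; the point is just that the flag lives on the task/listener rather than the executor, so individual task types can opt in:

```java
// Hypothetical sketch, not the real ClusterStateTaskListener interface:
// the flag is per-task, so e.g. mapping updates for one index can be
// throttled while internal tasks on the same executor are not.
interface ClusterStateTaskListenerSketch {
    // Internal tasks are never throttled by default.
    default boolean isUserFacing() {
        return false;
    }
}

class PutMappingTaskSketch implements ClusterStateTaskListenerSketch {
    private final String index;

    PutMappingTaskSketch(String index) {
        this.index = index;
    }

    @Override
    public boolean isUserFacing() {
        // Mapping updates originate from clients, which may retry forever.
        return true;
    }
}
```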
> We could perhaps make it task-specific by adding a flag to `ClusterStateTaskListener` though

To me it seems the only difference this makes is for the batched cluster state update tasks: I guess it only helps in letting batches through where none of the tasks could be rate-limited; otherwise, if one task in the batch gets rate-limited, we'll fail all tasks in the batch?
> but IME in these situations we really want to reject all the bad tasks until the onslaught stops.

Does this mean just a fixed window with a limit on the number of user-facing tasks? (Probably a sliding window.)
> To me it seems the only difference this makes is for the batched cluster state update tasks

I think we wouldn't even count non-throttled tasks when computing the rate of task execution.
> Does this mean just a fixed window with a limit on the number of user-facing tasks? (Probably a sliding window.)

I think it means we would want to count rejected tasks when computing the rate of task execution.
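The two points above can be combined in a small sketch: a sliding-window limiter where every attempted user-facing task counts towards the rate, including the ones we reject. That way a client retrying in a tight loop keeps the limiter saturated, and we keep rejecting until the onslaught genuinely stops. All names here (`UserFacingTaskLimiter`, `tryAcquire`) are illustrative, not the `MasterService` API:

```java
import java.util.ArrayDeque;

// Illustrative sliding-window limiter. Rejected attempts are recorded
// just like accepted ones, so the measured rate stays high while a
// client is spamming retries.
class UserFacingTaskLimiter {
    private final int maxTasksPerWindow;
    private final long windowMillis;
    private final ArrayDeque<Long> attempts = new ArrayDeque<>();

    UserFacingTaskLimiter(int maxTasksPerWindow, long windowMillis) {
        this.maxTasksPerWindow = maxTasksPerWindow;
        this.windowMillis = windowMillis;
    }

    /**
     * Returns true if the task may be enqueued, false if it should be
     * rejected. The attempt is recorded either way.
     */
    synchronized boolean tryAcquire(long nowMillis) {
        // Drop attempts that have fallen out of the window.
        while (attempts.isEmpty() == false && attempts.peekFirst() <= nowMillis - windowMillis) {
            attempts.pollFirst();
        }
        attempts.addLast(nowMillis); // count this attempt even if rejected
        return attempts.size() <= maxTasksPerWindow;
    }
}
```

Note that with this design the limiter only recovers once clients actually back off for a full window; a retry loop never sneaks through just because the accepted-task rate dropped to zero.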
Thanks. I think what I'm trying to understand is what happens to a batch of tasks when there are both "user-facing" and non-"user-facing" tasks in it. Do we have to ensure that all tasks in the batch are successfully executed to consider the resulting state from the batch execution valid? Or can we selectively skip the user-facing ones if we are in throttling mode at that moment?

I don't think I got the second part, though: I wanted to know when we decide to start rejecting tasks that have the flag. I can also open a very crude draft PR and we can discuss there.
If a batch has any non-user-facing tasks then we need to execute it. If that tips us over into a throttling state then we can start to reject further user-facing tasks before we enqueue them into a batch.
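The answer above can be sketched as a small admission policy. This is an invented illustration (`BatchAdmissionSketch` is not real `MasterService` code): throttling is applied at the door, before a user-facing task joins a batch, while any batch already containing a non-user-facing task always runs:

```java
import java.util.List;

// Sketch of the batching interaction described above, under the
// assumption that throttling state is re-evaluated after each batch.
class BatchAdmissionSketch {
    private boolean throttling = false;

    /** Called before a user-facing task is enqueued into a batch. */
    boolean admitUserFacingTask() {
        return throttling == false; // reject before enqueueing while throttled
    }

    /** A batch containing any non-user-facing task must execute. */
    boolean mustExecute(List<Boolean> userFacingFlags) {
        return userFacingFlags.contains(Boolean.FALSE) || throttling == false;
    }

    /** After executing a batch, check whether it tipped us over the limit. */
    void afterExecution(boolean overRateLimit) {
        throttling = overRateLimit;
    }
}
```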
Today it's still all too easy for a relatively light stream of cluster-state-update-triggering user actions to cause starvation at the master service, preventing other cluster state updates from proceeding. Almost always it's one kind of cluster state update happening repeatedly but failing to have the desired effect, causing clients to retry forever. Common culprits include mapping updates and template creation.
A common protection against starvation in a priority scheduler is aging but IMO this doesn't really help us here: it seems pretty much impossible to come up with an aging scheme that allows the other tasks to complete promptly without inverting priorities far too quickly for normal operation.
Instead I think we should consider rate-limiting these user-facing operations, allowing us to detect if we're in a spammy situation and simply reject the work until the spam abates. I believe we'd need to implement this in the `MasterService` in order to take account of batching properly.