Open DaveCTurner opened 1 year ago
Pinging @elastic/es-distributed (Team:Distributed)
A couple of questions since this issue came up recently. Would it be enough to add a `shouldRateLimit` method to `ClusterStateTaskExecutor` and `ClusterStateUpdateTask` to identify what is user-facing or not, or is that too coarse? Do we think rate-limiting per task type (`ClusterStateTaskExecutor`/`ClusterStateUpdateTask`) and just failing the rest would do? Is anyone aware of potential traps or corner-cases that a simple approach like that might run into?
I think there are some cases where we want to rate-limit some tasks that relate to a particular executor and not others, e.g. rate-limiting mapping updates to specific indices, so no, I don't think we can just add a flag to the executor. We could perhaps make it task-specific by adding a flag to `ClusterStateTaskListener` though.

It might work to execute some tasks at a set rate (plus some burst capacity), but IME in these situations we really want to reject all the bad tasks until the onslaught stops.
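A minimal sketch of what a per-task flag might look like. The names here (`ClusterStateTaskListenerSketch`, `isUserFacing`, `PutMappingTaskSketch`) are invented for illustration and are not the actual Elasticsearch API; the point is just that the flag lives on the task/listener rather than the executor, so individual task types can opt in:

```java
// Hypothetical sketch, not the real ClusterStateTaskListener interface:
// the flag is per-task, so e.g. mapping updates for one index can be
// throttled while internal tasks on the same executor are not.
interface ClusterStateTaskListenerSketch {
    // Internal tasks are never throttled by default.
    default boolean isUserFacing() {
        return false;
    }
}

class PutMappingTaskSketch implements ClusterStateTaskListenerSketch {
    private final String index;

    PutMappingTaskSketch(String index) {
        this.index = index;
    }

    @Override
    public boolean isUserFacing() {
        // Mapping updates originate from clients, which may retry forever.
        return true;
    }
}
```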
> We could perhaps make it task-specific by adding a flag to `ClusterStateTaskListener` though

To me it seems the only difference this makes is for the batched cluster state update tasks: I guess it only helps in letting batches through where none of the tasks could be rate-limited; otherwise, if one task in the batch gets rate-limited, we'll fail all tasks in the batch?
> but IME in these situations we really want to reject all the bad tasks until the onslaught stops.

Does this mean just a fixed window with a limit on the number of user-facing tasks? (Probably a sliding window.)
> To me it seems the only difference this makes is for the batched cluster state update tasks

I think we wouldn't even count non-throttled tasks when computing the rate of task execution.
> Does this mean just a fixed window with a limit on the number of user-facing tasks? (Probably a sliding window.)

I think it means we would want to count rejected tasks when computing the rate of task execution.
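The two points above can be combined in a small sketch: a sliding-window limiter where every attempted user-facing task counts towards the rate, including the ones we reject. That way a client retrying in a tight loop keeps the limiter saturated, and we keep rejecting until the onslaught genuinely stops. All names here (`UserFacingTaskLimiter`, `tryAcquire`) are illustrative, not the `MasterService` API:

```java
import java.util.ArrayDeque;

// Illustrative sliding-window limiter. Rejected attempts are recorded
// just like accepted ones, so the measured rate stays high while a
// client is spamming retries.
class UserFacingTaskLimiter {
    private final int maxTasksPerWindow;
    private final long windowMillis;
    private final ArrayDeque<Long> attempts = new ArrayDeque<>();

    UserFacingTaskLimiter(int maxTasksPerWindow, long windowMillis) {
        this.maxTasksPerWindow = maxTasksPerWindow;
        this.windowMillis = windowMillis;
    }

    /**
     * Returns true if the task may be enqueued, false if it should be
     * rejected. The attempt is recorded either way.
     */
    synchronized boolean tryAcquire(long nowMillis) {
        // Drop attempts that have fallen out of the window.
        while (attempts.isEmpty() == false && attempts.peekFirst() <= nowMillis - windowMillis) {
            attempts.pollFirst();
        }
        attempts.addLast(nowMillis); // count this attempt even if rejected
        return attempts.size() <= maxTasksPerWindow;
    }
}
```

Note that with this design the limiter only recovers once clients actually back off for a full window; a retry loop never sneaks through just because the accepted-task rate dropped to zero.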
Thanks. I think what I'm trying to understand is what happens to a batch of tasks when there are both "user-facing" and non-"user-facing" tasks in it. Do we have to ensure that all tasks in the batch are successfully executed to consider the resulting state from the batch execution valid? Or can we selectively skip the user-facing ones if we are in throttling mode at that moment?

I don't think I got the second part, though: I wanted to know when we decide to start rejecting tasks that have the flag. I can also open a very crude draft PR and we can discuss there.
If a batch has any non-user-facing tasks then we need to execute it. If that tips us over into a throttling state then we can start to reject further user-facing tasks before we enqueue them into a batch.
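The answer above can be sketched as a small admission policy. This is an invented illustration (`BatchAdmissionSketch` is not real `MasterService` code): throttling is applied at the door, before a user-facing task joins a batch, while any batch already containing a non-user-facing task always runs:

```java
import java.util.List;

// Sketch of the batching interaction described above, under the
// assumption that throttling state is re-evaluated after each batch.
class BatchAdmissionSketch {
    private boolean throttling = false;

    /** Called before a user-facing task is enqueued into a batch. */
    boolean admitUserFacingTask() {
        return throttling == false; // reject before enqueueing while throttled
    }

    /** A batch containing any non-user-facing task must execute. */
    boolean mustExecute(List<Boolean> userFacingFlags) {
        return userFacingFlags.contains(Boolean.FALSE) || throttling == false;
    }

    /** After executing a batch, check whether it tipped us over the limit. */
    void afterExecution(boolean overRateLimit) {
        throttling = overRateLimit;
    }
}
```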
Today it's still all too easy for a relatively light stream of cluster-state-update-triggering user actions to cause starvation at the master service, preventing other cluster state updates from proceeding. Almost always it's one kind of cluster state update happening repeatedly but failing to have the desired effect, causing clients to retry forever. Common culprits include mapping updates and template creation.
A common protection against starvation in a priority scheduler is aging but IMO this doesn't really help us here: it seems pretty much impossible to come up with an aging scheme that allows the other tasks to complete promptly without inverting priorities far too quickly for normal operation.
Instead I think we should consider rate-limiting these user-facing operations, allowing us to detect if we're in a spammy situation and simply reject the work until the spam abates. I believe we'd need to implement this in the `MasterService` in order to take account of batching properly.