elastic / kibana

[Task Manager] Shift mechanism can cause a cascade throughout a cluster #88369

Open gmmorris opened 3 years ago

gmmorris commented 3 years ago

In 7.11 we've introduced a self-balancing mechanism into Task Manager so that multiple Kibana instances can detect when their task claiming is causing version conflicts and shift their polling mechanism to avoid this.

While this has helped by improving the performance of the Alerting Framework, it has also introduced a new problem: a Task Manager that shifts can clash with other TMs that were running fine. When there is a large number of TMs (32 Kibana instances, for example) this can lead to a cascade of shifts across many instances.

We need to experiment with other approaches to shifting in order to reduce the likelihood of this. Perhaps we could raise the average threshold, or avoid a shift if conflicts were lower not that long ago, encouraging the recently shifted TM to shift again rather than causing a cascade in which they all shift.
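
For readers less familiar with the mechanism, here is a minimal sketch of the kind of self-balancing loop described above; the names, threshold, and random shift are illustrative assumptions rather than the actual Task Manager implementation:

```ts
// A minimal sketch of the self-balancing idea described above. All names and
// numbers here are illustrative assumptions, not the actual Kibana Task Manager code.

interface PollingState {
  intervalMs: number;    // base polling interval shared by all Task Managers
  phaseOffsetMs: number; // current offset ("shift") within that interval
}

// Hypothetical threshold: shift when the median version-conflict count per
// recent polling cycle exceeds this value.
const CONFLICT_THRESHOLD = 5;

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)] ?? 0;
}

// Decide whether to shift, and by how much, based on recent conflict counts.
function maybeShift(state: PollingState, recentConflicts: number[]): PollingState {
  if (median(recentConflicts) <= CONFLICT_THRESHOLD) {
    return state; // claiming looks healthy; keep the current phase
  }
  // Pick a new random phase within the interval, hoping to land in a quieter slot.
  const shift = Math.random() * state.intervalMs;
  return {
    ...state,
    phaseOffsetMs: (state.phaseOffsetMs + shift) % state.intervalMs,
  };
}

// Example: a TM polling every 3s that keeps hitting conflicts picks a new phase.
console.log(maybeShift({ intervalMs: 3000, phaseOffsetMs: 0 }, [7, 9, 6, 8]));
```

The cascade arises because every instance runs this same loop independently: one TM's shift can push conflicts onto a previously quiet TM, which then shifts too.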

We should also add telemetry around this so we can get an idea of how this behaves out in the wild.

elasticmachine commented 3 years ago

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

gmmorris commented 3 years ago

A couple of possible directions of investigation came to mind:

  1. Perhaps we can change the mechanism so that the average version-clash level required to shift is different for a Task Manager that has already shifted in the past few cycles than for a Task Manager that's experiencing clashes for the first time in a while? This would bias toward continued shifting of one TM until it finds a good slot, instead of causing a cascade where both TMs shift. (See the sketch after this list.)

  2. Perhaps we can use a mechanism like the back pressure we apply when TM experiences 429 errors? If version clashes are high we actually slow the polling interval down (make it longer) until clashes reduce. The downsides are that you might end up with all TMs spaced out to the point where the rate of task claiming is too slow, and it becomes harder to reason about the system as a whole because you could have wide variance in polling interval across the cluster.
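
A rough sketch of what idea 1 could look like, purely illustrative (the thresholds, window size, and names are assumptions, not Task Manager code): a TM that shifted within the last few cycles uses a much lower conflict threshold than one that has been stable, so the recently shifted TM keeps hunting for a free slot while its stable neighbours stay put.

```ts
// Illustrative only: the thresholds, window size, and names are assumptions,
// not the actual Task Manager implementation.

interface ShiftHistory {
  cyclesSinceLastShift: number; // how many polling cycles ago this TM last shifted
}

// A TM that shifted recently gets a low bar (keep hunting for a free slot);
// a TM that has been stable gets a high bar (resist being dislodged).
const RECENTLY_SHIFTED_THRESHOLD = 2;
const STABLE_THRESHOLD = 10;
const RECENT_WINDOW_CYCLES = 5;

function shouldShift(avgVersionConflicts: number, history: ShiftHistory): boolean {
  const threshold =
    history.cyclesSinceLastShift <= RECENT_WINDOW_CYCLES
      ? RECENTLY_SHIFTED_THRESHOLD
      : STABLE_THRESHOLD;
  return avgVersionConflicts > threshold;
}

// The TM that just moved keeps moving; the long-stable TM stays put.
console.log(shouldShift(4, { cyclesSinceLastShift: 1 }));  // true
console.log(shouldShift(4, { cyclesSinceLastShift: 50 })); // false
```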

It's worth implementing both and testing them against a series of perf tests on cloud... but that's hard, as cloud doesn't easily support deployments from anything other than main.

pmuellr commented 3 years ago

IIRC, we are only shifting "forward" (adding a delay). I suspect it won't actually help, but I wonder if we'd consider shifting "backwards" (running a little sooner) as well. I don't think we'd want to "undelay" much, like no more than a second, and so I suspect this will have little effect.

gmmorris commented 3 years ago

> IIRC, we are only shifting "forward" (adding a delay). I suspect it won't actually help, but I wonder if we'd consider shifting "backwards" (running a little sooner) as well. I don't think we'd want to "undelay" much, like no more than a second, and so I suspect this will have little effect.

If you have a 3s interval, what's the difference between shifting by -1s and shifting by 2s? Isn't it the same? 🤔 Since all TMs are running at a 3s interval... in my head those sound the same...
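
For what it's worth, the equivalence is just modular arithmetic: with a shared 3s cycle, a -1s shift and a +2s shift land on the same phase, since -1 and 2 are congruent modulo 3. A tiny illustrative check (not Kibana code):

```ts
// Tiny check of the phase argument: with a shared 3s cycle, a -1s shift and a
// +2s shift put the next poll at the same position within the cycle.
const intervalMs = 3000;

// Normalize an offset into the [0, interval) range, handling negative offsets.
const phase = (offsetMs: number) => ((offsetMs % intervalMs) + intervalMs) % intervalMs;

console.log(phase(-1000)); // 2000
console.log(phase(2000));  // 2000 -- same slot in the cycle
```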

pmuellr commented 3 years ago

> If you have a 3s interval, what's the difference between shifting by -1s and shifting by 2s? Isn't it the same?

Yeah, there is that :-). I guess I was thinking that if we only ADD delays, then (I think) we are adding a bit of latency somehow. Maybe if we also went backwards in time instead of forward, sometimes, and in very small increments, it would have the same effect of "spreading things out" without any additional latency. I think, though, the way we are implementing the "shift" now, it's not really possible to go "backwards" in time anyway.

mikecote commented 3 years ago

Note from triage: this issue needs some research on what we should do to solve the problem and review it with the team.

gmmorris commented 3 years ago

I'm putting this on hold until I can test the result of this PR: https://github.com/elastic/kibana/pull/88210. Local experimentation shows that the cascading actually drops thanks to some of that cleanup, so I want to test this on cloud at scale.

gmmorris commented 3 years ago

> I'm putting this on hold until I can test the result of this PR: #88210. Local experimentation shows that the cascading actually drops thanks to some of that cleanup, so I want to test this on cloud at scale.

Sadly this didn't have much of an impact. It looks like it might have reduced shifting a tad, but nothing impactful.

While running that additional cloud test, I ran a local experiment with 8 Kibana instances in parallel and a 500ms polling interval. It was quite easy to recreate this issue locally and visibly see the conflicts and the result of the overzealous shifting.

I tried a little hack in the shifting mechanism that keeps the mechanism as is but adds one small change: in addition to the p50 indicator, we also calculate the trend by comparing the average version_conflicts of the last few cycles to that of the few before them, and avoid shifting if the trend is downwards. Locally I saw this reduce the shifting dramatically, and after a minute of shifting back and forth all the Kibana instances settled on a point in time where, for the most part, they were achieving a version_conflicts rate below our threshold.
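
For illustration, the shape of that hack could look something like the sketch below; the window size and names are assumptions, not the actual change. The existing p50 threshold check stays, but the shift is skipped whenever the average version_conflicts of the last few cycles is already lower than the average of the few before them.

```ts
// Illustrative sketch only; the window size and names are assumptions, not the
// actual change that was tested.

const WINDOW = 3; // compare the last 3 cycles against the 3 before them (assumed size)

const average = (xs: number[]) =>
  xs.length === 0 ? 0 : xs.reduce((sum, x) => sum + x, 0) / xs.length;

// `conflictHistory` holds the version_conflicts count per polling cycle, oldest first.
function trendIsDownward(conflictHistory: number[]): boolean {
  if (conflictHistory.length < WINDOW * 2) return false;
  const recent = conflictHistory.slice(-WINDOW);
  const previous = conflictHistory.slice(-WINDOW * 2, -WINDOW);
  return average(recent) < average(previous);
}

// The existing p50-based decision, augmented with the trend check: even if conflicts
// are above the threshold, hold off on shifting while they are already trending down.
function shouldShift(p50Conflicts: number, threshold: number, history: number[]): boolean {
  return p50Conflicts > threshold && !trendIsDownward(history);
}

// Conflicts still above the threshold (p50 = 6 > 5) but dropping cycle over cycle: no shift.
console.log(shouldShift(6, 5, [12, 11, 10, 7, 6, 5])); // false
```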

Taking into account @bmcconaghy's concerns I don't want to spend more time on this research, but I do think this small change is worth implementing and testing on cloud at higher numbers. Would love to hear thoughts.

bmcconaghy commented 3 years ago

Sounds good to me so long as the change is small and simple. I do think the long term solution to this is some form of Kibana clustering/coordination.

gmmorris commented 3 years ago

> Sounds good to me so long as the change is small and simple. I do think the long term solution to this is some form of Kibana clustering/coordination.

Yeah, absolutely, the goal here is to make sure the existing mechanism reduces unnecessary noise, but the long term solution is definitely going to require some form of coordination between nodes.

ymao1 commented 2 years ago

@mikecote @kobelb Is there value in keeping this issue open, since we are aware of the upper bound on the number of Kibanas that can run in parallel and it seems like we should address that larger issue instead of this one?

kobelb commented 2 years ago

To be honest, I'm indifferent :) I see some benefit from having this behavior documented, but I agree that it likely isn't actionable in isolation.

mikecote commented 2 years ago

I also agree to keep the issue open. It wouldn't hurt to document this behaviour, as the TM health API exposes information about this. It wouldn't hurt to also gather telemetry to understand the urgency of solving this in a larger manner.