elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Alerting] Investigate resilience / side effects of excessively long running rules #111259

Open gmmorris opened 3 years ago

gmmorris commented 3 years ago

We've seen certain rules run for excessively long durations, far exceeding what we had expected. For example, we've seen large customers experience execution durations of well over 10 minutes (our general expectation was that rules would run for several seconds, definitely not several minutes).

This is concerning, as such behaviour could have side effects we're not aware of. We should investigate the possible implications of this and track how often it happens using logging and telemetry.

What might be happening?

Task Manager is configured to assume that a task (in this case, a rule task) has "timed out" after 5 minutes, at which point it tries to run it again.

Due to the distributed nature of Kibana instances, the attempt to rerun the rule task might be picked up by any Kibana instance (not necessarily the Kibana that ran it before). As we have no way of knowing whether a rule is still running or whether it has crashed and "timed out", we assume it has failed (as we had not expected a healthy rule task to ever run this long) and try to rerun it.

At that point, if the rule is in fact still running (and has simply exceeded 5 minutes), we likely end up with two instances of the same rule running in parallel. We aren't sure what the side effect of this might be, but likely one instance will end up overwriting the result of the other - this is an unhealthy state that would likely have unexpected consequences and shouldn't happen.

Additionally, it's worth noting that most of the time this execution duration far exceeds the interval configured by the customer. This means that a rule might be configured to run every 1 minute, but ends up running every 11 minutes, or worse, at some random interval above 10 minutes.

What should we do?

Feel free to change this list of actions, but this is what I think we should do off the top of my head:

Thoughts on guardrails

Adding guardrails around this at framework level is very difficult, but presumably our goal is to reduce the likelihood of this kind of thing happening.

Directions worth exploring:

  1. Can we tailor the retryAt of specific rule types to some kind of cluster-level average? We track p90/p99 of each task type in Task Manager's health stats anyway; perhaps we can use that (when it tends to exceed 5m)?
  2. Can we prevent rule types that run that long from ever being released into production? We know these rules didn't run that long in Dev/Test, but they do run this long on large datasets in Prod, so how can we catch that sooner?
  3. Can we use some kind of preview mechanism to test the rule execution time during the rule creation step? I don't want to block a user for too long when creating the rule, but perhaps we can offer them the option to try it and that would surface the long execution time sooner?
  4. Can we cancel the rule when it exceeds the 5m execution time? We know we can't (currently) cancel the ES/SO queries performed by the implementation, but perhaps we can cancel it at framework level so that its result doesn't actually get saved at framework level?
  5. Can we prevent a customer from setting an interval that is shorter than the time it took to run the preview (if we have a preview)?
  6. Can we prevent a customer from setting an interval that is shorter than the average time it takes to run these rule types in this cluster?
  7. Can we add UX in the rule creation flyout that tells users what the average execution time of a rule type is and warn them when their interval is below that average?

I'm sure there are more directions, but these feel like a good starting point.

elasticmachine commented 3 years ago

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

ymao1 commented 3 years ago

Task Manager is configured to assume that a task (in this case, a rule task) has "timed out" after 5 minutes, at which point it tries to run it again.

Due to the distributed nature of Kibana instances, the attempt to rerun the rule task might be picked up by any Kibana instance (not necessarily the Kibana that ran it before). As we have no way of knowing whether a rule is still running or whether it has crashed and "timed out", we assume it has failed (as we had not expected a healthy rule task to ever run this long) and try to rerun it.

Validate the current behaviour (the above mentioned scenario was validated, but we haven't investigated the consequences yet) and update this issue

The description of the current behavior provided above is accurate. For recurring tasks (like all rules), task manager sets the retryAt date to whichever is larger: the configured schedule interval or 5 minutes. Note that this 5 minutes is the default timeout specified by task manager but it would be straightforward for the alerting framework to override this default value and even allow individual rule types to specify their own timeout value.
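
Expressed as a sketch (illustrative only, not the actual Task Manager code):

```ts
// retryAt for a recurring task: claim time plus whichever is larger, the
// configured schedule interval or the task type's timeout (default 5m).
function computeRetryAt(
  claimedAt: Date,
  scheduleIntervalMs: number,
  timeoutMs: number = 5 * 60 * 1000
): Date {
  return new Date(claimedAt.getTime() + Math.max(scheduleIntervalMs, timeoutMs));
}
```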

Another thing to note: while the description above specifically calls out multiple Kibana instances, this behavior occurs whether there is one Kibana instance or many.

The task claiming logic queries for tasks where status=idle & runAt has passed OR status=running|claiming & retryAt has passed.
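
As a simplified illustration, those claiming criteria could be expressed as a bool query roughly like this (a sketch, not the exact query Task Manager generates):

```ts
// Simplified sketch of the task-claiming criteria described above; field
// names mirror the task saved object but this is not the exact query.
const claimableTasksQuery = {
  bool: {
    should: [
      {
        // idle tasks whose scheduled runAt has passed
        bool: {
          must: [
            { term: { 'task.status': 'idle' } },
            { range: { 'task.runAt': { lte: 'now' } } },
          ],
        },
      },
      {
        // running/claiming tasks whose retryAt has passed (assumed timed out)
        bool: {
          must: [
            { terms: { 'task.status': ['running', 'claiming'] } },
            { range: { 'task.retryAt': { lte: 'now' } } },
          ],
        },
      },
    ],
  },
};
```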

Suppose a rule is configured to run every minute and consistently takes 10 minutes to run. The following would occur:

0:00 - Task manager claims task and kicks off execution. `retryAt` is set to 0:05.
0:01 - task instance is `running` and `retryAt` has not passed, task is not claimed
0:02 - task instance is `running` and `retryAt` has not passed, task is not claimed
0:03 - task instance is `running` and `retryAt` has not passed, task is not claimed
0:04 - task instance is `running` and `retryAt` has not passed, task is not claimed
0:05 - task instance is `running` and `retryAt` **has** passed; Task manager claims task and kicks off another task execution, `retryAt` is set to 0:10
0:06 - task instance is `running` and `retryAt` has not passed, task is not claimed
0:07 - task instance is `running` and `retryAt` has not passed, task is not claimed
0:08 - task instance is `running` and `retryAt` has not passed, task is not claimed
0:09 - task instance is `running` and `retryAt` has not passed, task is not claimed
0:10 - execution from 0:00 completes and status is set to `idle`. The `runAt` is set to the execution startTime + schedule, so it would be set to 0:01 and claimed during the next task manager polling cycle.

This causes a rule that was supposed to run every minute to actually run every ~5 minutes.

To further complicate things, a rule execution can return state which many rule types use in the next execution. If a rule exceeds the timeout only sporadically rather than consistently, this can cause race conditions and lead to unexpected results, possibly false positives or negatives.

For example the ES query rule type passes the timestamp of the last matched document between rule executions. If a particular rule execution took a particularly long time:

0:10 - execution 1: state.previousTimestamp = undefined, query for all matches, return 0:09 as previousTimestamp for next execution
0:11 - execution 2 (takes 10 minutes to run): state.previousTimestamp = 0:09, query for all matches where timestamp > 0:09.....
0:12 - execution 3: state.previousTimestamp is still 0:09 from execution 1, query for all matches where timestamp > 0:09
...
0:19 - execution 10: state.previousTimestamp is 0:16 from execution 9, query for all matches where timestamp > 0:16
0:20 - execution 2 finally completes and returns 0:10 as previousTimestamp for the next execution
0:20 - execution 11: state.previousTimestamp is now 0:10 from the late completion of execution 2; query for all matches where timestamp > 0:10 --> This leads to duplicate match notifications :(
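
A minimal sketch of that state hand-off (illustrative only; the helpers are hypothetical stand-ins, not the actual ES query rule code):

```ts
// Each execution receives the state returned by the previous one. When a slow
// execution finally returns, its stale previousTimestamp replaces the newer
// value written by executions that completed in the meantime, so documents
// that already triggered notifications match again on the next run.
interface EsQueryRuleState {
  previousTimestamp?: string;
}

interface MatchedDoc {
  timestamp: string;
}

// Hypothetical helpers standing in for the real search / action scheduling.
declare function queryForDocsNewerThan(from: string): Promise<MatchedDoc[]>;
declare function notifyOnMatches(docs: MatchedDoc[]): void;

async function executor(previousState: EsQueryRuleState): Promise<EsQueryRuleState> {
  const from = previousState.previousTimestamp ?? '1970-01-01T00:00:00.000Z';
  const matches = await queryForDocsNewerThan(from);
  notifyOnMatches(matches);
  const lastMatch = matches[matches.length - 1];
  return {
    previousTimestamp: lastMatch ? lastMatch.timestamp : previousState.previousTimestamp,
  };
}
```
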
ymao1 commented 3 years ago

It seems to me like there are 2 issues here:

  1. Sometimes a rule type is running unexpectedly long (too much data or the query is not optimized) and we want to be able to identify those unexpectedly long running rules so that improvements can be made. This can be done using telemetry to keep track of rule execution durations by rule type id.
  2. Some rule types might be expected to take a long time to run. In these cases, we want to provide a better UX to try to protect the user from themselves when configuring the rule (warn when setting too short of a schedule, allow a preview capability to try out the execution to see how long it might take, etc). However, it may be that the user wants to have a rule running every minute that might take 6 minutes to complete.

I think in scenario (1), we definitely want to avoid having Kibana kick off a second version of a rule when the first hasn't finished running, because we don't want to overload an already possibly overloaded cluster with another long-running rule. The second case is less clear to me. If we add the appropriate UX changes to make the user aware of the performance of the rule and they configure it aggressively anyway, should we let them do it? Or not, because they don't understand the full impact of having multiple instances of a rule running in parallel and possibly updating the state out of order?

ymao1 commented 3 years ago

When the task manager is claiming tasks, it is looking for tasks where status=idle & runAt has passed OR status=running|claiming & retryAt has passed.

For recurring tasks, the first clause will claim those tasks whose previous runs have completed successfully. The second clause tries to handle the case where Kibana gets restarted or crashes mid task execution. The logic being that if a task still shows as running after the configured timeout, it must not actually be running.

I am wondering if we can pass a heartbeat callback into task manager tasks that when triggered updates the retryAt to a new value because it knows the rule is still running. Very rough POC here. The interval of the heartbeat could be configured based on the task definition timeout (so if the timeout is 5m, we could set the heartbeat interval to 4m). When the heartbeat is received, the task document is updated with a new retryAt. If Kibana crashes, the retryAt will stop getting updated and the logic would be the same as before. In this way, we can differentiate between a task that says it is running but actually isn't and a task that is genuinely still running.
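
To make the idea concrete, a very rough sketch of the shape this could take (hypothetical names, not the actual POC code):

```ts
// Task Manager hands the task runner a callback that pushes retryAt forward
// while the task is demonstrably still alive. If the Kibana process dies, the
// heartbeats stop and retryAt expires exactly as it does today.
async function runWithHeartbeat(
  taskId: string,
  timeoutMs: number,
  run: () => Promise<void>,
  updateRetryAt: (id: string, retryAt: Date) => Promise<void> // hypothetical TM helper
): Promise<void> {
  // e.g. with a 5m timeout, heartbeat roughly every 4m
  const heartbeatIntervalMs = Math.max(timeoutMs - 60_000, Math.floor(timeoutMs * 0.8));
  const heartbeat = setInterval(() => {
    void updateRetryAt(taskId, new Date(Date.now() + timeoutMs));
  }, heartbeatIntervalMs);
  try {
    await run();
  } finally {
    clearInterval(heartbeat);
  }
}
```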

pmuellr commented 3 years ago

Can we cancel the rule when it exceeds the 5m execution time? We know we can't (currently) cancel the ES/SO queries performed by the implementation, but perhaps we can cancel it at framework level so that its result doesn't actually get saved at framework level?

We could have the framework recognize when the rule is "cancelled" by task manager, and write an event log entry indicating it went over the timeout, and try to nullify the rule's later effects, like queuing actions, updating state, etc. What about rule registry - guessing we don't want that updated either.

We know we can't (currently) cancel the ES/SO queries performed by the implementation

Do we? ES will cancel searches if the connection that started them closes. Could we close the connection somehow? Or provide a default (maybe not overrideable?) timeout, which we'd conveniently set to a value near the task timeout. There's probably something we can do ...

To further complicate things, a rule execution can return state which many rule types use in the next execution. If a rule exceeds the timeout only sporadically rather than consistently, this can cause race conditions and lead to unexpected results, possibly false positives or negatives.

Ya, easy to imagine. I think we'll need to make sure this can't happen. And we should give rules which are being cancelled the opportunity of returning new state to be stored; that will be the reward for implementing timeout cancellation :-)

am wondering if we can pass a heartbeat callback into task manager tasks that when triggered updates the retryAt to a new value because it knows the rule is still running.

Ya, interesting idea. What if the rule is in an endless loop though? Second timeout? Gets complicated. :-). And it feels more important to stop "things" at a timeout first, before we consider letting them be extended. But it's a good idea to allow for extra slop to handle short-lived busy periods. Open an issue for later?

this 5 minutes is the default timeout specified by task manager but it would be straightforward for the alerting framework to override this default value and even allow individual rule types to specify their own timeout value.

+1

We'd probably want to implement this if we figure out how to "cancel" rules at timeouts, to provide some user control over it. And we'll need diagnostics and/or UI.

mikecote commented 3 years ago

To further complicate things, a rule execution can return state which many rule types use in the next execution. If a rule exceeds the timeout only sporadically rather than consistently, this can cause race conditions and lead to unexpected results, possibly false positives or negatives.

I wonder if we can rely on conflicts (409 errors) to prevent this from happening? Like capturing the SO version when starting to run the task and using it to update the task SO after running. We'd have to ensure there are no supported mutations to the task SO while they're running (which I don't think we have).

In the case of a conflict, it would mean the task timed out and another (or the same) Kibana re-claimed the SO, changed the version and caused the previous execution to fail (by throwing 409s).
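
As a sketch of the idea (the saved objects client shape here is assumed; this is not the actual Task Manager update path):

```ts
// Optimistic concurrency: remember the task saved object's version at claim
// time and pass it back on the post-run update, so a stale execution's write
// fails with a 409 conflict instead of clobbering newer state.
interface MinimalSavedObjectsClient {
  update: (
    type: string,
    id: string,
    attributes: Record<string, unknown>,
    options?: { version?: string }
  ) => Promise<unknown>;
}

async function updateTaskAfterRun(
  savedObjectsClient: MinimalSavedObjectsClient,
  taskId: string,
  claimedVersion: string,
  attributes: Record<string, unknown>
): Promise<unknown> {
  // If another Kibana re-claimed the task after it timed out, the version has
  // changed and this update throws a conflict (409), which we treat as
  // "this execution's result is no longer authoritative".
  return savedObjectsClient.update('task', taskId, attributes, { version: claimedVersion });
}
```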

In the framework, as a stepping stone, to ensure it doesn't do anything funky in timeout cases, we could capture whether the task got cancelled during the executor call and prevent scheduling actions, etc. The next step could be to ensure the executors actually stop.

gmmorris commented 3 years ago

Love y'all. Great write up and POC @ymao1, and great discussion and ideas @pmuellr and @mikecote :)

A couple of thoughts:

@ymao1

This can be done using telemetry to keep track of rule execution durations by rule type id.

It would be worthwhile collecting all the telemetry we feel we're missing and slipping it into the 7.16 release. I assume anything in TM itself will have to wait until 8.x, but if we can at least get the telemetry out there, then we'll have a baseline to work off of when measuring how successfully we've addressed the problem in 8.x.

Please add any telemetry you identify to this issue so that we can fit it in. :)

@ymao1

Note that this 5 minutes is the default timeout specified by task manager but it would be straightforward for the alerting framework to override this default value and even allow individual rule types to specify their own timeout value.

@pmuellr

We'd probably want to implement this if we figure out how to "cancel" rules at timeouts, to provide some user control over it.

Absolutely true on both points. 👍

In relation to users: I would be cautious about offering users a way to customise it unless we can guarantee that it really will be cancelled. IIRC the ES and SO clients do not support cancellation. It might be that they can be enhanced to support it, but I don't think this is available just yet. If we can add that (or get the ES clients / core teams to do it), and we can automatically cancel any call made by a rule when it exceeds the timeout, then I'd feel comfortable with users customising. But otherwise we'll just end up with SDHs about rules that the user expected to cancel after 2 minutes but didn't.

In relation to rule type implementers: It's worth having a conversation with implementers, but I suspect most of them will expect their rule to complete quickly and not take 10+ minutes. I'm not sure how we do it, but I think the timeout should adapt to the circumstance. If a rule is configured to run once a minute, then a 10 minute execution is bad. But if the rule runs once a day, at midnight, and takes 10 minutes, I don't think anyone cares as long as it isn't overloading Kibana or ES. We need to figure out how to take that into account, but I don't think that's simple at all and probably warrants a wider discussion with tech leads and rule type implementers (RFC perhaps? We don't have good documentation of how this all works so perhaps an RFC could become a good source of truth for all of this?).

@ymao1

I am wondering if we can pass a heartbeat callback into task manager tasks that when triggered updates the retryAt to a new value because it knows the rule is still running.

I love the idea... but I'm racking my brain wrt failure cases and the additional overhead.

@pmuellr

What if the rule is in an endless loop though?

I'm not too worried about endless loops as we have a "self healing" mechanism for that: the event loop and CPU get blocked and all of Kibana exhibits problems, forcing us to address the issue. Ideally we'd catch this at the testing stage, but as we now know, many of these "blocking" rules don't reveal themselves until they are run against large and complex datasets (which we can address in testing infra, but I wouldn't invest time on that in the framework runtime just yet).

@pmuellr

We could have the framework recognize when the rule is "cancelled" by task manager, and write an event log entry indicating it went over the timeout,...

Yes please :) This should appear in the rule monitoring alongside drift and other health indicators a user/admin will want to know about.

@mikecote

I wonder if we can rely on conflicts (409 errors) to prevent this from happening? Like capturing the SO version when starting to run the task and using it to update the task SO after running. We'd have to ensure there are no supported mutations to the task SO while they're running (which I don't think we have).

We don't have anything mutating tasks other than TM, and that was intentional, but I think it's only a matter of time, as richer access to the state might be a prerequisite for some complex rule type implementations. For example, in discussing the Metric and Log Threshold rule types with @jasonrhodes and @Kerry350 and rethinking their "look back" implementation, we theorised using state to track past alert instances and using that to detect when something is missing. We might need richer access to that state to remove things that were tracked before but that the user wants to forget. This is just a heads up to think about it, not a concrete requirement. :)

Another thought on 409s: I'm not sure how this will interact with the updateByQuery, so that will require further research, but definitely worth exploring.


phew... lots of thoughts.

Great work so far! I'm really looking forward to seeing the broken down issues we'd like to explore further. Feel free to add smaller issues (Event Log events and adding missing telemetry) into 7.16 if we can, as we need more observability into our framework in order to effectively tackle many of these issues well. But let's assume anything larger is not going to be picked up and tackled until 8.1+; I'm working on clearing up capacity in the team to focus on these things.

pmuellr commented 3 years ago

phew... lots of thoughts

Yup. +1 on an RFC. I suspect there are some "quick wins" - more notification-y stuff when timeouts occur (telemetry, health, event log), but there's definitely a lot of work we're going to have to do to figure out the right long-term stuff - RFC seems right.

ymao1 commented 3 years ago

Thanks for all the input! After research and some POCs, here's what I've come up with:

We should be trying to handle both the case where a rule is taking an unexpectedly long time to execute due to cluster load (in which case we should be taking steps to prevent multiple parallel executions) AND the case where a rule is expected to take a long time to execute.


Rule always takes a long time to execute

In this scenario, a particular rule type always takes a long time to execute. This may only be true within a particular cluster, where rule type A, which looks performant in development, always takes upwards of 10 minutes to execute based on the volume and characteristics of the data it is processing. In this case, our goal as a framework is to surface these durations to the user and to provide rule type producers with more tools to react to these scenarios.

In the near term, I propose we do the following:

1. Allow rule types to pass in their own timeout values (#111804). The alerting framework can provide a way for rule types to override the default timeout value of 5m. It will be up to the individual rule types to provide override values. Perhaps these values could be exposed in the kibana config so they can be changed quickly when a rule type is consistently taking a long time to run on a cluster. (See the sketch after this list.)

2. Allow rule types to set default and minimum schedule intervals (https://github.com/elastic/kibana/issues/51203). This is an old issue that seems very relevant now.

3. Surface execution durations to the user in the stack management UI (https://github.com/elastic/kibana/issues/111805; POC for surfacing execution durations). Since we now record execution durations in the event log, we should surface this information to the end user by writing the last duration to the rule execution status (so it exists in the rule saved object). On the rules list page, we could show a warning if the last duration for a rule exceeds a certain threshold, or if it exceeds the scheduled interval by a lot. In addition, on the rule details page, we could return statistics (min/max/avg) about the last N execution durations as part of the getAlertInstanceSummary route. Again, if the average duration exceeds the scheduled interval by a lot, we could surface a warning for the user to adjust their rule configuration.
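
For item 1, a sketch of what a per-rule-type timeout could look like (the property name is an assumption, not necessarily what #111804 ships):

```ts
// A rule type declaring its own execution timeout, which the alerting
// framework would pass through to Task Manager in place of the 5m default.
interface RuleTypeTimeoutConfig {
  id: string;
  name: string;
  ruleTaskTimeout?: string; // e.g. '10m'; falls back to the framework default when omitted
}

const esQueryRuleType: RuleTypeTimeoutConfig = {
  id: '.es-query',
  name: 'Elasticsearch query',
  ruleTaskTimeout: '10m',
};
```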


Rule occasionally takes a long time to execute

In this scenario, the long rule execution is an aberration, and our goal as a framework should be to make sure that these aberrant executions are cancellable, or, if they are not cancellable and return long past the attempted cancellation, that their state does not overwrite the state of other executions that started and completed while they were running.


In the near term, I propose we do the following:

1. Provide framework-level support for trying to cancel rule executions. POC for cancel behavior (issue to come). The alerting framework should implement the task manager's cancel interface so that when a task times out, it can make a best effort to cancel. At a framework level, we can provide rule executors with a cancellation token and a cancellable search (see the note below and the sketch after this list).

Note that while we can provide a cancellableSearch and a cancellation token, it will be up to each individual rule type to make use of them to short circuit rule execution.

2. Ensure that when a long-running rule returns, its outdated state does not override state from subsequent executions. POC for skipping update after task is cancelled (issue to come). When a task times out, task manager currently tries to cancel the task if a cancel function is provided. If no method for cancellation is provided, the task will just continue running and update the task's state when it is done, whenever that may be. At a task manager level, we should be ensuring that the state returned by a cancelled task is not persisted. Since the task was cancelled, we should consider any result of the task invalid. [1]

3. Provide telemetry to determine how often we see these long running rules - Added this to the general alerting services telemetry issue
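
A combined sketch of items 1 and 2 above (names are assumed; this is not the POC code):

```ts
// The alerting framework implements Task Manager's cancel hook by aborting an
// AbortController shared with the rule executor; Task Manager, for its part,
// skips persisting state returned by an execution it has already cancelled.
interface RuleTaskRunResult {
  state: Record<string, unknown>;
}

class AlertingTaskRunner {
  private cancelled = false;
  private readonly abortController = new AbortController();

  constructor(
    private readonly executor: (signal: AbortSignal) => Promise<Record<string, unknown>>
  ) {}

  async run(): Promise<RuleTaskRunResult | undefined> {
    const state = await this.executor(this.abortController.signal);
    // If cancel() fired while the executor was still running, treat the
    // result as invalid and return nothing so stale state is never persisted.
    return this.cancelled ? undefined : { state };
  }

  async cancel(): Promise<void> {
    this.cancelled = true;
    this.abortController.abort(); // best effort: aborts any search wired to the signal
  }
}
```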


Longer term, I propose we investigate the following:


[1] As an alternative, I also investigated using versions to control task updates if we want to allow "cancelled" tasks to update state as long as no other Kibana has picked up the task before the "cancelled" task returns. This is also a viable option but makes the task manager logic a little more complicated (even more than it already is!)

ymao1 commented 3 years ago

I created issues for the items under Rule always takes a long time to execute but I think it might make sense to write up an RFC for the items under Rule occasionally takes a long time to execute (other than telemetry). WDYT @mikecote?

gmmorris commented 3 years ago

Great observations and recommendations @ymao1 ! :elasticheart:

Note that while we can provide a cancellableSearch and a cancellation token, it will be up to each individual rule type to make use of them to short circuit rule execution.

Can we enforce this somehow? Or at least nudge teams into doing it? If we can detect this sort of thing at compile time that's obviously best, but I assume we might only be able to do so at runtime. If we can detect at runtime that cancel has not actually cancelled the request, I'd like a warning (or even an Error) to be logged so we can then file issues for all teams who have not implemented this.

At a task manager level, we should be ensuring that the state returned by a cancelled task is not persisted.

++ 100% agree. And this should appear in the event log IMHO so that we can expose this in the UI and give customers insight into the performance/operations of their rules. Ideally the UI will also link to a documentation page that explains what this means and how they can address it.

3. Surface execution durations to the user in the stack management UI

Agree with everything here. Let's loop in @mdefazio and see if he has ideas for how best to expose these health indicators on the rule details page.

mdefazio commented 3 years ago

@gmmorris Yes, happy to do so. We are currently working on the Rules page to get an Observability solution in place. The current plan is to simply bring over the Security version. However, I know monitoring/health are top concerns for Security as well and we have a task to update these screens. It would definitely be helpful to know Stack's needs during this process.

ymao1 commented 3 years ago

@gmmorris Yes, happy to do so. We are currently working on the Rules page to get an Observability solution in place. The current plan is to simply bring over the Security version. However, I know monitoring/health are top concerns for Security as well and we have a task to update these screens. It would definitely be helpful to know Stack's needs during this process.

@mdefazio I tagged you on the specific UI issue for this. Thanks for your help!

pmuellr commented 3 years ago

Note that while we can provide a cancellableSearch and a cancellation token, it will be up to each individual rule type to make use of them to short circuit rule execution.

Can we enforce this somehow? Or at least nudge teams into doing it? If we can detect this sort of thing at compile time that's obviously best, but I assume we might only be able to do so at runtime. If we can detect at runtime that cancel has not actually cancelled the request, I'd like a warning (or even an Error) to be logged so we can then file issues for all teams who have not implemented this.

Not clear how we could enforce it, but I'm also hopeful there might be some way to provide an augmented es/so client (or use the existing one with some config we're not aware of), to force the time out. So the rule would presumably end up throwing an error on an es/so call after the timeout was exceeded. Maybe it's just a wrapper around the clients that checks the timeout and throws an error if ever invoked. Seems like some unexplored areas to me.
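
One way such a wrapper could look, as a sketch only (assuming we can intercept the scoped clients handed to the rule executor):

```ts
// Wrap the scoped clients handed to a rule executor so that any call made
// after the task's timeout has elapsed throws, ending the execution at the
// next ES/SO call. Names and approach are illustrative.
function wrapWithDeadline<T extends object>(client: T, deadline: Date): T {
  return new Proxy(client, {
    get(target, prop, receiver) {
      const original = Reflect.get(target, prop, receiver);
      if (typeof original !== 'function') {
        return original;
      }
      return (...args: unknown[]) => {
        if (Date.now() > deadline.getTime()) {
          throw new Error(`Rule execution exceeded its timeout at ${deadline.toISOString()}`);
        }
        return original.apply(target, args);
      };
    },
  });
}
```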

Other than that, logging a warning on the timeouts would be good, if we're not doing that already.

gmmorris commented 3 years ago

BTW found this oldie :) Worth resurrecting or closing as dup: https://github.com/elastic/kibana/issues/64148

pmuellr commented 3 years ago

At a task manager level, we should be ensuring that the state returned by a cancelled task is not persisted.

++ 100% agree. And this should appear in the event log IMHO so that we can expose this in the UI and give customers insight into the performance/operations of their rules. Ideally the UI will also link to a documentation page that explains what this means and how they can address it.

Similar to the disabled and enabled event log docs I mentioned in a different issue, probably makes sense to write an event for rule-exceeded-timeout (and one for connectors as well), regardless of whether the handler deals with it or not. I think we'll need to see the full flow once we get this running to figure out what to write and when. It's possible it would make more sense to indicate the timeout on the execute event instead, but guessing not - a new independent event like this, that's not going to be written that often, seems fine.

But if we want this in the UI, then we'll want this in the execution status in the rule, not just in the event log. I'm poking through that stuff right now for the RAC work; there may already be a decent fit, as we already have these "failure" reasons, so we could probably add a new reason of "execute-timeout" or similar:

https://github.com/elastic/kibana/blob/6991f22e97ea8866cee3b47e009484bf5c500b82/x-pack/plugins/alerting/common/alert.ts#L27-L34
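
For illustration, the change could be as small as adding one member alongside the existing reasons (the enum name and existing members here are written from memory of the linked lines and may not match the file exactly):

```ts
// Hypothetical addition of a timeout-specific reason to the rule execution
// status, so the UI can distinguish timeouts from other execution failures.
enum AlertExecutionStatusErrorReasons {
  // ...existing reasons such as read, decrypt, execute, unknown
  ExecuteTimeout = 'execute-timeout',
}
```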

ymao1 commented 3 years ago

Not clear how we could enforce it, but I'm also hopeful there might be some way to provide an augmented es/so client (or use the existing one with some config we're not aware of), to force the time out. So the rule would presumably end up throwing an error on an es/so call after the timeout was exceeded. Maybe it's just a wrapper around the clients that checks the timeout and throws an error if ever invoked. Seems like some unexplored areas to me.

@pmuellr Yes! That's exactly what I explored in my POC. The Elasticsearch client search function has a built-in abort() that we can take advantage of; it throws an error when aborted (effectively stopping rule execution). Seems to work pretty well as long as the long pole of the execution is the query itself.

Other than that, logging a warning on the timeouts would be good, if we're not doing that already.

Yep! Task manager is already logging a warning when tasks time out:

server    log   [10:07:29.146] [warning][plugins][taskManager] Cancelling task alerting:.es-query "4a5d2cf0-1240-11ec-9dd4-b5bfaab95f1d" as it expired at 2021-09-10T14:07:29.074Z after running for 01m 00s (with timeout set at 1m).

gmmorris commented 3 years ago

Do we feel this issue influences the priority of this related issue?

I know they are only related, but it does make the point that other task types can affect rule and action execution.

pmuellr commented 3 years ago

I know they are only related, but it does make the point that other task types can affect rule and action execution.

Yes and no :-)

We may need some special attention on long running connectors, and then I think we'd lump all the other task types into "other task types".

I think both of these are lower priority than rules, as we generally haven't seen a problem with them, and in fact at least one task specifically has a long timeout because it was designed to "spread out its work" over time (can't remember which one right now tho).

So - yes, we probably have some work to do for connectors and "other task types" - and presumably any changes we'd make for "other tasks" could affect the rule cancellation work. But I think the rule cancellation work takes priority over everything else.

gmmorris commented 3 years ago

Turning this issue into a meta issue now that the RFC is nearing approval and we have a bunch of follow-up issues.

Thanks for all the hard work @ymao1 !

Follow up issues:

ymao1 commented 3 years ago

Removing from project board since it is now a Meta issue.