elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Discuss] Should the framework drop accumulated alerts (and their related notifications) when a rule run times out? #139237

Open mikecote opened 2 years ago

mikecote commented 2 years ago

The alerting framework currently adopts an all-or-nothing approach when it comes to persisting alerts and sending notifications. If a rule run times out, all accumulated alerts are dropped and nothing is persisted for that rule run.
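To make the described behavior concrete, here is a minimal sketch of that all-or-nothing logic; the names and shapes are illustrative, not the actual task runner code:

```ts
// Illustrative types; not Kibana's real alerting interfaces.
interface Alert {
  id: string;
  actionGroup: string;
}

interface RuleRunResult {
  alerts: Alert[];
  notifications: string[];
}

// Race the rule executor against its timeout. If the timeout wins,
// everything the run accumulated is discarded: no alerts are persisted
// and no notifications are scheduled.
async function runRule(
  execute: () => Promise<RuleRunResult>,
  timeoutMs: number
): Promise<RuleRunResult | undefined> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error('rule run timed out')), timeoutMs)
  );
  try {
    return await Promise.race([execute(), timeout]);
  } catch {
    // All-or-nothing: a timed-out run contributes nothing.
    return undefined;
  }
}
```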

I'm opening this issue to discuss whether the framework should do something with the accumulated alerts when it encounters a timeout. I'm wondering if we can take what we learned from alert circuit breakers and apply similar logic on timeout as we do when the alert circuit breaker is hit 🤔

cc @shanisagiv1

elasticmachine commented 2 years ago

Pinging @elastic/response-ops (Team:ResponseOps)

pmuellr commented 2 years ago

Is there some context on why we did it this way, i.e., the way it's working today?

I guess one thing is that if a rule times out after only handling some of its queries, it may have marked some alerts as still active, but pretend it would have marked some other ones active too had it not timed out. Those other alerts will now recover, but we probably don't want that. Almost feels like you could accept active alerts from a timed-out rule, but shouldn't process recovered alerts from such runs: those alerts would remain active (or in some other non-recovered state like maybe-active?)
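One hedged way to picture that suggestion (hypothetical helper, not an existing framework function): active alerts from a timed-out run are accepted, but recoveries are suppressed because the run may not have gotten far enough to re-detect every still-active alert.

```ts
type AlertId = string;

// Hypothetical merge of the previous cycle's active alerts with the
// alerts reported during the current (possibly timed-out) run.
function mergeAlertState(
  previouslyActive: Set<AlertId>,
  reportedThisRun: Set<AlertId>,
  runTimedOut: boolean
): { active: Set<AlertId>; recovered: Set<AlertId> } {
  if (!runTimedOut) {
    // Normal run: anything previously active that was not re-reported recovers.
    const recovered = new Set(
      [...previouslyActive].filter((id) => !reportedThisRun.has(id))
    );
    return { active: reportedThisRun, recovered };
  }
  // Timed-out run: accept new/continuing active alerts, but recover nothing;
  // not-re-reported alerts stay active (or in a "maybe-active" state).
  return {
    active: new Set([...previouslyActive, ...reportedThisRun]),
    recovered: new Set(),
  };
}
```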

ymao1 commented 2 years ago

If I recall correctly, we did it this way to avoid timed-out runs possibly overwriting the state for rules that run OK. Even though we are cancelling ES queries and providing rule type executors with services they can check to see whether the run is cancelled, we don't have 100% adoption, so we still can't guarantee that when a run is cancelled, execution completely stops. It could be that a rule runs the ES query but then takes 10 minutes to post-process the results; the run times out during that window, but the rule keeps running. The next execution of this rule gets picked up and finishes within the timeout, updating the task manager state with its latest results. Then the previous execution finally finishes. If we processed those alerts and persisted them to the task state, we would overwrite the state from a newer execution and send outdated notifications.

Ideally, when we know for sure that cancelling a rule 100% stops the execution, we could look into doing something like what the alert circuit breaker does.
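For illustration, here is a minimal sketch of the cooperative-cancellation pattern described above: the framework flips a flag the executor can poll, but since adoption is opt-in, an executor that never checks it will keep running past the timeout. The interface and names are illustrative, not Kibana's real ones.

```ts
// Illustrative stand-in for the services handed to a rule type executor.
interface ExecutorServices {
  // Returns true once the framework has cancelled (timed out) this run.
  shouldStopExecution(): boolean;
}

async function exampleExecutor(services: ExecutorServices) {
  const phases = ['query-1', 'post-process-1', 'query-2'];
  for (const phase of phases) {
    // Without this check between expensive phases, the run continues past
    // the timeout and can later overwrite state written by a newer run.
    if (services.shouldStopExecution()) {
      return;
    }
    await runExpensivePhase(phase);
  }
}

async function runExpensivePhase(phase: string): Promise<void> {
  // Placeholder for an ES query or post-processing step.
  await new Promise((resolve) => setTimeout(resolve, 100));
  console.log(`finished ${phase}`);
}
```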

mikecote commented 2 years ago

Yeah, I think it'll take time to be 100% sure that cancelling a rule stops the execution. But I wonder if we should do anything with the alerts that did get reported before the timeout occurred.

For example, with a 5 minute timeout:

- minute 0 - start rule execution
- minute 0 - start query
- minute 4 - query returns partial alerts
- minute 4 - report alerts A, B, C to the platform
- minute 4 - start a second query
- minute 5 - timeout error! ...

I wonder if the framework should do something with alerts A, B, and C, which were reported prior to the timeout occurring. Hopefully my example is clear; happy to make some diagrams or discuss synchronously.
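One hypothetical shape for this, modeled loosely on the alert circuit breaker behavior mentioned earlier (none of this is the framework's current behavior): when the timeout fires, flush the alerts that were already reported, flagged as partial so recovery decisions aren't trusted for that cycle.

```ts
interface ReportedAlert {
  id: string;
}

// Hypothetical accumulator for alerts reported during a single rule run.
class AlertAccumulator {
  private reported: ReportedAlert[] = [];

  report(alert: ReportedAlert) {
    this.reported.push(alert);
  }

  // On timeout, persist the partial set (A, B, C in the timeline above)
  // instead of dropping it, and mark the run as partial so downstream
  // consumers skip recovery processing for this cycle.
  flushOnTimeout(): { alerts: ReportedAlert[]; partial: boolean } {
    return { alerts: this.reported, partial: true };
  }
}
```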