elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Alerting] Investigate performing gap analysis at a framework level #113562

Open ymao1 opened 2 years ago

ymao1 commented 2 years ago

Related to the investigation into long-running rules, we should look into bringing the gap analysis that Security Solution does in its rules into the framework. This would let us see the gaps in data that might be missed by rules that are configured to run on a short interval but consistently take a long time to run.

For example, a rule configured to run every minute, looking back over the last minute, that takes 3 minutes to run will consistently miss data:

0:01 - rule execution 1 starts and queries over 0:00 - 0:01
0:02 - rule execution 2 should start but doesn't bc rule execution 1 hasn't finished yet
0:03 - rule execution 3 should start but doesn't bc rule execution 1 hasn't finished yet
0:04 - rule execution 1 finishes, rule execution 2 starts and queries over 0:03 - 0:04

We can see that there is a gap between 0:01 and 0:03 where alerts might have been missed.
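To make the timeline above concrete, here is a minimal sketch that simulates it and reports the uncovered ranges. All names (`Execution`, `simulateExecutions`, `findGaps`) are hypothetical and not part of the alerting framework; times are plain minutes since 0:00, and the sketch assumes a new execution cannot start until the previous one has finished.

```typescript
// Hypothetical types for illustration only; not Kibana alerting framework APIs.
interface Execution {
  startedAt: number; // minutes since 0:00
  queryFrom: number; // start of the queried range
  queryTo: number; // end of the queried range
}

interface Gap {
  from: number;
  to: number;
}

// Simulate a rule that runs every `intervalMin`, looks back `lookbackMin`,
// and takes `durationMin` to finish. The next execution starts at the later
// of "one interval after the previous start" and "when the previous run finished".
function simulateExecutions(
  intervalMin: number,
  lookbackMin: number,
  durationMin: number,
  count: number
): Execution[] {
  const executions: Execution[] = [];
  let start = intervalMin; // first execution starts one interval after 0:00
  for (let i = 0; i < count; i++) {
    executions.push({ startedAt: start, queryFrom: start - lookbackMin, queryTo: start });
    start = Math.max(start + intervalMin, start + durationMin);
  }
  return executions;
}

// A gap exists wherever a query range starts after the previous one ended.
function findGaps(executions: Execution[]): Gap[] {
  const gaps: Gap[] = [];
  for (let i = 1; i < executions.length; i++) {
    const previousTo = executions[i - 1].queryTo;
    const currentFrom = executions[i].queryFrom;
    if (currentFrom > previousTo) {
      gaps.push({ from: previousTo, to: currentFrom });
    }
  }
  return gaps;
}

// The example from this issue: 1 minute interval, 1 minute lookback, 3 minute runtime.
const exampleGaps = findGaps(simulateExecutions(1, 1, 3, 3));
console.log(exampleGaps); // gaps cover minutes 1..3 and 4..6
```

With a runtime of 1 minute instead of 3, the same call returns no gaps, since each query range begins where the previous one ended.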

We should look into what Security Solution is doing with respect to gap analysis and what we can do at a framework level to (1) inform users of possible gaps and (2) enable all rule types to cover any gaps.

elasticmachine commented 2 years ago

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

gmmorris commented 2 years ago

Just a word of warning that this is quite domain-specific, and not something the framework should necessarily make assumptions about.

Security vs. Observability

In Security Solution it makes sense that all data should be evaluated when there is a gap, as you want to know what happened while the system was saturated and delayed. A complete record is critical for the security use case to be reliable.

In Observability, on the other hand, a gap doesn't necessarily warrant evaluation of the historical data, especially if the longer query time means further impacting real-time data in the rule. It's more important for me to know what the state is now than what it was half an hour ago. Obviously, completeness is valuable, but it is less valuable than real-time data, and it would be better to have a gap than to delay real-time data.

What should we provide at framework level?

The fact that this is domain specific aligns with how the framework handles things now - queries are entirely in the hands of the rule types themselves, as they know what it is that they want to query, while the framework provides the mechanism for doing so.

What I do think the framework should provide is a unified approach to identifying that a gap has occurred and getting the information required to remediate it (such as the timestamp of the last execution). For the most part I believe this is already the case, but it would be worth validating.
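As a rough sketch of that split, the framework side could be as small as a gap check based on the last execution timestamp, with the decision about what to query left to the rule type. Everything here is an assumption for illustration: `detectGap`, `chooseQueryRange`, and `GapPolicy` are hypothetical names rather than existing framework APIs, and the sketch assumes the previous run's query range ended at its execution timestamp.

```typescript
// Hypothetical sketch; none of these names exist in the alerting framework.
interface TimeRange {
  from: Date;
  to: Date;
}

// Framework side: report whether the current lookback window leaves
// part of the time since the last execution uncovered.
function detectGap(
  lastExecutionAt: Date | undefined,
  now: Date,
  lookbackMs: number
): TimeRange | null {
  if (!lastExecutionAt) {
    return null; // first run, nothing to compare against
  }
  const windowStart = new Date(now.getTime() - lookbackMs);
  return windowStart.getTime() > lastExecutionAt.getTime()
    ? { from: lastExecutionAt, to: windowStart }
    : null;
}

// Rule type side: each domain decides what to do with a reported gap.
// "backfill" widens the query to cover the gap (completeness first);
// "skip" keeps the normal window (real-time data first).
type GapPolicy = "backfill" | "skip";

function chooseQueryRange(
  gap: TimeRange | null,
  now: Date,
  lookbackMs: number,
  policy: GapPolicy
): TimeRange {
  const defaultFrom = new Date(now.getTime() - lookbackMs);
  const from = gap !== null && policy === "backfill" ? gap.from : defaultFrom;
  return { from, to: now };
}
```

In this sketch a security-style rule would pass `"backfill"` and an observability-style rule would pass `"skip"`, while both rely on the same `detectGap` result to surface the gap to the user.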

I'd definitely like to hear thoughts from @arisonl @MikePaquette and @cyrille-leclerc on this before we make any changes at framework level. :)

pmuellr commented 2 years ago

There are some other related aspects to this, like de-duping data received from previous rule runs when doing calculations in new rule runs. Some rules do this, some don't. It would be nice to have some consistency, even if just by name, for these sorts of things.

pmuellr commented 2 years ago

Another common related theme is a "window" - I think almost every rule has a concept of this, so it probably makes sense to think about this at a framework level.