elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Alert Summaries] Add a data provider so the rule registry can provide alerts over a time span (or past execution) #143374

Closed: ersin-erdal closed this issue 2 years ago

ersin-erdal commented 2 years ago

Meta: https://github.com/elastic/kibana/issues/143200

In order to add alerts (as the output of the Alert Summaries feature) to the action context, we need to fetch them from the alerts-as-data index. Since the rule registry already writes the alerts-as-data, we will add a new provider to read the data as well.

We need a function attached to the rule type definition (automated by the rule registry) that provides a search function to gather the alerts created during the last run or over a certain time span. The search function should be able to return new, ongoing, and recovered alerts for a given time (last run or date range).

See https://github.com/elastic/kibana/issues/143200 for the terminology of new, ongoing, and recovered, which should help in building the queries.
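As a rough sketch of the requested hook, something like the following could hang off the rule type definition. All names here (`SummarizedAlerts`, `GetSummarizedAlertsFn`, etc.) are hypothetical placeholders for illustration, not a settled API:

```typescript
// Hypothetical shape for the alert-summary data provider; every name in this
// sketch is illustrative, not the final Kibana API.

interface AlertHit {
  _id: string;
  _source: Record<string, unknown>;
}

interface SummarizedAlerts {
  new: AlertHit[];
  ongoing: AlertHit[];
  recovered: AlertHit[];
}

// Summarize either the last execution or an arbitrary date range.
type GetSummarizedAlertsParams =
  | { executionUuid: string; ruleId: string }
  | { start: Date; end: Date; ruleId: string };

// The function the rule registry would attach to the rule type definition.
type GetSummarizedAlertsFn = (
  params: GetSummarizedAlertsParams
) => Promise<SummarizedAlerts>;

// Trivial in-memory stand-in so the shape can be exercised.
const getSummarizedAlertsMock: GetSummarizedAlertsFn = async () => ({
  new: [],
  ongoing: [],
  recovered: [],
});
```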

mikecote commented 2 years ago

I'm hoping we have the necessary date fields in the alerts-as-data indices to make these queries.

ymao1 commented 2 years ago

The rule registry already provides a RuleDataClient with a getReader() function that returns a search() function that takes a query. The search is pre-scoped to the alerts-as-data indices associated with the rule data client. I would like to find a way to expose this search function to the alerting framework so we don't have to recreate the logic of figuring out which index name to search for a specific rule type. Then we could have some query builder functions to build specific queries to be passed into the search function. WDYT of this approach @ersin-erdal @mikecote?
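One possible wiring, sketched with hypothetical names (`registerAlertReader`, `getAlertReader`): the rule registry hands the framework the already-scoped search function at registration time, so the framework never has to reconstruct index names itself. The real `RuleDataClient`/`getReader()` plumbing lives in the rule registry; this is only a sketch of the hand-off:

```typescript
// A search function already scoped to the alerts-as-data indices for one
// rule type (what RuleDataClient.getReader().search would provide).
type ScopedSearchFn = (query: object) => Promise<unknown>;

const alertReadersByRuleType = new Map<string, ScopedSearchFn>();

// Hypothetical hook: the rule registry calls this when it builds the
// RuleDataClient for a rule type, handing the framework the scoped search.
function registerAlertReader(ruleTypeId: string, search: ScopedSearchFn): void {
  alertReadersByRuleType.set(ruleTypeId, search);
}

// The framework looks up the pre-scoped search without knowing index names.
function getAlertReader(ruleTypeId: string): ScopedSearchFn {
  const search = alertReadersByRuleType.get(ruleTypeId);
  if (!search) {
    throw new Error(`No alert reader registered for rule type ${ruleTypeId}`);
  }
  return search;
}
```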

ymao1 commented 2 years ago

It looks like the @timestamp for an alert document is set/updated to the task.startedAt time each time it is written out. So new alerts will set @timestamp: task.startedAt and then if the alert is active in the next run (ongoing), the @timestamp will get updated.

- New alerts set kibana.alert.start to the @timestamp value.
- Ongoing alerts keep the kibana.alert.start field the same.
- Recovered alerts set kibana.alert.end to the @timestamp value.

With this combination of timestamps, I think we have enough to query for what we need.

I think for the previous run, this should be pretty straightforward.

For an arbitrary time range, we should be able to query the @timestamp using that range to get all alerts created/updated during that time range. Then based on the kibana.alert.start time, we can determine whether the alert was new or ongoing during that time range. Using the kibana.alert.end time, we can determine whether the alert recovered during that time range.
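Assuming those field semantics, the three classifications could be sketched as bool filters like the following. This is illustrative only (the function name is made up, and the exact gte/lt boundary choices would still need to be pinned down):

```typescript
// Hypothetical query-builder sketch: classify alert docs touched during
// [start, end] as new, ongoing, or recovered using the field semantics above.

interface TimeRange {
  start: string; // ISO timestamp
  end: string;
}

type AlertScope = 'new' | 'ongoing' | 'recovered';

function buildAlertScopeFilter(scope: AlertScope, range: TimeRange): object {
  // Every alert written during the range has @timestamp inside it.
  const touchedInRange = {
    range: { '@timestamp': { gte: range.start, lte: range.end } },
  };
  switch (scope) {
    case 'new':
      // New: the alert started inside the range.
      return {
        bool: {
          filter: [
            touchedInRange,
            { range: { 'kibana.alert.start': { gte: range.start, lte: range.end } } },
          ],
        },
      };
    case 'ongoing':
      // Ongoing: started before the range and has not ended.
      return {
        bool: {
          filter: [
            touchedInRange,
            { range: { 'kibana.alert.start': { lt: range.start } } },
          ],
          must_not: [{ exists: { field: 'kibana.alert.end' } }],
        },
      };
    case 'recovered':
      // Recovered: the alert ended inside the range.
      return {
        bool: {
          filter: [
            touchedInRange,
            { range: { 'kibana.alert.end': { gte: range.start, lte: range.end } } },
          ],
        },
      };
    default:
      throw new Error(`Unknown scope: ${scope}`);
  }
}
```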

Does that align with the requirements?

mikecote commented 2 years ago

> The rule registry already provides a RuleDataClient with a getReader() function that returns a search() function that takes a query. The search is pre-scoped to the alerts as data indices associated with the rule data client. I would like to find a way to expose this search function to the alerting framework so we don't have to recreate the logic of figuring out what index name to search for a specific rule type. Then we could have some query builder functions to build specific queries to be passed into the search function. WDYT of this approach @ersin-erdal @mikecote

I'm good with whatever is simple. When the FAAD is in place, we'll be removing this code, so whatever is easier will help us. Regarding finding a way to expose this searching to the framework, I'll post my idea, but don't feel obligated to use it:

> For an arbitrary time range, we should be able to query the @timestamp using that range to get all alerts created/updated during that time range. Then based on the kibana.alert.start time, we can determine whether the alert was new or ongoing during that time range. Using the kibana.alert.end time, we can determine whether the alert recovered during that time range.
>
> Does that align with the requirements?

Makes sense to me, it matches the mental model I have.

ersin-erdal commented 2 years ago

Should we think about capping the data? E.g., don't report more than 2000 alerts. If yes, we also need to return the real number of alerts.

ymao1 commented 2 years ago

With the alert circuit breaker in place, which defaults to 1000, I believe the most we should return is 1000 new/active and 1000 recovered.

ymao1 commented 2 years ago

@ersin-erdal I wasn't thinking about the query per time range, which can return many more alerts.

Right now, I'm querying based on time range and then splitting up the results into new/ongoing/recovered. If we're capping the number of results returned but still want the actual number of new/ongoing/recovered, we would have to push the condition to ES via a scripted query. Are we ok with scripted queries? They tend to be less performant.

cc @mikecote

mikecote commented 2 years ago

Good catch, @ymao1! I think it would be valuable to get the total count while limiting how many new, ongoing, and recovered alerts we return. You can also weigh multiple queries vs scripted queries if one is easier and more performant. We can also apply the 1000 limit to each array (new, ongoing, recovered).
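The multiple-queries option could look roughly like this: one capped request per category, with `track_total_hits` enabled so Elasticsearch still reports the true total alongside the bounded hit set. The helper name is made up and the 1000 cap just mirrors the circuit breaker default discussed above; this is only a sketch:

```typescript
// Cap mirroring the alert circuit breaker default discussed in this thread.
const ALERT_LIMIT = 1000;

// Hypothetical helper: wrap a category filter in a capped search request
// that still asks ES for the real total hit count.
function buildCappedRequest(filter: object): object {
  return {
    size: ALERT_LIMIT,      // bound how many alert docs we return
    track_total_hits: true, // but still get the true total from ES
    query: { bool: { filter: [filter] } },
  };
}

// One request per category; the filters would come from the query builders
// discussed above (ranges on @timestamp / kibana.alert.start / kibana.alert.end).
const requests = {
  new: buildCappedRequest({ range: { 'kibana.alert.start': { gte: 'now-1d' } } }),
  recovered: buildCappedRequest({ range: { 'kibana.alert.end': { gte: 'now-1d' } } }),
};
```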

ymao1 commented 2 years ago

@ersin-erdal @mikecote The lifecycle executors write out a different alert document for a single alert if it goes from active to recovered, then active again. Do we want to consider that one "new" alert?

For example, if we're querying against the last day of data and we have an alert for host-1 that became active, then recovered, then became active again in the day, we would be getting 2 new alerts (with different UUIDs) and 1 recovered alert for the host-1 alertId. If we only considered unique alertIds, we would only get 1 new and 1 recovered for that time range.

mikecote commented 2 years ago

> @ersin-erdal @mikecote The lifecycle executors write out a different alert document for a single alert if it goes from active to recovered, then active again. Do we want to consider that one "new" alert?

Yes, we should consider each alert separate and return both. I'm thinking this is best so the summary matches the changes that happened to the alert documents.