I'm hoping we have the necessary date fields in the alerts-as-data indices to make these queries:
The rule registry already provides a `RuleDataClient` with a `getReader()` function that returns a `search()` function that takes a query. The search is pre-scoped to the alerts-as-data indices associated with the rule data client. I would like to find a way to expose this search function to the alerting framework so we don't have to recreate the logic of figuring out what index name to search for a specific rule type. Then we could have some query builder functions to build specific queries to be passed into the search function. WDYT of this approach @ersin-erdal @mikecote
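As a rough sketch (the wrapper and how it would be handed to the framework are assumptions, not the existing API):

```ts
// Hypothetical sketch: wrap the rule registry's pre-scoped reader so the alerting
// framework can search alert documents without knowing which alerts-as-data
// index backs a given rule type.

// Assumed minimal shape of what ruleDataClient.getReader() hands back.
interface AlertsReader {
  search(request: Record<string, unknown>): Promise<unknown>;
}

// The framework would only ever see this function, already scoped to the
// correct indices for the rule type.
export type ScopedAlertsSearch = (request: Record<string, unknown>) => Promise<unknown>;

export const createScopedAlertsSearch =
  (reader: AlertsReader): ScopedAlertsSearch =>
  (request) => reader.search(request);
```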
It looks like the `@timestamp` for an alert document is set/updated to the `task.startedAt` time each time it is written out. So new alerts will set `@timestamp: task.startedAt`, and then if the alert is active in the next run (ongoing), the `@timestamp` will get updated.

- New alerts set `kibana.alert.start` to the `@timestamp` value.
- Ongoing alerts keep the `kibana.alert.start` field the same.
- Recovered alerts set a `kibana.alert.end` to the `@timestamp` value.
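As an illustration, the same alert might carry these timestamps across runs (values are made up; only the fields discussed above are shown):

```ts
// Made-up example documents showing only the timestamp fields discussed above.
const newAlert = {
  '@timestamp': '2022-10-18T10:00:00.000Z',         // task.startedAt of the run that created it
  'kibana.alert.start': '2022-10-18T10:00:00.000Z', // same as @timestamp when first written
};

const ongoingAlert = {
  '@timestamp': '2022-10-18T10:01:00.000Z',         // updated to the latest run's task.startedAt
  'kibana.alert.start': '2022-10-18T10:00:00.000Z', // unchanged while the alert stays active
};

const recoveredAlert = {
  '@timestamp': '2022-10-18T10:02:00.000Z',
  'kibana.alert.start': '2022-10-18T10:00:00.000Z',
  'kibana.alert.end': '2022-10-18T10:02:00.000Z',   // set to @timestamp when the alert recovers
};
```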
With this combination of timestamps, I think we have enough to query for what we need.
I think for the previous run, this should be pretty straightforward.

For an arbitrary time range, we should be able to query the `@timestamp` using that range to get all alerts created/updated during that time range. Then, based on the `kibana.alert.start` time, we can determine whether the alert was new or ongoing during that time range. Using the `kibana.alert.end` time, we can determine whether the alert recovered during that time range.
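A rough sketch of such a query and the split (the field handling and cut-offs here are assumptions based on the description above, not the actual implementation):

```ts
// Hypothetical: fetch every alert document written/updated during the range,
// then classify each hit using kibana.alert.start / kibana.alert.end.
const searchBody = (start: string, end: string) => ({
  size: 1000,
  query: { range: { '@timestamp': { gte: start, lt: end } } },
});

interface AlertTimestamps {
  'kibana.alert.start'?: string;
  'kibana.alert.end'?: string;
}

// ISO-8601 timestamps in the same format compare correctly as plain strings.
function classify(hit: AlertTimestamps, start: string, end: string): 'new' | 'ongoing' | 'recovered' {
  const alertStart = hit['kibana.alert.start'];
  const alertEnd = hit['kibana.alert.end'];
  if (alertEnd && alertEnd >= start && alertEnd < end) return 'recovered';
  if (alertStart && alertStart >= start && alertStart < end) return 'new';
  return 'ongoing';
}
```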
Does that align with the requirements?
> The rule registry already provides a `RuleDataClient` with a `getReader()` function that returns a `search()` function that takes a query. The search is pre-scoped to the alerts-as-data indices associated with the rule data client. I would like to find a way to expose this search function to the alerting framework so we don't have to recreate the logic of figuring out what index name to search for a specific rule type. Then we could have some query builder functions to build specific queries to be passed into the search function. WDYT of this approach @ersin-erdal @mikecote
I'm good with whatever is simple. When the FAAD is in place, we'll be removing this code so whatever is easier will help us. In regards to finding a way to expose this searching to the framework, I'll post my idea but don't feel obligated to use it:
`getSummarizedAlerts(startTime, endTime)`

- New alerts: `kibana.alert.start` within the range
- Ongoing alerts: `kibana.alert.start` before the startTime and no value for `kibana.alert.end`
- Recovered alerts: `kibana.alert.end` within the range

> For an arbitrary time range, we should be able to query the `@timestamp` using that range to get all alerts created/updated during that time range. Then based on the `kibana.alert.start` time, we can determine whether the alert was new or ongoing during that time range. Using the `kibana.alert.end` time, we can determine whether the alert recovered during that time range. Does that align with the requirements?
Makes sense to me, it matches the mental model I have.
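A minimal sketch of the result shape that idea implies (the names and structure are assumptions drawn from the criteria above, not a settled API):

```ts
// Hypothetical result shape for getSummarizedAlerts; everything here is an
// assumption based on this thread.
interface SummarizedAlerts {
  new: unknown[];       // kibana.alert.start within [startTime, endTime)
  ongoing: unknown[];   // kibana.alert.start before startTime and no kibana.alert.end
  recovered: unknown[]; // kibana.alert.end within [startTime, endTime)
}

type GetSummarizedAlerts = (startTime: string, endTime: string) => Promise<SummarizedAlerts>;
```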
Should we think about capping the data? e.g. don't report more than 2000 alerts. If yes, we also need to return the real number of alerts.
With the alert circuit breaker in place, which defaults to 1000, I believe the most we should return is 1000 new/active and 1000 recovered.
@ersin-erdal I wasn't thinking about the query per time range, which can return many more alerts.
Right now, I'm querying based on time range and then splitting up the results into new/ongoing/recovered. If we're capping the number of results returned but still want the actual number of new/ongoing/recovered, we would have to push the condition to ES via a scripted query. Are we ok with scripted queries? They tend to be less performant.
cc @mikecote
Good catch, @ymao1! I think it would be valuable to get the total count while limiting how many new, ongoing and recovered alerts we return. You can also weigh multiple queries vs scripted queries if ever one is easier and more performant. We can also apply the 1000 limit to each array (new, ongoing, recovered).
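One way to weigh those options (a sketch only; it assumes the new/ongoing/recovered conditions can be expressed as plain filters rather than a script) is to cap the returned hits while letting filter aggregations report the real totals:

```ts
// Hypothetical search body: return at most 1000 documents but still get the
// total number of new / ongoing / recovered alerts via filter aggregations.
const summaryQuery = (start: string, end: string) => ({
  size: 1000,
  track_total_hits: true,
  query: { range: { '@timestamp': { gte: start, lt: end } } },
  aggs: {
    new: { filter: { range: { 'kibana.alert.start': { gte: start, lt: end } } } },
    ongoing: {
      filter: {
        bool: {
          filter: [{ range: { 'kibana.alert.start': { lt: start } } }],
          must_not: [{ exists: { field: 'kibana.alert.end' } }],
        },
      },
    },
    recovered: { filter: { range: { 'kibana.alert.end': { gte: start, lt: end } } } },
  },
});
// Each aggregation returns a doc_count, so the real totals stay available even
// when the hits themselves are capped at 1000.
```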
@ersin-erdal @mikecote The lifecycle executors write out a different alert document for a single alert if it goes from active to recovered, then active again. Do we want to consider that one "new" alert?
For example, if we're querying against the last day of data and we have an alert for `host-1` that became active, then recovered, then became active again in the day, we would be getting 2 new alerts (with different UUIDs) and 1 recovered alert for the `host-1` alertId. If we only considered unique alertIds, we would only get 1 new and 1 recovered for that time range.
> @ersin-erdal @mikecote The lifecycle executors write out a different alert document for a single alert if it goes from active to recovered, then active again. Do we want to consider that one "new" alert?
Yes, we should consider each alert separate and return both. I'm thinking this is best so the summary matches the changes that happened to the alert documents.
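To illustrate (assuming the rule registry's `kibana.alert.uuid` and `kibana.alert.instance.id` fields carry the UUID and the alertId), the `host-1` scenario above would leave two documents:

```ts
// Illustrative only: the same alertId (host-1) going active -> recovered -> active
// in one day produces two documents with different UUIDs, so counting documents
// yields 2 new alerts and 1 recovered alert.
const firstEpisode = {
  'kibana.alert.instance.id': 'host-1',
  'kibana.alert.uuid': 'aaaa-1111',
  'kibana.alert.start': '2022-10-18T01:00:00.000Z',
  'kibana.alert.end': '2022-10-18T02:00:00.000Z', // recovered
};

const secondEpisode = {
  'kibana.alert.instance.id': 'host-1',
  'kibana.alert.uuid': 'bbbb-2222',
  'kibana.alert.start': '2022-10-18T05:00:00.000Z', // active again: a new document
};
```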
Meta: https://github.com/elastic/kibana/issues/143200
In order to add the alerts (as an output of the Alert Summaries feature) to the action context, we need to fetch the alerts from the alerts-as-data index. As the rule registry already writes the alerts-as-data documents, we add a new provider to read the data as well.
We need a function attached to the rule type definition (automated by the rule registry) that provides a search function to gather the alerts created during the last run or over a certain time span. The search function should be able to return new, ongoing and recovered alerts for a given time (last run or date range).
See https://github.com/elastic/kibana/issues/143200 for terminology of new, ongoing and recovered to help build queries.
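A minimal sketch of what attaching that to the rule type definition could look like (the property name, its placement, and the result shape are all assumptions based on this issue):

```ts
// Hypothetical: the rule registry would populate this property on the rule type
// so the framework can fetch summarized alerts without knowing the index name.
interface RuleTypeWithAlertReading {
  id: string;
  getSummarizedAlerts?: (
    start: string, // ISO date range, or the previous run's window
    end: string
  ) => Promise<{ new: unknown[]; ongoing: unknown[]; recovered: unknown[] }>;
}
```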