[Logs UI] [Alerting] Handle the "no data" case reliably

Kerry350 commented 4 years ago

Currently a log threshold alert can be configured with a threshold of less than 1, which effectively means 0. A value of 0 means that there is no data: a document count of 0.

This works fine for ungrouped scenarios, but it doesn't work reliably when a group by is applied.

The reason for this is the way the groups are queried for. We perform a composite aggregation to gather the groups, and then nested aggregations are performed to filter down that count to the count that matches the alert criteria.

However, when there are 0 documents we have the issue that no documents will exist with the group by field, and therefore we don't know about the groups.

We have a plaster in place which tries to mitigate this to a degree. We expand the time range by 1 x interval on the "left" and "right", so a "last 30 minutes" check effectively gathers 90 minutes of documents, to try and capture more documents, and thus groups. We then use nested aggregations to narrow the time range back down to the 30 minutes.

If there are still 0 documents in that widened time range, however, then we have no awareness of the groups.
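To make the padding approach concrete, here is a minimal sketch of what such a query could look like. It is not the actual Kibana implementation: `widenRange`, `buildGroupedQuery`, the aggregation names, the `@timestamp` default and the composite `size` are all assumptions for illustration, and the real nested aggregations also apply the alert criteria rather than only the time range.

```typescript
// Hypothetical helper names; a sketch of the padding idea, not Kibana's code.
interface TimeRange {
  gte: number; // epoch millis
  lte: number; // epoch millis
}

// Pad the range by one interval on each side, so that groups with no documents
// in the evaluated window may still be discovered by the composite aggregation.
function widenRange(range: TimeRange): TimeRange {
  const interval = range.lte - range.gte;
  return { gte: range.gte - interval, lte: range.lte + interval };
}

function buildGroupedQuery(
  groupByField: string,
  range: TimeRange,
  timestampField: string = "@timestamp"
) {
  const widened = widenRange(range);
  return {
    size: 0,
    query: {
      bool: {
        filter: [
          // Outer filter: the widened range, used only to discover groups.
          { range: { [timestampField]: { gte: widened.gte, lte: widened.lte, format: "epoch_millis" } } },
        ],
      },
    },
    aggs: {
      groups: {
        composite: {
          size: 1000, // assumed page size
          sources: [{ group: { terms: { field: groupByField } } }],
        },
        aggs: {
          // Nested filter: narrow back down to the window the alert evaluates.
          in_window: {
            filter: {
              range: { [timestampField]: { gte: range.gte, lte: range.lte, format: "epoch_millis" } },
            },
          },
        },
      },
    },
  };
}
```

A group that shows up in a bucket with an `in_window` count of 0 is a "no data" group. A group with no documents anywhere in the widened range produces no bucket at all, which is exactly the gap described above.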

Mitigations can be seen here and here.

To reliably offer "no data" alerts, we need awareness of which groups are supposed to exist, even when the composite aggregation can't gather them.

One solution could be to store the groups from the previous run in the alert state. In the next run we can then check whether any groups are missing; a missing group would imply "no data".

We should be careful with this as a user could select a field producing hundreds of groups, and alert state is placed in a saved object.

We could possibly mitigate the above by placing a TTL on the groups that we roll over between runs.
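A rough sketch of how the rolled-over groups with a TTL could work, assuming the state is a simple map of group key to the time it was last seen. The `KnownGroups` shape, the `trackGroups` name and its parameters are hypothetical, not anything that exists in the alerting framework today.

```typescript
// Hypothetical state shape: group key -> timestamp (ms) at which it was last seen.
type KnownGroups = Record<string, number>;

interface RunResult {
  noDataGroups: string[]; // known groups that produced no documents this run
  nextState: KnownGroups; // state to persist for the next run
}

function trackGroups(
  previous: KnownGroups,
  seenNow: string[],
  now: number,
  ttlMs: number
): RunResult {
  const seen = new Set(seenNow);
  const nextState: KnownGroups = {};
  const noDataGroups: string[] = [];

  // Groups seen in this run get a fresh "last seen" timestamp.
  for (const group of seenNow) {
    nextState[group] = now;
  }

  for (const group of Object.keys(previous)) {
    if (seen.has(group)) continue;
    const lastSeen = previous[group];
    // Past the TTL the group is dropped, which keeps the state bounded.
    if (now - lastSeen > ttlMs) continue;
    // Still within the TTL but missing from this run: treat as "no data".
    noDataGroups.push(group);
    nextState[group] = lastSeen; // roll over without refreshing the timestamp
  }

  return { noDataGroups, nextState };
}
```

Because the timestamp is not refreshed while a group is missing, a group stops being reported once it has been gone for longer than the TTL, so the state size is bounded by the number of groups seen within one TTL window.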

(ℹ️ Currently when an alert is disabled, the alert state will be lost).

elasticmachine commented 4 years ago

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

weltenwort commented 4 years ago

> One solution could be to store the groups from the previous run in the alert state. In the next run we can then check whether any groups are missing; a missing group would imply "no data".

On second thought, how is that actually different from how it currently works? It effectively also widens the grouping query by n check intervals at the cost of additional state. 🤔

Kerry350 commented 4 years ago

> > One solution could be to store the groups from the previous run in the alert state. In the next run we can then check whether any groups are missing; a missing group would imply "no data".
>
> On second thought, how is that actually different from how it currently works? It effectively also widens the grouping query by n check intervals at the cost of additional state. 🤔

My assumption was that the TTL we pick would be wider than what we "pad" with now, which is 1x interval for lte and gte, and those groups would continually "roll over" to the next runs until removed. However, it does just...move the problem. If said TTL still isn't big enough, we still "lose" the groups somewhere along the line. And the complexity of managing all of this would be pretty big (imo), for edge case payoff.

I've thought about this for a few days, and I can't actually think of a solution that works well 🤔

I'm tempted to say we shouldn't actually allow 0 values in group by scenarios at all, and instead require those to be turned into an ungrouped expression. The vast majority of cases could be converted from one to the other: rather than grouping by the field, the field would become part of the criteria (GROUP BY host.name → host.name MATCHES something-expected). This would probably increase the need for wildcard / regex comparator support, and it may raise the question of whether there should be an EXISTS comparator.
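As an illustration of that conversion, here is a small sketch. The types and the `toUngrouped` function are hypothetical and simplified (the real alert params look different), and the `MATCHES` comparator with a wildcard value assumes the wildcard support mentioned above.

```typescript
// Hypothetical, simplified types; not the real alert params.
type Comparator = "EQUALS" | "MATCHES";

interface Criterion {
  field: string;
  comparator: Comparator;
  value: string;
}

interface LogThresholdExpression {
  groupBy?: string[];
  criteria: Criterion[];
  count: { comparator: "LESS_THAN" | "MORE_THAN"; value: number };
}

// Turn "GROUP BY <field>" into an extra "<field> MATCHES <pattern>" criterion.
function toUngrouped(
  expression: LogThresholdExpression,
  expectedPattern: string
): LogThresholdExpression {
  const groupBy = expression.groupBy ?? [];
  if (groupBy.length !== 1) {
    throw new Error("This sketch only handles a single group by field");
  }
  return {
    // No groupBy in the result: the expression is now ungrouped.
    criteria: [
      ...expression.criteria,
      { field: groupBy[0], comparator: "MATCHES", value: expectedPattern },
    ],
    count: expression.count,
  };
}
```

The trade-off is that the user has to state up front which values they expect, instead of the alert discovering them; in exchange, a count of 0 is unambiguous because there are no groups to lose track of.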

TL;DR In my opinion, the only way we can accurately support "no data" is with our ungrouped expressions.

jasonrhodes commented 3 years ago

Refinement update: @Kerry350 mentioned that @gmmorris and possibly @MikePaquette may have context on how we could help with a fix for this at the framework level, or something similar. We will check with them.

jasonrhodes commented 3 years ago

Refinement note: This is different from the absolute "no data" scenario (no data at all). It applies to groups, because without knowing which groups are supposed to be there you can't know which ones disappeared.

Should we query alerts as data for this?

gmmorris commented 3 years ago

> Refinement update: @Kerry350 mentioned that @gmmorris and possibly @MikePaquette may have context on how we could help with a fix for this at the framework level, or something similar. We will check with them.

I suspect you meant @mikecote rather than @MikePaquette? 🤔

I had the idea that we could create alternate logic for "active alert", where an alert becomes active when it was detected previously but is no longer detected. This could help you avoid the broader querying, but this was an "off the top of my head" idea and it would need to be researched further. For the record, I don't think this has to be implemented at framework level right away. A wrapper around the executor could achieve the same thing, so that might be the better initial approach to test out the idea and see if it works for o11y first; then perhaps we can generalize it to framework level.
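To sketch what such a wrapper might look like (very much an assumption, not the framework's actual executor signature): `ExecutorResult`, `WrappedState` and `withMissingGroupDetection` are made-up names, and the inner executor is reduced to something that just reports the groups it detected.

```typescript
// Hypothetical shapes; the real alerting executor API is richer than this.
interface ExecutorResult {
  detectedGroups: string[];
}

interface WrappedState {
  previouslyDetected: string[];
}

interface WrappedResult extends ExecutorResult {
  missingGroups: string[]; // detected in an earlier run, absent in this one
  state: WrappedState; // to be persisted for the next run
}

function withMissingGroupDetection(executor: () => Promise<ExecutorResult>) {
  return async (state: WrappedState): Promise<WrappedResult> => {
    const result = await executor();
    const detected = new Set(result.detectedGroups);

    // "Active" in the alternate sense: known before, not detected now.
    const missingGroups = state.previouslyDetected.filter((g) => !detected.has(g));

    return {
      ...result,
      missingGroups,
      // Carry missing groups forward so they stay flagged until they report again.
      state: { previouslyDetected: [...result.detectedGroups, ...missingGroups] },
    };
  };
}
```

As written, a group that never comes back is remembered forever, so a real version would need some form of expiry or a way for users to dismiss groups.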

elasticmachine commented 10 months ago

Pinging @elastic/obs-ux-logs-team (Team:obs-ux-logs)

elastic / kibana

[Logs UI] [Alerting] Handle the "no data" case reliably #76511