elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Logs UI] Optimize grouped rule execution in the log threshold rule type #124130

Open weltenwort opened 2 years ago

weltenwort commented 2 years ago

:notebook: Summary

We want to optimize how the log threshold rule type executor queries and processes data so as not to block the Node.js event loop and to decrease memory usage.

part of #98010

:information_source: Background

The log threshold rule type executor currently handles four cases based on the rule params:

- ungrouped with a single "count" criterion
- ungrouped with a "ratio" of two criteria
- grouped with a single "count" criterion
- grouped with a "ratio" of two criteria

:bulb: Optimizations

The grouped cases have particularly high optimization potential due to two factors:

Perform more computation in Elasticsearch

In order to ensure the alert doesn't miss any groups due to approximate terms results, the grouping is performed using a composite aggregation. That aggregation currently has the limitation that it can't be post-processed as a whole using a sibling pipeline agg such as bucket_selector.

The individual pages can be processed that way, though. A bucket_selector inside a composite agg can remove buckets from a page that don't match certain criteria (such as having a doc count above or below a threshold). As a consequence, pages would be partly empty, leading to smaller response sizes, and the threshold computation would be performed by Elasticsearch. This advantage would be largest for the most complex case, the grouped ratio rule, where a script could calculate the ratio before the filtering is performed. That also means the numerator/denominator filters would probably need to be moved (or duplicated) beneath the composite agg.

If the bucket_selector script is written to take the threshold as params, its compilation result could even be cached between executions.
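To make the idea concrete, the request body could look roughly like the following, sketched as a TypeScript builder function. The field names (`host.name`, `log.level`), the helper name `buildGroupedRatioRequest`, and the page size are illustrative assumptions, not the rule executor's actual code:

```typescript
// Sketch of a grouped-ratio request where the ratio calculation and
// thresholding happen inside Elasticsearch. Bucket filter aggs supply the
// numerator/denominator per group, a bucket_script computes the ratio, and a
// bucket_selector drops non-matching buckets from each composite page.
interface RatioThreshold {
  comparator: '>' | '<';
  value: number;
}

function buildGroupedRatioRequest(groupByField: string, threshold: RatioThreshold) {
  return {
    size: 0,
    aggregations: {
      groups: {
        composite: {
          size: 2000,
          sources: [{ group: { terms: { field: groupByField } } }],
        },
        aggregations: {
          // Numerator/denominator criteria moved beneath the composite agg,
          // as discussed above (the actual criteria come from the rule params).
          numerator: { filter: { term: { 'log.level': 'error' } } },
          denominator: { filter: { term: { 'log.level': 'info' } } },
          ratio: {
            bucket_script: {
              buckets_path: { num: 'numerator>_count', den: 'denominator>_count' },
              script: 'params.den > 0 ? params.num / params.den : 0',
            },
          },
          // Passing the threshold via `params` keeps the script source stable,
          // so its compilation result can be cached between executions.
          threshold_filter: {
            bucket_selector: {
              buckets_path: { ratio: 'ratio' },
              script: {
                source: `params.ratio ${threshold.comparator} params.threshold`,
                params: { threshold: threshold.value },
              },
            },
          },
        },
      },
    },
  };
}
```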

Process results incrementally

Assuming the ratio calculation and filtering are performed in a bucket_selector script as described above, the pages of groups could be processed immediately as they come in. This would interleave IO operations with the (much reduced) computation, which would allow the event loop to preempt execution in favor of other workloads. Currently, the code has a high degree of code reuse due to its carefully crafted decomposition (:clap:), which would be a bit harder in an interleaved execution model. But it's probably still possible to come up with an adequate structure, even if it looks a bit different.
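The interleaved model could be sketched as an async generator that pages through the composite agg via its `after_key`. The `search` callback and the response shape here are assumptions for illustration, not the executor's actual interfaces:

```typescript
// Sketch: each `await` yields to the event loop while the next page is
// fetched, interleaving IO with the (much reduced) per-page computation.
interface CompositePage<TBucket> {
  buckets: TBucket[];
  after_key?: Record<string, unknown>;
}

async function* fetchGroupPages<TBucket>(
  search: (afterKey?: Record<string, unknown>) => Promise<CompositePage<TBucket>>
): AsyncGenerator<TBucket[]> {
  let afterKey: Record<string, unknown> | undefined;
  do {
    const page = await search(afterKey); // IO: event loop is free here
    if (page.buckets.length > 0) {
      yield page.buckets; // caller processes this page before the next fetch
    }
    afterKey = page.after_key;
  } while (afterKey !== undefined);
}
```

A consumer would then do `for await (const buckets of fetchGroupPages(search)) { /* build alerts */ }`, so each page is handled as soon as it arrives instead of accumulating all groups in memory first.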

:heavy_check_mark: Acceptance criteria

elasticmachine commented 2 years ago

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

weltenwort commented 2 years ago

@Kerry350 I'd like to hear your thoughts on this. Do you think it's feasible or did I miss anything?

Kerry350 commented 2 years ago

@weltenwort Thanks for breaking this down so thoroughly. This seems totally feasible to me using the methods you've stated.

Do you think we should make some before/after metrics (probably just execution time) part of the ACs? (It tends to be easier to grab those things as we develop.)

> which would be a bit harder in an interleaved execution model. But it's probably still possible to come up with an adequate structure, even if it looks a bit different.

I agree we'll end up with something that looks quite different, but we should be able to come up with something "easy to follow" again.

weltenwort commented 2 years ago

> Do you think we should make some before / after metrics (probably just execution time) part of the ACs?

That would be neat, but would require a pretty big test dataset. I'll try to see how difficult it would be to appropriate the synthtrace cli for that. Any thoughts on what characteristics that dataset needs to have?

Kerry350 commented 2 years ago

> but would require a pretty big test dataset. I'll try to see how difficult it would be to appropriate the synthtrace cli for that. Any thoughts on what characteristics that dataset needs to have?

Yeah, fair point. I know Chris did some good work to produce high cardinality datasets for metrics, but I appreciate that might not help here.

For characteristics, I'd say we need one or more fields to represent the high-cardinality nature, and it would also be handy to have some fields with varied types (text, keyword, etc.) so that we can test against different comparators (which have the largest influence on the eventual query).

weltenwort commented 2 years ago

After looking at the synthtrace architecture and discussing it with @miltonhultgren I'd say it should be pretty feasible to add support for generating log entries and setting up the correct mappings. It's not a quick thing, though, so I'd rather not make it a dependency of this issue.

I'll try to come up with a simple bash script or so to use as a fallback in case this is prioritized higher than the synthtrace improvement.
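As a rough illustration of the dataset characteristics discussed above (a high-cardinality grouping field plus fields of varied types), a generator sketch might look like this; all field names, counts, and distributions here are made up for illustration:

```typescript
// Sketch: generate `count` synthetic log documents spread across `cardinality`
// distinct host names (keyword), with a keyword level, a text message, and a
// numeric duration, so different comparators can be exercised.
function generateLogDocs(count: number, cardinality: number) {
  return Array.from({ length: count }, (_, i) => ({
    '@timestamp': new Date(Date.now() - i * 1000).toISOString(),
    'host.name': `host-${i % cardinality}`, // high-cardinality keyword field
    'log.level': i % 5 === 0 ? 'error' : 'info', // keyword
    message: `synthetic log line ${i}`, // text
    'event.duration': (i % 100) * 1_000_000, // numeric (nanoseconds)
  }));
}
```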

jasonrhodes commented 2 years ago

Sorry for the delay in responding but this sounds like an excellent plan of action to me.

elasticmachine commented 11 months ago

Pinging @elastic/obs-ux-logs-team (Team:obs-ux-logs)