elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Logs UI] Optimize grouped rule execution in the log threshold rule type #124130

Open weltenwort opened 2 years ago

weltenwort commented 2 years ago

:notebook: Summary

We want to optimize how the log threshold rule type executor queries and processes data so as not to block the Node.js event loop and to decrease memory usage.

part of #98010

:information_source: Background

The log threshold rule type executor currently handles four cases based on the rule params:

- ungrouped with a single "count" criterion
- ungrouped with a "ratio" of two criteria
- grouped with a single "count" criterion
- grouped with a "ratio" of two criteria

:bulb: Optimizations

The grouped cases have particularly high optimization potential due to two factors:

Perform more computation in Elasticsearch

In order to ensure the alert doesn't miss any groups due to approximate terms results, the grouping is performed using a composite aggregation. That aggregation currently has the limitation that it can't be post-processed as a whole using a sibling pipeline agg such as bucket_selector.

The individual pages can be processed that way, though. A bucket_selector inside a composite agg can remove buckets from a page that don't match certain criteria (such as having a doc count above or below a threshold). As a consequence, pages would be partly empty, leading to smaller response sizes, and the threshold computation would be performed by Elasticsearch. This advantage would be largest for the most complex case, the grouped ratio rule, where a script could calculate the ratio before the filtering is performed. That also means the numerator/denominator filters would probably need to be moved (or duplicated) beneath the composite agg.

If the bucket_selector script is written to take the threshold as params, its compilation result could even be cached between executions.
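To make the idea concrete, the request body could look roughly like the following, sketched as a TypeScript builder function. The field names (`host.name`, `log.level`), the helper name `buildGroupedRatioRequest`, and the page size are illustrative assumptions, not the rule executor's actual code:

```typescript
// Sketch of a grouped-ratio request where the ratio calculation and
// thresholding happen inside Elasticsearch. Bucket filter aggs supply the
// numerator/denominator per group, a bucket_script computes the ratio, and a
// bucket_selector drops non-matching buckets from each composite page.
interface RatioThreshold {
  comparator: '>' | '<';
  value: number;
}

function buildGroupedRatioRequest(groupByField: string, threshold: RatioThreshold) {
  return {
    size: 0,
    aggregations: {
      groups: {
        composite: {
          size: 2000,
          sources: [{ group: { terms: { field: groupByField } } }],
        },
        aggregations: {
          // Numerator/denominator criteria moved beneath the composite agg,
          // as discussed above (the actual criteria come from the rule params).
          numerator: { filter: { term: { 'log.level': 'error' } } },
          denominator: { filter: { term: { 'log.level': 'info' } } },
          ratio: {
            bucket_script: {
              buckets_path: { num: 'numerator>_count', den: 'denominator>_count' },
              script: 'params.den > 0 ? params.num / params.den : 0',
            },
          },
          // Passing the threshold via `params` keeps the script source stable,
          // so its compilation result can be cached between executions.
          threshold_filter: {
            bucket_selector: {
              buckets_path: { ratio: 'ratio' },
              script: {
                source: `params.ratio ${threshold.comparator} params.threshold`,
                params: { threshold: threshold.value },
              },
            },
          },
        },
      },
    },
  };
}
```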

Process results incrementally

Assuming the ratio calculation and filtering are performed in a bucket_selector script as described above, the pages of groups could be processed immediately as they come in. This would interleave IO operations with the (much reduced) computation, which would allow the event loop to preempt execution in favor of other workloads. Currently, the code has a high degree of code reuse due to its carefully crafted decomposition (:clap:), which would be a bit harder in an interleaved execution model. But it's probably still possible to come up with an adequate structure, even if it looks a bit different.
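The interleaved model could be sketched as an async generator that pages through the composite agg via its `after_key`. The `search` callback and the response shape here are assumptions for illustration, not the executor's actual interfaces:

```typescript
// Sketch: each `await` yields to the event loop while the next page is
// fetched, interleaving IO with the (much reduced) per-page computation.
interface CompositePage<TBucket> {
  buckets: TBucket[];
  after_key?: Record<string, unknown>;
}

async function* fetchGroupPages<TBucket>(
  search: (afterKey?: Record<string, unknown>) => Promise<CompositePage<TBucket>>
): AsyncGenerator<TBucket[]> {
  let afterKey: Record<string, unknown> | undefined;
  do {
    const page = await search(afterKey); // IO: event loop is free here
    if (page.buckets.length > 0) {
      yield page.buckets; // caller processes this page before the next fetch
    }
    afterKey = page.after_key;
  } while (afterKey !== undefined);
}
```

A consumer would then do `for await (const buckets of fetchGroupPages(search)) { /* build alerts */ }`, so each page is handled as soon as it arrives instead of accumulating all groups in memory first.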

:heavy_check_mark: Acceptance criteria

elasticmachine commented 2 years ago

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

weltenwort commented 2 years ago

@Kerry350 I'd like to hear your thoughts on this. Do you think it's feasible or did I miss anything?

Kerry350 commented 2 years ago

@weltenwort Thanks for breaking this down so thoroughly. This seems totally feasible to me using the methods you've stated.

Do you think we should make some before/after metrics (probably just execution time) part of the ACs? (It tends to be easier to grab those things as we develop.)

> which would be a bit harder in an interleaved execution model. But it's probably still possible to come up with an adequate structure, even if it looks a bit different.

I agree we'll end up with something that looks quite different, but we should be able to come up with something "easy to follow" again.

weltenwort commented 2 years ago

> Do you think we should make some before / after metrics (probably just execution time) part of the ACs?

That would be neat, but would require a pretty big test dataset. I'll try to see how difficult it would be to appropriate the synthtrace cli for that. Any thoughts on what characteristics that dataset needs to have?

Kerry350 commented 2 years ago

> but would require a pretty big test dataset. I'll try to see how difficult it would be to appropriate the synthtrace cli for that. Any thoughts on what characteristics that dataset needs to have?

Yeah, fair point. I know Chris did some good work to produce high cardinality datasets for metrics, but I appreciate that might not help here.

For characteristics, I'd say we need one or more fields to represent the high-cardinality nature, and it would also be handy to have some fields with varied types (text, keyword, etc.) so that we can test against different comparators (which have the largest influence on the eventual query).

weltenwort commented 2 years ago

After looking at the synthtrace architecture and discussing it with @miltonhultgren I'd say it should be pretty feasible to add support for generating log entries and setting up the correct mappings. It's not a quick thing, though, so I'd rather not make it a dependency of this issue.

I'll try to come up with a simple bash script or so to use as a fallback in case this is prioritized higher than the synthtrace improvement.
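As a rough illustration of the dataset characteristics discussed above (a high-cardinality grouping field plus fields of varied types), a generator sketch might look like this; all field names, counts, and distributions here are made up for illustration:

```typescript
// Sketch: generate `count` synthetic log documents spread across `cardinality`
// distinct host names (keyword), with a keyword level, a text message, and a
// numeric duration, so different comparators can be exercised.
function generateLogDocs(count: number, cardinality: number) {
  return Array.from({ length: count }, (_, i) => ({
    '@timestamp': new Date(Date.now() - i * 1000).toISOString(),
    'host.name': `host-${i % cardinality}`, // high-cardinality keyword field
    'log.level': i % 5 === 0 ? 'error' : 'info', // keyword
    message: `synthetic log line ${i}`, // text
    'event.duration': (i % 100) * 1_000_000, // numeric (nanoseconds)
  }));
}
```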

jasonrhodes commented 2 years ago

Sorry for the delay in responding but this sounds like an excellent plan of action to me.

elasticmachine commented 11 months ago

Pinging @elastic/obs-ux-logs-team (Team:obs-ux-logs)