elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Alerts] Performance benchmarks #40264

Closed pmuellr closed 3 years ago

pmuellr commented 5 years ago

Alerting system load is affected by a number of factors:

These can create a variety of load patterns. Under the hood, both alert checks and actions are handled with Task Manager which is backed by Elasticsearch, each of which will have throughput limits. As the system evolves we need a way to reproduce different types and sizes of load and observe the performance characteristics in different environments.

The objective of this issue is to build out such a tool. There are command line utilities already, like @pmuellr's repositories for kbn-actions and kbn-alerts, as well as alerting samples that we could build upon to make this easy to set up, run, and tear down.

I think ideally we'd have some way to control the variables above, and have a generic alert type that could take one or more Elasticsearch queries (in SQL or ES DSL) to control the load of the alert check.

Steps

To-Do:

To-Do kbn-alert-load:

Performance study:

Original description

Ran a stress test yesterday with an alert that always triggers an action. Created 1000 of them, interval 1s, action .server_log. Never crashed or anything, but ES was steady at > 100% the entire time. Kibana steady at < 10%. No noticeable memory growth. Ran for ~12 hours. Need to look into the ES perf ...

elasticmachine commented 5 years ago

Pinging @elastic/kibana-stack-services

pmuellr commented 5 years ago

see PR https://github.com/elastic/kibana/pull/40291

pmuellr commented 4 years ago

Since this issue was last updated, Kibana is now doing some perf/load testing themselves. We should probably build on what they've done.

For more info, see issue https://github.com/elastic/kibana/issues/73189#issuecomment-704922643

pmuellr commented 4 years ago

Some additional thoughts.

We should aim to be able to run a manually launched but otherwise automated set of tests on cloud that:

There are a ton of knobs and dials, but given the combinatorial explosion, we should start small :-)

I've lately been measuring the "throughput" of the alerting / actions tasks running, by looking at the actions:execute and alerting:execute event documents - counts via date histogram. This is a rough number telling us how many alerts/actions are running per time unit. It seems to provide a pretty reasonable number, based on experiments of adding/reducing Kibanas on cloud.
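As a rough illustration of that measurement, here is a minimal sketch of a query body that counts execute events per time bucket with a date histogram. The field names (`event.provider`, `event.action`, `@timestamp`), the provider values, and both helper functions are assumptions made for illustration and are not taken from this thread; the index to search and the client call are left out on purpose.

```python
def build_throughput_query(provider, since="now-1h", bucket="1m"):
    """Build a search body counting execute events per time bucket.

    `provider` is assumed to be "alerting" or "actions"; the field names
    below are assumptions about the event document layout.
    """
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"event.provider": provider}},
                    {"term": {"event.action": "execute"}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
        "aggs": {
            "executions_over_time": {
                "date_histogram": {
                    "field": "@timestamp",
                    "fixed_interval": bucket,
                }
            }
        },
    }


def executions_per_bucket(response):
    """Extract (bucket start, execution count) pairs from a search response."""
    buckets = response["aggregations"]["executions_over_time"]["buckets"]
    return [(b["key_as_string"], b["doc_count"]) for b in buckets]
```

Running the same query once per provider and comparing the per-bucket counts before and after adding or removing a Kibana instance gives the kind of rough throughput number described above.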

We should also figure out some stats to gauge the general "health" of ES and Kibana. Probably CPU and memory usage would be a decent start, and adding some more ES stats later would be good.

In the end, it would be nice to have a report comparing how these combinations change some of these metrics.

I've been using the index threshold alert, and feeding live data into the index it queries, to control whether actions will be running or not. Seems like a decent alert to test with. I've been using the server log action, which might actually have about the same latency as a "real" action (since most are HTTP calls to other cloud services), but perhaps working in a webhook call to some "interesting" and not spammy system would be more realistic.
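One way to picture the live data feed is the sketch below, which builds a newline-delimited body in the Elasticsearch bulk API format with values placed just above or just below a threshold. The index name, the `value` field, the "fires when above the threshold" condition, and both helper functions are hypothetical; sending the body to Elasticsearch is left out.

```python
import json
from datetime import datetime, timezone


def values_for(threshold, should_fire, count=5):
    """Return `count` values just above (fire) or just below (quiet) the threshold.

    Assumes the alert condition is "value is above threshold".
    """
    offset = 1 if should_fire else -1
    return [threshold + offset] * count


def make_bulk_body(index_name, values, now=None):
    """Build an NDJSON bulk body: one action line plus one document per value."""
    now = now or datetime.now(timezone.utc)
    lines = []
    for value in values:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps({"@timestamp": now.isoformat(), "value": value}))
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

Calling `make_bulk_body("alert-load-test", values_for(10, True))` on a timer would keep the alert condition met, so actions keep running; switching `should_fire` to `False` lets them stop.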

mikecote commented 3 years ago

I'm closing this issue now that we have the kbn-alert-load tool built to measure performance benchmarks.

Two follow-up issues have been created and will be prioritized separately: