[Alerts] Performance benchmarks

pmuellr commented 5 years ago

Alerting system load is affected by a number of factors:

number of alerts
frequency of alert checks
the load the alert check places on Elasticsearch (number of queries and query duration)
concurrent alert checks (spiky vs level load)
number of actions fired per alert

These can create a variety of load patterns. Under the hood, both alert checks and actions are handled by Task Manager, which is backed by Elasticsearch; each has its own throughput limits. As the system evolves, we need a way to reproduce different types and sizes of load and observe the performance characteristics in different environments.

The objective of this issue is to build out such a tool. There are already command line utilities, like @pmuellr's repositories for kbn-actions and kbn-alerts, as well as alerting samples, that we could build upon to make this easy to set up, run, and tear down.

I think ideally we'd have some way to control the variables above, and a generic alert type that could take one or more Elasticsearch queries (in SQL or ES DSL) to control the load of the alert check.

Steps

To-Do:

[x] Increase the max worker limit for cloud users (to something like 50, currently 20)

To-Do kbn-alert-load:

[x] Get execution failures in the report https://github.com/pmuellr/kbn-alert-load/pull/4
[x] Add support for ingestion (configurable ingestion rate) [done with Logstash]
[x] Get task manager stats in the report
[x] ~~Support automatic deployment sizing conversions~~ (https://github.com/elastic/kibana/issues/88388)
[x] ~~Move the tool into Kibana~~ (https://github.com/elastic/kibana/issues/88389)

Performance study:

[x] Alerts benchmarking
[x] Alerts vs actions benchmarking
[x] Alerts vs ingestion benchmarking

Original description

Ran a stress test yesterday with an alert that always triggers an action. Created 1000 of them, interval 1s, action .server_log. Never crashed or anything, but ES was steady at > 100% the entire time. Kibana steady at < 10%. No noticeable memory growth. Ran for ~12 hours. Need to look into the ES perf ...
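For a rough sense of scale, the arithmetic behind a test like this can be sketched as below. This is only a back-of-the-envelope model, not something taken from the test itself: the names `LoadScenario` and `capacity_tasks_per_second` are made up for illustration, the max workers and poll interval values are assumed, and the capacity formula assumes each Kibana claims at most `max_workers` tasks per poll cycle and finishes them within that cycle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadScenario:
    """Hypothetical description of an alerting load test."""
    alerts: int              # number of alerts created
    interval_s: float        # how often each alert is checked, in seconds
    actions_per_alert: int   # actions fired per alert check

    def demanded_tasks_per_second(self) -> float:
        # Each alert check is one task; each fired action is one more task.
        alert_runs_per_second = self.alerts / self.interval_s
        return alert_runs_per_second * (1 + self.actions_per_alert)


def capacity_tasks_per_second(kibanas: int, max_workers: int,
                              poll_interval_s: float) -> float:
    # Simplified upper bound: every poll cycle, each Kibana claims at most
    # max_workers tasks and completes them before the next cycle.
    return kibanas * max_workers / poll_interval_s


# 1000 alerts, 1s interval, one action per check (as in the test above).
scenario = LoadScenario(alerts=1000, interval_s=1.0, actions_per_alert=1)
demand = scenario.demanded_tasks_per_second()

# Assumed values for illustration only: 1 Kibana, 10 workers, 3s poll.
capacity = capacity_tasks_per_second(kibanas=1, max_workers=10,
                                     poll_interval_s=3.0)

print(f"demand:   {demand:.1f} tasks/s")
print(f"capacity: {capacity:.1f} tasks/s")
print(f"ratio:    {demand / capacity:.0f}x")
```

Under these assumptions the requested schedule is far beyond what a single Kibana could execute, so the alerts would run less often than their configured interval rather than fail outright.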

elasticmachine commented 5 years ago

Pinging @elastic/kibana-stack-services

pmuellr commented 5 years ago

see PR https://github.com/elastic/kibana/pull/40291

pmuellr commented 4 years ago

Since this issue was last updated, the Kibana team has started doing some perf/load testing themselves. We should probably build on what they've done.

For more info, see issue https://github.com/elastic/kibana/issues/73189#issuecomment-704922643

pmuellr commented 4 years ago

Some additional thoughts.

We should aim to be able to run a manually launched but otherwise automated set of tests on cloud that can:

either spin up a new cloud instance, or point to an existing one
change task manager poll interval / max workers (when they become configurable)
change # of Kibana instances and ES instances, and the RAM associated with them
change the number of alerts, and how many instances are generated from them

There are a ton of knobs and dials, but given the combinatorial explosion, we should start small :-)
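To illustrate the combinatorial explosion, here is a small sketch comparing a full grid of settings with a "start small" plan that varies one knob at a time from a baseline. The knob names and values are hypothetical placeholders, not real Kibana settings or proposed test values.

```python
from itertools import product

# Hypothetical knobs and candidate values; the first value of each list
# is treated as the baseline.
KNOBS = {
    "tm_poll_interval_ms": [3000, 1000],
    "tm_max_workers": [10, 20, 50],
    "kibana_instances": [1, 2, 4],
    "es_instances": [1, 3],
    "alerts": [100, 1000, 4000],
    "instances_per_alert": [1, 10],
}


def full_grid(knobs):
    """Every combination of every knob value."""
    names = list(knobs)
    return [dict(zip(names, combo)) for combo in product(*knobs.values())]


def one_at_a_time(knobs):
    """The baseline run, plus runs that change a single knob from baseline."""
    baseline = {name: values[0] for name, values in knobs.items()}
    runs = [baseline]
    for name, values in knobs.items():
        for value in values[1:]:
            runs.append({**baseline, name: value})
    return runs


print(len(full_grid(KNOBS)), "runs for the full grid")
print(len(one_at_a_time(KNOBS)), "runs varying one knob at a time")
```

With these example values the full grid needs 216 runs, while the one-at-a-time plan needs 10, at the cost of not seeing interactions between knobs.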

I've lately been measuring the "throughput" of the alerting / actions tasks running, by looking at the actions:execute and alerting:execute event documents - counts via date histogram. This is a rough number telling us how many alerts/actions are running per time unit. It seems to provide a pretty reasonable number, based on experiments of adding/reducing Kibanas on cloud.
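A minimal sketch of that kind of measurement, written as a plain Elasticsearch query body plus a helper that reads the bucket counts. The field names `event.provider` and `event.action` are my assumption about how the actions:execute and alerting:execute documents are stored, and the function names and sample response are made up for illustration; sending the request to the event log index is left out.

```python
def throughput_query(provider, since="now-1h", bucket="1m"):
    """Build a date histogram query counting execute events per time bucket.

    provider is assumed to be "alerting" or "actions".
    """
    return {
        "size": 0,  # only the aggregation is needed, not the documents
        "query": {
            "bool": {
                "filter": [
                    {"term": {"event.provider": provider}},
                    {"term": {"event.action": "execute"}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
        "aggs": {
            "per_bucket": {
                "date_histogram": {
                    "field": "@timestamp",
                    "fixed_interval": bucket,
                }
            }
        },
    }


def counts_per_bucket(response):
    """Extract (bucket start, execution count) pairs from a search response."""
    buckets = response["aggregations"]["per_bucket"]["buckets"]
    return [(b["key_as_string"], b["doc_count"]) for b in buckets]


# Made-up response fragment, shaped like a date_histogram result.
sample_response = {
    "aggregations": {
        "per_bucket": {
            "buckets": [
                {"key_as_string": "2020-10-07T12:00:00.000Z", "doc_count": 180},
                {"key_as_string": "2020-10-07T12:01:00.000Z", "doc_count": 175},
            ]
        }
    }
}

for start, count in counts_per_bucket(sample_response):
    print(start, count, "executions/min")
```

Running the same query once per provider gives comparable per-minute counts for alerts and actions.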

We should also figure out some stats to gauge the general "health" of ES and Kibana. Probably CPU and memory usage would be a decent start, and adding some more ES stats later would be good.

In the end, it would be nice to have a report comparing how these combinations change some of these metrics.

I've been using the index threshold alert, and feeding live data into the index it queries, to control whether actions will be running or not. Seems like a decent alert to test with. I've been using the server log action, which might actually have about the same latency as a "real" action (since most are HTTP calls to other cloud services), but perhaps working in a webhook call to some "interesting" and not spammy system would be more realistic.

mikecote commented 3 years ago

I'm closing this issue now that we have the kbn-alert-load tool built to measure performance benchmarks.

There are two follow up issues created that will be prioritized separately:

Automatic deployment sizing conversion in kbn-alert-load tool #88388
Move kbn-alert-load tool into Kibana alerting #88389

elastic / kibana

[Alerts] Performance benchmarks #40264
