[Security Solution] Detection Engine health API

banderror commented 2 years ago

Summary

Kibana Task Manager provides an api/task_manager/_health endpoint (doc 1, doc 2), which is very useful for troubleshooting performance and scaling issues with Security rules.

However, we could provide much more observability into the specifics of the Detection Engine and Security rule execution, which would help us troubleshoot issues with rule execution, cluster scaling, etc. The idea is to implement a Security-specific Detection Engine health API.

In the future, this API might also help us build more Rule Monitoring UIs that give our users more clarity and transparency about the work of the Detection Engine.

API requirements/ideas

It would be great to have an API that could provide a way to see how different "slices" or "scopes" of rules perform, for example:

Health overview of the whole cluster
- Scope: all detection rules in all Kibana spaces, i.e. the whole cluster
Health overview of a space
- Scope: all detection rules in a given Kibana space
Health overview of a rule type
- Scope: detection rules of a given type in a given Kibana space
Health overview of a rule
- Scope: a given rule in a given Kibana space
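
To make the four scopes concrete, here is a minimal sketch of how they could be modeled as a discriminated union. All names here (`HealthScope`, `RuleRef`, `isInScope`) are hypothetical and are not taken from the Kibana codebase; the sketch only illustrates how each scope narrows the set of rules.

```typescript
// Hypothetical types for illustration only; not the actual Kibana API.
type HealthScope =
  | { type: 'cluster' }
  | { type: 'space'; spaceId: string }
  | { type: 'rule_type'; spaceId: string; ruleType: string }
  | { type: 'rule'; spaceId: string; ruleId: string };

// Minimal rule descriptor assumed for this sketch.
interface RuleRef {
  id: string;
  spaceId: string;
  ruleType: string;
}

// Returns true if the rule belongs to the given scope.
function isInScope(rule: RuleRef, scope: HealthScope): boolean {
  switch (scope.type) {
    case 'cluster':
      // All rules in all spaces.
      return true;
    case 'space':
      return rule.spaceId === scope.spaceId;
    case 'rule_type':
      return rule.spaceId === scope.spaceId && rule.ruleType === scope.ruleType;
    case 'rule':
      return rule.spaceId === scope.spaceId && rule.id === scope.ruleId;
  }
}
```

Each narrower scope is a strict filter on top of the wider one, so the same set of health calculations could in principle be reused across all four.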

For each scope, we could calculate and return a lot of info representing the current ("now") health of detection rules. In addition to that, we could accept some time-based parameters to calculate how health has changed over time:

Date range: last hour, day, week, month, year
Granularity: minute, hour, day, week, month
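
Not every combination of date range and granularity makes sense (e.g. a year at minute granularity would produce hundreds of thousands of histogram buckets). A rough sketch of how the parameters could be validated is below. The type names, the `maxBuckets` limit of 1000, and the approximate durations for month and year are all assumptions made for illustration.

```typescript
// Hypothetical parameter types; the values mirror the lists above.
type DateRange = 'last_hour' | 'last_day' | 'last_week' | 'last_month' | 'last_year';
type Granularity = 'minute' | 'hour' | 'day' | 'week' | 'month';

const MINUTE = 60 * 1000;
const HOUR = 60 * MINUTE;
const DAY = 24 * HOUR;

// Approximate durations: a month is treated as 30 days, a year as 365 days.
const RANGE_MS: Record<DateRange, number> = {
  last_hour: HOUR,
  last_day: DAY,
  last_week: 7 * DAY,
  last_month: 30 * DAY,
  last_year: 365 * DAY,
};

const GRANULARITY_MS: Record<Granularity, number> = {
  minute: MINUTE,
  hour: HOUR,
  day: DAY,
  week: 7 * DAY,
  month: 30 * DAY,
};

// Returns the number of histogram buckets, or throws if the combination
// is unusable. The default maxBuckets value is an arbitrary assumption.
function validateInterval(range: DateRange, granularity: Granularity, maxBuckets = 1000): number {
  if (GRANULARITY_MS[granularity] > RANGE_MS[range]) {
    throw new Error(`Granularity "${granularity}" is larger than date range "${range}"`);
  }
  const buckets = Math.ceil(RANGE_MS[range] / GRANULARITY_MS[granularity]);
  if (buckets > maxBuckets) {
    throw new Error(`Too many buckets: ${buckets} > ${maxBuckets}`);
  }
  return buckets;
}
```

For example, `validateInterval('last_day', 'hour')` yields 24 buckets, while `validateInterval('last_year', 'minute')` is rejected.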

Some ideas for what we could return from the API (each idea can apply to multiple scopes above):

current health stats at the moment of the API call:
- number of Kibana instances
- number of Kibana spaces
- number of all rules
- number of enabled and disabled rules
- number of prebuilt and custom rules + how many of them are enabled and disabled
- number of rules of each type + how many of them are enabled and disabled
- number of rules with exceptions
- number of rules with notification actions
- number of rules with legacy notification actions
- number of rules with response actions
- number of rules by last execution status (succeeded, partial failure, failed, no status)
- top X last failed statuses (messages) + rule ids for each status
- top X last partial failure statuses (messages) + rule ids for each status
- top X slowest rules by a few metrics (last total execution time, last search time, last indexing time)
- top X rules with the largest scheduling delay (drift)
- all rules that are querying indices with future timestamps + the actual index names with future timestamps in them (the API would need to check every rule's index patterns and data views)
- top X rules by number of shards queried/shards queried in a particular data tier
health stats over a specified period of time:
- aggregated rule execution metrics, e.g. number of executions, total execution time, query time, indexing time, scheduling delay (drift), detected gaps, etc.
- for most metrics: a set of percentiles that would be helpful in most cases, e.g. p01, p05, p25, p50, p75, p95, p99
- change of rule execution statuses over time: a histogram for each status
- change of rule execution metrics over time: a histogram for each percentile of each metric
- top X rules by each execution metric
- top X rules that are consuming the most total execution time - summing execution time over the executions for that rule, so it accounts for rules that are running more often
- top X rules by final execution status (rules failing most often or ending up with a partial failure)
- top X rules by logged messages of different log levels (errors and warnings are most interesting)
- top X errors, top X warnings
- top X error codes (we don't have error codes in our logs yet)
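
As an illustration of the "top X rules by total execution time" idea from the list above, here is a minimal in-memory sketch. The `RuleExecution` shape and the function name are hypothetical. A real implementation would more likely run an aggregation in Elasticsearch than load executions into memory, so this only shows the computation itself.

```typescript
// Hypothetical shape of a single rule execution record.
interface RuleExecution {
  ruleId: string;
  totalDurationMs: number;
}

interface RuleTotal {
  ruleId: string;
  executions: number;
  totalDurationMs: number;
}

// Sums execution time per rule over all of its executions, then returns
// the topX rules with the largest totals, in descending order.
function topRulesByTotalExecutionTime(executions: RuleExecution[], topX: number): RuleTotal[] {
  const totals = new Map<string, RuleTotal>();
  for (const execution of executions) {
    const current = totals.get(execution.ruleId) ?? {
      ruleId: execution.ruleId,
      executions: 0,
      totalDurationMs: 0,
    };
    current.executions += 1;
    current.totalDurationMs += execution.totalDurationMs;
    totals.set(execution.ruleId, current);
  }
  return Array.from(totals.values())
    .sort((a, b) => b.totalDurationMs - a.totalDurationMs)
    .slice(0, topX);
}
```

Because the duration is summed rather than averaged, a fast rule that runs every minute can rank above a slow rule that runs once a day, which is the point of this metric.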

It feels like this API should be composed of multiple endpoints.

To do

[x] Work on a PoC and implement an internal Detection Engine health API. (PR)
[x] Write developer documentation for how to use it. (PR, PR)
[x] Add support for it to the support-diagnostics tool so its output could be available in the diagnostic dumps. (PR, PR)
[x] Implement the cluster health endpoint. (PR)
[ ] Implement an API that checks every rule's index patterns and returns the rules that are querying indices with future timestamps, in addition to the actual index names with future timestamps in them.
[ ] Implement an API that identifies the rules that are consuming the most total execution time - summing execution time over the executions for that rule, so it accounts for rules that are running more often.
[ ] Calculate more metrics for the rule health endpoint.
[ ] Calculate more metrics for the space health endpoint.
[ ] Address // TODO: https://github.com/elastic/kibana/issues/125642 comments.
[ ] Write a test plan and cover the API with integration and unit tests.
[ ] At some point, consider making the API public and writing user-facing documentation for how to use it.

Some ideas worth discussing and planning, maybe as separate epics:

Telemetry: sending aggregated health metrics (e.g. calculated by the cluster health endpoint) to telemetry clusters. That's how we could measure production rule health and performance across user clusters. Devon Kerr: "Theoretically, it could also give you implicit insights: if we could cross-reference by cluster configuration, we could also notify users who under-resourced their clusters based on rule performance."
Adding support to the support-diagnostics tool for executing aggregation queries directly against Elasticsearch, not via the Detection Engine health API, for stack versions lower than X where the health API endpoints were added.
Writing dev docs on Elasticsearch queries useful for troubleshooting the health and performance of detection rules.

elasticmachine commented 2 years ago

Pinging @elastic/security-detections-response (Team:Detections and Resp)

elasticmachine commented 2 years ago

Pinging @elastic/security-solution (Team: SecuritySolution)

peluja1012 commented 1 year ago

Hey @banderror, @marshallmain suggested prioritizing the following, in light of recent SDHs.

An API that checks every rule's index patterns and returns the rules that are querying indices with future timestamps, in addition to the actual index names with future timestamps in them
An API that identifies the rules that are consuming the most total execution time - summing execution time over the executions for that rule, so it accounts for rules that are running more often

banderror commented 1 year ago

These are great suggestions, thanks @peluja1012 and @marshallmain! I added them to the description.

marshallmain commented 1 year ago

Added one more idea: top X rules by number of shards queried/shards queried in a particular data tier for identifying potential problem rules/indices
