elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Security Solution] Detection Engine health API #125642

Open banderror opened 2 years ago

banderror commented 2 years ago

Summary

Kibana Task Manager provides an api/task_manager/_health endpoint (doc 1, doc 2) which is very useful for troubleshooting performance and scaling issues with Security rules.

However, we could provide much more observability into the specifics of the Detection Engine and Security rule execution, which would help us troubleshoot issues with rule execution, cluster scaling, etc. The idea is to implement a Security-specific Detection Engine health API.

In the future, this API could also help us build more Rule Monitoring UIs that give our users more clarity and transparency into how the Detection Engine works.

API requirements/ideas

It would be great to have an API that could provide a way to see how different "slices" or "scopes" of rules perform, for example:

For each scope, we could calculate and return a lot of info representing the current ("now") health of detection rules. In addition to that, we could specify some time-based parameters to calculate how health was changing over time:

Some ideas for what we could return from the API (each idea can apply to multiple scopes above):

It feels like this API should be composed of multiple endpoints.

To do

Some ideas worth discussing and planning, maybe as separate epics:

elasticmachine commented 2 years ago

Pinging @elastic/security-detections-response (Team:Detections and Resp)

elasticmachine commented 2 years ago

Pinging @elastic/security-solution (Team: SecuritySolution)

peluja1012 commented 1 year ago

Hey @banderror, @marshallmain suggested prioritizing the following, in light of recent SDHs.

  1. An API that checks all rules' index patterns and returns the rules that query indices containing future timestamps, along with the names of the indices that contain them
  2. An API that identifies the rules that are consuming the most total execution time - summing execution time over the executions for that rule, so it accounts for rules that are running more often
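
As a rough illustration of the second suggestion, here is a minimal sketch of summing execution time per rule and ranking by the total. The `RuleExecution` shape, its field names, and the function name are assumptions made for this example; they are not taken from Kibana's event log schema or any existing API.

```typescript
// Hypothetical record for a single rule execution (field names are assumptions).
interface RuleExecution {
  ruleId: string;
  durationMs: number;
}

// Aggregated result per rule.
interface RuleTotal {
  ruleId: string;
  totalDurationMs: number;
  executionCount: number;
}

// Sums execution time over all executions of each rule, then returns the
// `limit` rules with the highest total. A rule that runs often with a short
// duration can therefore outrank a rule that runs rarely with a long one.
function topRulesByTotalExecutionTime(
  executions: RuleExecution[],
  limit: number
): RuleTotal[] {
  const totals = new Map<string, RuleTotal>();
  for (const { ruleId, durationMs } of executions) {
    const entry =
      totals.get(ruleId) ?? { ruleId, totalDurationMs: 0, executionCount: 0 };
    entry.totalDurationMs += durationMs;
    entry.executionCount += 1;
    totals.set(ruleId, entry);
  }
  return Array.from(totals.values())
    .sort((a, b) => b.totalDurationMs - a.totalDurationMs)
    .slice(0, limit);
}
```

For example, a rule that ran five times at 100 ms each (500 ms total) would rank above a rule that ran once for 400 ms, which is the "accounts for rules that are running more often" behavior described above.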

banderror commented 1 year ago

These are great suggestions, thanks @peluja1012 and @marshallmain! I added them to the description.

marshallmain commented 1 year ago

Added one more idea: top X rules by number of shards queried, or by number of shards queried in a particular data tier, to identify potentially problematic rules or indices.
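
A minimal sketch of that ranking, assuming per-rule shard counts broken down by data tier are already available. The `RuleShardStats` shape, the tier names, and the function names are hypothetical and only illustrate the idea; how these counts would actually be collected is not covered here.

```typescript
// Hypothetical per-rule stats: shards queried, keyed by data tier name.
interface RuleShardStats {
  ruleId: string;
  shardsByTier: { [tier: string]: number };
}

// Shards queried by one rule, either in a single tier or across all tiers.
function shardsQueried(stats: RuleShardStats, tier?: string): number {
  if (tier !== undefined) {
    return stats.shardsByTier[tier] ?? 0;
  }
  return Object.keys(stats.shardsByTier).reduce(
    (sum, t) => sum + stats.shardsByTier[t],
    0
  );
}

// Returns the `limit` rules querying the most shards, optionally restricted
// to one data tier. The input array is copied so it is not mutated.
function topRulesByShardsQueried(
  stats: RuleShardStats[],
  limit: number,
  tier?: string
): RuleShardStats[] {
  return stats
    .slice()
    .sort((a, b) => shardsQueried(b, tier) - shardsQueried(a, tier))
    .slice(0, limit);
}
```

Filtering by tier matters because a rule can look unremarkable in total shard count while still hitting many shards in a slow tier.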