Open banderror opened 2 years ago
Pinging @elastic/security-detections-response (Team:Detections and Resp)
Pinging @elastic/security-solution (Team: SecuritySolution)
Hey @banderror, @marshallmain suggested prioritizing the following, in light of recent SDHs.
These are great suggestions, thanks @peluja1012 and @marshallmain! I added them to the description.
Added one more idea:
Top X rules by number of shards queried (or shards queried in a particular data tier), for identifying potential problem rules/indices.
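A minimal sketch of how such a "top X" ranking might be computed, assuming per-rule shard stats have already been collected somewhere. The `RuleShardStats` shape, its field names, and `topRulesByShards` are hypothetical and do not correspond to an existing Kibana type or function.

```typescript
// Hypothetical per-rule stats; all field names are illustrative assumptions.
interface RuleShardStats {
  ruleId: string;
  // Total shards queried by the rule over the observed period.
  shardsQueried: number;
  // Shards queried per data tier, e.g. { hot: 12, frozen: 40 }.
  shardsByTier: Record<string, number>;
}

// Returns the top `x` rules by shards queried, optionally restricted to one data tier.
function topRulesByShards(
  stats: RuleShardStats[],
  x: number,
  tier?: string
): Array<{ ruleId: string; shards: number }> {
  return stats
    .map((s) => ({
      ruleId: s.ruleId,
      // Without a tier, rank by the total; with a tier, rank by that tier only.
      shards: tier === undefined ? s.shardsQueried : s.shardsByTier[tier] ?? 0,
    }))
    .sort((a, b) => b.shards - a.shards)
    .slice(0, Math.max(0, x));
}
```

Calling `topRulesByShards(stats, 10, "frozen")` would then surface the rules that hit the frozen tier hardest, which is the kind of "problem rule" signal the idea above is after.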
Summary
Kibana Task Manager provides an `api/task_manager/_health` endpoint (doc 1, doc 2), which is very useful for troubleshooting performance and scaling issues with Security rules.
However, we could provide much more observability into the specifics of the Detection Engine and Security rule execution, which would help us troubleshoot issues with rule execution, cluster scaling, etc. The idea is to implement a Security-specific Detection Engine health API.
In the future, this API could also help us build more Rule Monitoring UIs that give our users more clarity and transparency about the work of the Detection Engine.
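For context, here is a rough sketch of how a response from the existing Task Manager health endpoint might be inspected today. The types cover only an assumed subset of the response (an overall `status` plus per-section statuses under `stats`), and the status values are assumptions to check against the Kibana docs. `degradedSections` is a hypothetical helper, not part of Kibana.

```typescript
// Assumed subset of the Task Manager health response; verify against the Kibana docs.
type HealthStatus = "OK" | "warn" | "error";

interface HealthSection {
  status: HealthStatus;
  timestamp: string;
}

interface TaskManagerHealth {
  status: HealthStatus;
  // Sections keyed by name, e.g. "workload" or "runtime" (names assumed).
  stats: Record<string, HealthSection>;
}

// Lists the names of the sections whose status is not "OK", sorted alphabetically.
function degradedSections(health: TaskManagerHealth): string[] {
  return Object.entries(health.stats)
    .filter(([, section]) => section.status !== "OK")
    .map(([name]) => name)
    .sort();
}
```

This only tells us that Task Manager as a whole is degraded. It says nothing about which Security rules are responsible, which is the gap a Detection Engine health API would fill.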
API requirements/ideas
It would be great to have an API that could provide a way to see how different "slices" or "scopes" of rules perform, for example:
For each scope, we could calculate and return a lot of info representing the current ("now") health of detection rules. In addition to that, we could specify some time-based parameters to calculate how health was changing over time:
Some ideas for what we could return from the API (each idea can apply to multiple scopes above):
It feels like this API should be composed of multiple endpoints.
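As a rough illustration of "one endpoint per scope, with optional time-based parameters", the request side might look something like the sketch below. Everything in it is hypothetical: the scope names, the paths, the `interval` and `rule_id` parameters, and the `buildHealthRequest` helper are assumptions made for discussion, not an agreed design.

```typescript
// All names below are hypothetical; this issue does not define scopes, paths, or parameters.
type HealthScope =
  | { type: "cluster" }
  | { type: "space"; spaceId: string }
  | { type: "rule"; ruleId: string };

// Optional time-based parameters for calculating how health changed over time.
interface HealthInterval {
  from: string; // e.g. "now-1d" or an ISO timestamp
  to: string;
  granularity: "minute" | "hour" | "day";
}

interface HealthRequest {
  path: string;
  body: Record<string, unknown>;
}

// Maps a scope (plus an optional interval) to a request for the matching endpoint.
function buildHealthRequest(scope: HealthScope, interval?: HealthInterval): HealthRequest {
  const base = "/internal/detection_engine/health"; // hypothetical base path
  const body: Record<string, unknown> = interval ? { interval } : {};
  switch (scope.type) {
    case "cluster":
      return { path: `${base}/_cluster`, body };
    case "space":
      // Assumes the space is addressed through a URL prefix.
      return { path: `/s/${encodeURIComponent(scope.spaceId)}${base}/_space`, body };
    case "rule":
      return { path: `${base}/_rule`, body: { ...body, rule_id: scope.ruleId } };
  }
}
```

Modelling the scope as a discriminated union keeps each endpoint's parameters explicit, so adding a new scope later would be a compile-time change rather than a new optional field on one large request.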
To do
`// TODO: https://github.com/elastic/kibana/issues/125642` comments.
Some ideas worth discussing and planning, maybe as separate epics: