While working on recent SDHs, it became evident that, in contrast to Elasticsearch, Kibana, and Task Manager, we don't have a lot of diagnostic data for Security Solution and the Detection Engine. There are few console logs and few rule execution logs stored in .kibana-event-log-*, not enough correlation ids in those logs, and the support-diagnostics tool does not support dumping anything related to Detection Engine.
Plan
Improve logging from rule executors. Write more/better logs with more correlation ids (see the sketch after this list):
Add more correlation ids: rule dynamic SO ID (rule.id), rule static “signature” ID (rule.rule_id), rule name, rule type
Log the ES query that is being executed
Improve clarity of log messages. For example, "Bulk indexing of signals failed" is misleading because we write it not only when indexing the generated alerts fails, but also when querying the source events fails, and probably for other reasons.
Write generic log messages to the Event Log (write some/all execution logs both to the console and the Event Log).
Write rule execution info to siem-detection-engine-rule-execution-info saved objects. See https://github.com/elastic/kibana/issues/110135
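A minimal sketch of what the executor-side logging could look like, using Kibana's core Logger with LogMeta. The meta field names and the logRuleExecution helper are hypothetical and would need to be agreed upon:

```ts
import type { Logger, LogMeta } from 'src/core/server';

// Hypothetical meta shape carrying the correlation ids listed above
interface RuleExecutionLogMeta extends LogMeta {
  rule: {
    id: string; // dynamic saved object id (rule.id)
    rule_id: string; // static "signature" id (rule.rule_id)
    name: string;
    type: string;
  };
}

// Hypothetical helper used inside a rule executor
function logRuleExecution(
  logger: Logger,
  rule: RuleExecutionLogMeta['rule'],
  esQuery: Record<string, unknown>
): void {
  // Log the ES query that is about to be executed, with correlation ids attached
  logger.debug<RuleExecutionLogMeta>(
    `Executing ES query for rule "${rule.name}": ${JSON.stringify(esQuery)}`,
    { rule }
  );
}
```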
Improve logging from route handlers. Write logs with correlation ids from Security Solution's API endpoints:
Write debug logs for all server-side async actions. Example: "fetching tags of all rules" etc.
Add correlation ids: Kibana space, route handler name, page name that initiated the API call (can be passed and read via the Referer HTTP header)
NOTE: Correlation ids can be attached to any console log record via an additional LogMeta object (example) and become available for slicing and dicing if Kibana logs are ingested into ES. We could potentially leverage this in Cloud.
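For illustration, a minimal sketch of a route-handler-side log carrying such correlation ids. The meta shape and the helper are hypothetical:

```ts
import type { KibanaRequest, Logger, LogMeta } from 'src/core/server';

// Hypothetical meta shape for route handler logs
interface RouteHandlerLogMeta extends LogMeta {
  kibana: { space: string };
  http: { route: string; referer?: string };
}

// Hypothetical helper called from a route handler, e.g.:
// logRouteAction(logger, request, spaceId, 'find_rules', 'fetching tags of all rules');
function logRouteAction(
  logger: Logger,
  request: KibanaRequest,
  spaceId: string,
  routeName: string,
  message: string
): void {
  logger.debug<RouteHandlerLogMeta>(message, {
    kibana: { space: spaceId },
    http: {
      route: routeName,
      // Page that initiated the API call, if the browser sent the Referer header
      referer: request.headers.referer as string | undefined,
    },
  });
}
```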
Include correlation ids in outgoing requests to Elasticsearch.
Since we need to analyze the tasks.json file (generated by the support-diagnostics tool), and it's not clear which rule sent a particular search request (or whether it was even a rule), it would be great if we could attach some correlation ids to the requests we send to Elasticsearch:
Referer type: rule executor or route handler
Rule id
Rule execution UUID
Route handler name
Maybe it could be done via custom HTTP headers, similar to X-elastic-product-origin and other headers we can see in tasks.json.
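A minimal sketch, assuming the per-request transport options of the ES client are used to attach such headers. All x-kibana-* header names below are hypothetical (in the spirit of x-elastic-product-origin) and would need to be standardized:

```ts
import type { ElasticsearchClient } from 'src/core/server';

interface SearchCorrelationIds {
  refererType: 'rule-executor' | 'route-handler';
  ruleId?: string;
  ruleExecutionUuid?: string;
  routeHandlerName?: string;
}

async function searchWithCorrelationIds(
  esClient: ElasticsearchClient,
  params: { index: string; body: Record<string, unknown> },
  ids: SearchCorrelationIds
) {
  // Custom headers travel with the request; the hope is they could then be
  // matched against entries in tasks.json (hypothetical header names)
  return esClient.search(params, {
    headers: {
      'x-kibana-referer-type': ids.refererType,
      ...(ids.ruleId ? { 'x-kibana-rule-id': ids.ruleId } : {}),
      ...(ids.ruleExecutionUuid ? { 'x-kibana-rule-execution-uuid': ids.ruleExecutionUuid } : {}),
      ...(ids.routeHandlerName ? { 'x-kibana-route-handler': ids.routeHandlerName } : {}),
    },
  });
}
```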
Measure more rule execution metrics:
Generic metrics:
total rule execution time
gap range (start time + end time) in addition to gap duration
number of source events matched during a rule execution
number of alerts generated by a given rule execution
NOTE: Detection Engine performance benchmarking could read the generic and rule type-specific metrics written to the Event Log during a benchmark run and calculate statistics as a result (median and percentiles across all rules, per rule type, per rule, etc).
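To illustrate the NOTE above, a minimal sketch of how a benchmark could read such metrics back, assuming the generic metrics end up as numeric fields on Event Log documents. Except for event.duration, which the Event Log already records in nanoseconds, the field names and the event.provider/event.action/rule.category values below are assumptions:

```ts
import type { ElasticsearchClient } from 'src/core/server';

// Hypothetical shape of the generic metrics written to the Event Log
interface GenericExecutionMetrics {
  totalExecutionTimeMs: number;
  gapStart: string; // ISO timestamp
  gapEnd: string; // ISO timestamp
  gapDurationS: number;
  sourceEventsMatched: number;
  alertsGenerated: number;
}

// Aggregates execution duration percentiles per rule type from the Event Log
async function executionTimeStatsPerRuleType(esClient: ElasticsearchClient) {
  return esClient.search({
    index: '.kibana-event-log-*',
    body: {
      size: 0,
      query: {
        bool: {
          filter: [
            { term: { 'event.provider': 'alerting' } },
            { term: { 'event.action': 'execute' } },
          ],
        },
      },
      aggs: {
        by_rule_type: {
          terms: { field: 'rule.category' },
          aggs: {
            execution_time: {
              // event.duration is recorded in nanoseconds
              percentiles: { field: 'event.duration', percents: [50, 90, 99] },
            },
          },
        },
      },
    },
  });
}
```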