While working on recent SDHs, it became evident that, in contrast to Elasticsearch, Kibana, and Task Manager, we don't have a lot of diagnostic data for Security Solution and the Detection Engine. There are few console logs and few rule execution logs stored in .kibana-event-log-*, not enough correlation ids in those logs, and the support-diagnostics tool does not support dumping anything related to Detection Engine.
Plan
Improve logging from rule executors. Write more/better logs with more correlation ids (see the sketch after this list):
Add more correlation ids: rule dynamic SO ID (rule.id), rule static “signature” ID (rule.rule_id), rule name, rule type
Log the ES query that is being executed
Improve clarity of log messages. For example, "Bulk indexing of signals failed" is misleading because we write it not only when indexing the generated alerts fails, but also when querying the source events fails, and probably for other reasons.
Write generic log messages to the Event Log (write some/all execution logs both to the console and the Event Log).
Write rule execution info to siem-detection-engine-rule-execution-info saved objects. See https://github.com/elastic/kibana/issues/110135
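A minimal sketch of what the executor-side logging could look like, using Kibana's core Logger with LogMeta. The meta field names and the logRuleExecution helper are hypothetical and would need to be agreed upon:

```ts
import type { Logger, LogMeta } from 'src/core/server';

// Hypothetical meta shape carrying the correlation ids listed above
interface RuleExecutionLogMeta extends LogMeta {
  rule: {
    id: string; // dynamic saved object id (rule.id)
    rule_id: string; // static "signature" id (rule.rule_id)
    name: string;
    type: string;
  };
}

// Hypothetical helper used inside a rule executor
function logRuleExecution(
  logger: Logger,
  rule: RuleExecutionLogMeta['rule'],
  esQuery: Record<string, unknown>
): void {
  // Log the ES query that is about to be executed, with correlation ids attached
  logger.debug<RuleExecutionLogMeta>(
    `Executing ES query for rule "${rule.name}": ${JSON.stringify(esQuery)}`,
    { rule }
  );
}
```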
Improve logging from route handlers. Write logs with correlation ids from Security Solution's API endpoints:
Write debug logs for all server-side async actions. Example: "fetching tags of all rules" etc.
Add correlation ids: Kibana space, route handler name, page name that initiated the API call (can be passed and read via the Referer HTTP header)
NOTE: Correlation ids can be attached to any console log record via an additional LogMeta object (example) and become available for slicing and dicing if Kibana logs are ingested into ES. We could potentially leverage this in Cloud.
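For illustration, a minimal sketch of a route-handler-side log carrying such correlation ids. The meta shape and the helper are hypothetical:

```ts
import type { KibanaRequest, Logger, LogMeta } from 'src/core/server';

// Hypothetical meta shape for route handler logs
interface RouteHandlerLogMeta extends LogMeta {
  kibana: { space: string };
  http: { route: string; referer?: string };
}

// Hypothetical helper called from a route handler, e.g.:
// logRouteAction(logger, request, spaceId, 'find_rules', 'fetching tags of all rules');
function logRouteAction(
  logger: Logger,
  request: KibanaRequest,
  spaceId: string,
  routeName: string,
  message: string
): void {
  logger.debug<RouteHandlerLogMeta>(message, {
    kibana: { space: spaceId },
    http: {
      route: routeName,
      // Page that initiated the API call, if the browser sent the Referer header
      referer: request.headers.referer as string | undefined,
    },
  });
}
```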
Include correlation ids in outgoing requests to Elasticsearch.
Since we need to analyze the tasks.json file (generated by the support-diagnostics tool), and it's not clear which rule sent a particular search request (or whether it was even a rule), it would be great if we could attach some correlation ids to the requests we send to Elasticsearch:
Referer type: rule executor or route handler
Rule id
Rule execution UUID
Route handler name
Maybe it could be done via custom HTTP headers, similar to X-elastic-product-origin and other headers we can see in tasks.json.
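A minimal sketch, assuming the per-request transport options of the ES client are used to attach such headers. All x-kibana-* header names below are hypothetical (in the spirit of x-elastic-product-origin) and would need to be standardized:

```ts
import type { ElasticsearchClient } from 'src/core/server';

interface SearchCorrelationIds {
  refererType: 'rule-executor' | 'route-handler';
  ruleId?: string;
  ruleExecutionUuid?: string;
  routeHandlerName?: string;
}

async function searchWithCorrelationIds(
  esClient: ElasticsearchClient,
  params: { index: string; body: Record<string, unknown> },
  ids: SearchCorrelationIds
) {
  // Custom headers travel with the request; the hope is they could then be
  // matched against entries in tasks.json (hypothetical header names)
  return esClient.search(params, {
    headers: {
      'x-kibana-referer-type': ids.refererType,
      ...(ids.ruleId ? { 'x-kibana-rule-id': ids.ruleId } : {}),
      ...(ids.ruleExecutionUuid ? { 'x-kibana-rule-execution-uuid': ids.ruleExecutionUuid } : {}),
      ...(ids.routeHandlerName ? { 'x-kibana-route-handler': ids.routeHandlerName } : {}),
    },
  });
}
```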
Measure more rule execution metrics:
Generic metrics:
total rule execution time
gap range (start time + end time) in addition to gap duration
number of source events matched during a rule execution
number of alerts generated by a given rule execution
NOTE: Detection Engine performance benchmarking could read the generic and rule type-specific metrics written to the Event Log during a benchmark run and calculate statistics as a result (median and percentiles across all rules, per rule type, per rule, etc).
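To illustrate the NOTE above, a minimal sketch of how a benchmark could read such metrics back, assuming the generic metrics end up as numeric fields on Event Log documents. Except for event.duration, which the Event Log already records in nanoseconds, the field names and the event.provider/event.action/rule.category values below are assumptions:

```ts
import type { ElasticsearchClient } from 'src/core/server';

// Hypothetical shape of the generic metrics written to the Event Log
interface GenericExecutionMetrics {
  totalExecutionTimeMs: number;
  gapStart: string; // ISO timestamp
  gapEnd: string; // ISO timestamp
  gapDurationS: number;
  sourceEventsMatched: number;
  alertsGenerated: number;
}

// Aggregates execution duration percentiles per rule type from the Event Log
async function executionTimeStatsPerRuleType(esClient: ElasticsearchClient) {
  return esClient.search({
    index: '.kibana-event-log-*',
    body: {
      size: 0,
      query: {
        bool: {
          filter: [
            { term: { 'event.provider': 'alerting' } },
            { term: { 'event.action': 'execute' } },
          ],
        },
      },
      aggs: {
        by_rule_type: {
          terms: { field: 'rule.category' },
          aggs: {
            execution_time: {
              // event.duration is recorded in nanoseconds
              percentiles: { field: 'event.duration', percents: [50, 90, 99] },
            },
          },
        },
      },
    },
  });
}
```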