
[Discuss] Efficiently utilizing Kibana tasks during rules with long running queries #200664

Open marshallmain opened 3 hours ago

marshallmain commented 3 hours ago

Summary

The goal of this discussion is to explore ways to consistently utilize a higher proportion of the available resources on Kibana nodes, reducing the number of Kibana nodes necessary to run security detection rules. The main challenge is the variation in query duration across different rules and rule types. When queries are long, Kibana spends a long time idle waiting for a response, so we would want task concurrency to be high. When queries are short, Kibana spends little time idle, so we would want task concurrency to be lower. This initial comment proposes a mechanism to dynamically adjust the number of tasks in progress on a single Kibana node by having tasks "sleep" while long-running queries execute in Elasticsearch.

Background

The alerting 10x scale project vastly improved the maximum theoretical throughput for alerting rules, from 3,200 rules per minute per cluster to well over 32,000 rules per minute per cluster. The framework changes both increased the maximum horizontal scalability of the task claiming architecture and reduced the overhead of claiming an individual task, so we can run more Kibana nodes and each node can handle more tasks than before. However, while these efforts vastly improve throughput when the Security solution logic in a rule completes quickly, there is still more we can do in cases where a rule makes long-running queries to Elasticsearch.

During a long-running query, Kibana is not using significant resources for the rule making the query, yet a task slot is still occupied by that rule. Conversely, rules that make a series of short queries have higher average resource usage per task in Kibana. This makes it challenging to select a single number of concurrent tasks per Kibana node that efficiently utilizes the available resources. If we choose too low a number, Kibana will be mostly idle when it picks up rules that make long-running queries. If we choose too high a number, Kibana will be oversubscribed when it picks up rules with shorter queries.
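
As a rough back-of-the-envelope illustration of why a single fixed concurrency can't fit both workloads (the model and numbers here are purely hypothetical):

```ts
// Hypothetical model: "activeFraction" is the fraction of a rule's wall-clock
// run spent doing work in Kibana; the rest is idle time waiting on Elasticsearch.
const slotUtilization = (concurrentTasks: number, activeFraction: number) =>
  concurrentTasks * activeFraction;

// 10 slots of rules that are busy 90% of the time: ~9 "CPU-equivalents" of work.
console.log(slotUtilization(10, 0.9)); // 9

// 10 slots of rules that spend 90% of their time waiting on a long-running
// query: only ~1 "CPU-equivalent" of work, and the node is mostly idle.
console.log(slotUtilization(10, 0.1)); // 1
```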

Proposal

To create a more dynamic task concurrency mechanism, we could implement asynchronous requests at the framework and task manager level and allow tasks to "sleep" while waiting for query results to become available. While a task is sleeping it would not count towards the maximum number of concurrent tasks (though we might impose a "max sleeping tasks" limit per node), so the framework could pick up and run other tasks. Once the query results are available, the task would "wake up", claim an empty task slot, and continue executing where it left off. Asynchronous requests would be exposed as a separate function on the Elasticsearch client so that solution rule authors can opt in to the behavior in places where we know we're likely to see long-running queries and are willing to accept any additional latency/overhead. When a node polls for tasks to claim, sleeping tasks on that node would be prioritized for waking up over claiming new tasks.
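
To make the opt-in concrete, here is a rough sketch of what the rule-author-facing surface could look like, assuming a hypothetical `asyncSearch` method exposed alongside the existing scoped search; the names and signatures are illustrative only, not an existing API:

```ts
// Hypothetical framework-provided services passed to a rule executor.
interface RuleExecutorServices {
  // Regular search: the task holds its slot for the full duration of the query.
  search(params: object): Promise<object>;
  // Opt-in async search: submits the query, lets the task "sleep" (freeing its
  // slot), and resolves only after the task has been woken with results ready.
  asyncSearch(params: object): Promise<object>;
}

async function executeEqlRule(services: RuleExecutorServices) {
  // EQL sequences are known to run long, so opt in to the async path and
  // accept the extra wake-up latency in exchange for freeing the task slot.
  const results = await services.asyncSearch({
    index: 'logs-*',
    query: 'sequence by host.name [process where true] [network where true]',
  });

  // Execution resumes here only after the task has been re-claimed into an
  // open slot and the results have been fetched from Elasticsearch.
  return results;
}
```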

In practice, this might go as follows: a Kibana node has 10 rules actively running and one is an EQL sequence rule. Since we know EQL sequences often take a long time to execute, we've modified the Security EQL rule type to use async framework requests. The EQL sequence rule makes the API call, the framework sends the request to Elasticsearch, and the task goes to sleep. Internally, the framework's async request implementation polls periodically for query results. Once the results are ready, the task becomes "ready to wake". On the next poll for tasks to claim, as many "ready to wake" tasks as possible are woken up (at most the number of available task slots) and the query results are loaded from Elasticsearch. (For EQL search, this delayed results fetching is made possible by the get async EQL search status API, so we don't have to load the full result set into memory until the task is fully woken up.) At this point the results are returned to the solution-specific logic and the rule continues on, without the solution logic having to deal with any of the polling - and other tasks have been claimed and run in the meantime.
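
For the EQL case specifically, the framework-internal flow could lean on Elasticsearch's existing async EQL search APIs: submit with a short `wait_for_completion_timeout`, poll the lightweight status endpoint while the task sleeps, and only fetch the full result set at wake time. A minimal sketch assuming direct use of the @elastic/elasticsearch client; how this plugs into task manager's claim cycle is glossed over:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Submit the EQL sequence query but return almost immediately; Elasticsearch
// keeps the search running in the background and hands back a search id.
async function submitAsyncEql(index: string, query: string) {
  return client.eql.search({
    index,
    query,
    wait_for_completion_timeout: '0s',
    keep_alive: '5m',
  }); // response contains `id` while the search is still running
}

// Cheap status check the framework could run while the task is asleep; this
// does not load the result set into Kibana's memory.
async function isEqlSearchComplete(id: string): Promise<boolean> {
  const status = await client.eql.getStatus({ id });
  return !status.is_running;
}

// Called only once the task has been woken into a free slot; this is the
// point where the (potentially large) results are actually fetched.
async function fetchEqlResults(id: string) {
  return client.eql.get({ id });
}
```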

elasticmachine commented 3 hours ago

Pinging @elastic/security-detection-engine (Team:Detection Engine)

elasticmachine commented 3 hours ago

Pinging @elastic/response-ops (Team:ResponseOps)