elastic / kibana

[Security Solution][Detection Engine] Investigate ways to bound memory usage of rule queries #192732

Open marshallmain opened 1 month ago

marshallmain commented 1 month ago

Parent issue: https://github.com/elastic/security-team/issues/10106

Detection rules typically fetch 100 source documents at a time to transform into alerts. When these source documents are large, this puts significant memory pressure on both Elasticsearch and Kibana. If the documents are large enough, Elasticsearch and/or Kibana can run out of memory and crash. We should investigate ways that we can limit the total amount of data retrieved at one time to avoid OOM problems.
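For illustration only, here's a rough sketch (TypeScript, using the @elastic/elasticsearch client) of the shape of search a rule issues today; the index pattern, query, and client setup are placeholders rather than the actual rule executor code:

```typescript
// Hypothetical sketch, not the real rule executor: a page of 100 hits with full
// _source means memory usage scales with 100 * (average document size) per fetch,
// on both the Elasticsearch and Kibana sides.
import { Client } from '@elastic/elasticsearch';

const esClient = new Client({ node: 'http://localhost:9200' });

async function fetchSourceDocuments() {
  const response = await esClient.search({
    index: 'logs-*',                  // placeholder source index pattern
    size: 100,                        // current page size for rule queries
    query: { match_all: {} },         // stands in for the rule's actual query
    sort: [{ '@timestamp': 'asc' }],
  });
  // Every hit carries its full _source, so large documents multiply memory pressure.
  return response.hits.hits;
}
```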

elasticmachine commented 1 month ago

Pinging @elastic/security-detection-engine (Team:Detection Engine)

marshallmain commented 3 days ago

One potential extreme approach to mitigate this issue is to separate the rule query logic from the alert creation process. We would modify the logic of each rule type to initially retrieve only the specific information necessary from each source document to determine which source documents should be turned into alerts. For example, for a basic KQL rule we'd use the `_source` param on the request to limit the returned fields (or potentially not return any at all). The `_id` and `_index` are sufficient to run the deduplication logic and determine if an alert is new.

So what we might do is create new alerts without initially copying the data over from source documents, then have a separate task that enriches newly created alerts with the full source data. Before enrichment, new alerts would contain the `kibana.*` fields but not much else - maybe for rules that group by certain fields we'd include those as well initially, but that's basically it. After the separate enrichment task runs, the alerts would be identical to the alerts we generate with the current implementation.
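To make the idea concrete, here's a hypothetical sketch of the two phases; the index pattern, query, client setup, and function names are placeholders, not the real detection engine code:

```typescript
// Hypothetical sketch of "query first, enrich later": fetch only what's needed
// to decide which source documents become alerts, then pull the full documents
// for just those hits in a separate enrichment step.
import { Client } from '@elastic/elasticsearch';

const esClient = new Client({ node: 'http://localhost:9200' });

// Step 1: run the rule query with _source disabled; _id and _index on each hit
// are enough for deduplication and for deciding whether an alert is new.
async function findMatchingDocumentRefs() {
  const response = await esClient.search({
    index: 'logs-*',
    size: 100,
    _source: false,                   // don't return document bodies
    query: { match_all: {} },         // stands in for the rule's actual query
    sort: [{ '@timestamp': 'asc' }],
  });
  return response.hits.hits.map((hit) => ({ _id: hit._id!, _index: hit._index }));
}

// Step 2 (separate task): enrich newly created alerts by fetching full _source
// for only the referenced documents, e.g. with a multi-get.
async function enrichAlerts(refs: Array<{ _id: string; _index: string }>) {
  const response = await esClient.mget({
    docs: refs.map((ref) => ({ _id: ref._id, _index: ref._index })),
  });
  return response.docs;
}
```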

With this approach, we'd have much tighter control over the maximum memory usage of the long-running rule tasks, so there would be less risk of OOM problems when running more rules concurrently on each Kibana node. We can have many rules initiating queries to Elasticsearch concurrently but limit alert enrichment to only one or a few concurrent tasks. However, I called it an "extreme" approach because the level of effort to implement this would be high - we'd need to modify the implementation details of each rule type individually and account for a variety of features that depend on source document data (suppression, exceptions, host/user enrichment, others?) to ensure we still fetch the subset of fields those features need in order to keep working. We would also have to implement a completely new task to do the post-creation enrichment for new alerts.
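As a rough illustration of that kind of cap (a generic helper written for this sketch, not an existing Kibana or Task Manager API):

```typescript
// Hypothetical sketch: many rules may issue lightweight queries at once, but only
// a few enrichment tasks (which hold full source documents in memory) run at a time.
async function runWithConcurrencyLimit<T>(
  tasks: Array<() => Promise<T>>,
  maxConcurrent: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let nextIndex = 0;

  async function worker() {
    while (nextIndex < tasks.length) {
      const index = nextIndex++;
      results[index] = await tasks[index]();
    }
  }

  // Start at most maxConcurrent workers; each picks up the next task as it finishes one.
  const workers = Array.from({ length: Math.min(maxConcurrent, tasks.length) }, worker);
  await Promise.all(workers);
  return results;
}

// Example: allow only 2 enrichment tasks at a time while rule queries run freely.
// await runWithConcurrencyLimit(newAlertBatches.map((batch) => () => enrichAlerts(batch)), 2);
```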

However, while this approach would help improve the maximum rule concurrency we can support, it does not address the drastic difference in query time across different rules, which also has a significant impact on the amount of concurrency we'd want. If a query is going to take 60s to return results, we might be comfortable having 60x as many of those queries running concurrently from a single Kibana node compared to a query that takes 1s to return results. The net rate of queries/second/Kibana node is equivalent, but with longer queries we need higher concurrency to fully utilize Kibana's resources. We need further research (out of scope for this issue, but related at a high level) to find good ways to achieve the optimal level of concurrency.
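For a back-of-the-envelope illustration of that relationship (made-up numbers):

```typescript
// Back-of-the-envelope illustration (hypothetical numbers): for a fixed target
// query rate per Kibana node, the concurrency limit scales with average query
// latency (Little's Law: concurrency ≈ rate * latency).
function concurrencyForTargetRate(targetQueriesPerSecond: number, avgQuerySeconds: number): number {
  return Math.max(1, Math.round(targetQueriesPerSecond * avgQuerySeconds));
}

// At a hypothetical 10 queries/second per node:
console.log(concurrencyForTargetRate(10, 1));  // 1s queries  -> ~10 concurrent rule queries
console.log(concurrencyForTargetRate(10, 60)); // 60s queries -> ~600 concurrent (60x more)
```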

If we don't address this, the impact is that it's more difficult to control if/when Kibana will hit OOM errors. The chance of a Kibana node running out of memory due to the alert creation logic seems low in production because source documents are usually small, rules are typically not all attempting to create alerts simultaneously, and we run a limited number of rules (10) concurrently on each Kibana node. But we don't control source document size, nor do we currently control how many rules attempt to create alerts simultaneously, and we want to run more rules concurrently. So if we don't address this, we should at least consider what recovering from an OOM error in this scenario looks like, and whether it leads to a gap in coverage. Backfill rule runs can cover the outage, but if a set of rule runs caused an OOM once, the backfill runs may well cause OOMs again.