[Security Solution] Failed detection rule execution to increment additional look-back of next execution

jguay commented 11 months ago

Describe the feature:

Currently if a rule does not run on schedule then the Detection Engine will automatically attempt to cover time that was missed the next time the rule does run. However, failures during a rule execution will not cause the next rule execution to re-search that older window.

Describe a specific use case for the feature:

If a rule execution fails because of an Elasticsearch exception, the next execution should extend its additional look-back time so that it covers the time since the last successful execution.

A common cause of rule execution failure is `no_shard_available_action_exception`. It is more likely on Elasticsearch clusters where creating an index takes longer, because any index is first created in red health until its primary shards become active. A search on a data stream while one backing index is red, because the execution runs an instant after the data stream rolled over, will cause the detection rule execution to fail.

If a rule runs every 5 minutes with 1 minute of additional look-back, a single execution failure causes a blind spot of 4 minutes. If the look-back were set to 6 minutes on the next execution, no data would be missed.
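The 4 minute figure can be checked with a little arithmetic. This is only a sketch, assuming executions fire exactly on schedule and each one queries the window from `run_time - interval - look_back` to `run_time`; `blind_spot` is a made-up helper name, not anything in Kibana.

```python
from datetime import timedelta


def blind_spot(interval: timedelta, look_back: timedelta, failed_runs: int = 1) -> timedelta:
    """Length of the time range that no successful execution searches.

    Assumption: runs fire exactly every `interval` and each run queries
    [run_time - interval - look_back, run_time].
    """
    # Time between the last successful run and the next successful run.
    gap = (failed_runs + 1) * interval
    # Time range covered by the next successful run.
    window = interval + look_back
    return max(gap - window, timedelta(0))


print(blind_spot(timedelta(minutes=5), timedelta(minutes=1)))  # 0:04:00
print(blind_spot(timedelta(minutes=5), timedelta(minutes=6)))  # 0:00:00
```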

Workaround

Increase the additional look-back time by the value of the rule frequency to avoid a blind spot after a single execution failure (enough in most scenarios)
Increase the additional look-back time by twice the value of the rule frequency to avoid a blind spot after 2 consecutive execution failures
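The workaround generalizes to any number of tolerated consecutive failures. A rough sketch, again assuming runs fire exactly on schedule; `look_back_for` is a hypothetical helper, not a Kibana setting or API:

```python
from datetime import timedelta


def look_back_for(interval: timedelta, tolerated_failures: int,
                  base_look_back: timedelta = timedelta(0)) -> timedelta:
    """Additional look-back that leaves no blind spot after
    `tolerated_failures` consecutive failed executions.

    Each tolerated failure adds one full rule interval on top of
    the look-back the rule already has.
    """
    return base_look_back + tolerated_failures * interval


five_min = timedelta(minutes=5)
one_min = timedelta(minutes=1)
print(look_back_for(five_min, 1, one_min))  # 0:06:00
print(look_back_for(five_min, 2, one_min))  # 0:11:00
```

The trade-off is that every execution then searches a wider window, including the ones that follow a successful run.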

Scenario description

Assume a rule that runs at a 5 minute frequency with 1 minute of additional look-back
The execution at 00:06:00 will query data from 00:00:00 onwards
Assume 2 documents with @timestamp 00:03:00 and 00:05:30 that would each trigger an alert, but the execution at 00:06:00 failed because of an Elasticsearch exception
The next execution runs at 00:11:00 and looks at data from 00:05:00 onwards - it triggers an alert for only one of the 2 documents, so one alert is missed
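The timeline above can be replayed in a few lines. This is a simplified model rather than the Detection Engine's actual logic: it assumes a failed run finds nothing and a successful run finds every document whose @timestamp falls inside its query window. The date is arbitrary and `missed_documents` is a hypothetical name.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=5)   # rule frequency from the scenario
LOOK_BACK = timedelta(minutes=1)  # additional look-back from the scenario
DAY = datetime(2023, 1, 1)        # arbitrary date, only the clock times matter


def at(hh, mm, ss=0):
    """Clock time on the arbitrary scenario day."""
    return DAY.replace(hour=hh, minute=mm, second=ss)


def missed_documents(doc_times, runs):
    """Return document timestamps that no successful run covers.

    `runs` is a list of (run_time, succeeded) pairs.
    """
    covered = set()
    for run_time, succeeded in runs:
        if not succeeded:
            continue  # a failed execution produces no alerts
        start = run_time - INTERVAL - LOOK_BACK
        covered.update(t for t in doc_times if start <= t <= run_time)
    return sorted(set(doc_times) - covered)


docs = [at(0, 3), at(0, 5, 30)]
runs = [(at(0, 6), False), (at(0, 11), True)]
missed = missed_documents(docs, runs)
print([t.strftime("%H:%M:%S") for t in missed])  # ['00:03:00']
```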

Scenario reproduction

Here is a reproducible scenario on a one-node cluster (without data streams, for simplicity):

Create an index template for `test*` indices with 900 primary shards, so that index creation takes over 10 seconds, during which searches will throw an exception:

```
PUT _index_template/template_900shards
{
  "index_patterns": [ "test*" ],
  "template": {
    "settings": {
      "number_of_shards": 900,
      "number_of_replicas": 0
    }
  }
}
```

Create a detection rule that triggers when `field` equals `alert`:

```
POST kbn:/api/alerting/rule
{
  "name": "alert",
  "tags": [],
  "consumer": "siem",
  "schedule": { "interval": "5m" },
  "params": {
    "author": [],
    "description": "alert",
    "ruleId": "fa96b731-f8a0-4e7e-91b4-69b3f9f24198",
    "falsePositives": [],
    "from": "now-360s",
    "immutable": false,
    "license": "",
    "outputIndex": "",
    "meta": {
      "from": "1m",
      "kibana_siem_app_url": "https://localhost:5601/app/security"
    },
    "maxSignals": 100,
    "riskScore": 21,
    "riskScoreMapping": [],
    "severity": "low",
    "severityMapping": [],
    "threat": [],
    "to": "now",
    "references": [],
    "version": 1,
    "exceptionsList": [],
    "relatedIntegrations": [],
    "requiredFields": [],
    "setup": "",
    "type": "query",
    "language": "kuery",
    "index": [ "test*" ],
    "query": "field.keyword : \"alert\"",
    "filters": []
  },
  "rule_type_id": "siem.queryRule",
  "notify_when": null,
  "actions": []
}
```

Add a document that should trigger an alert on the next execution (the document is ingested 4 minutes before the execution that we will cause to fail):

```
#note the document is ingested 4 minutes before next rule execution which is more than the 1 minute additional look-back setting
PUT test2/_doc/willmissalertdoc
{
  "@timestamp" : "2023-11-29T14:47:26.000Z",
  "field" : "alert"
}
```

Using the ID of the detection rule, check the value of `next_run`:

```
GET kbn:/api/alerting/rule/
```

Prepare the next index request, for a document with a timestamp just before the rule execution, and run it about 2 seconds before that execution:

```
#previous step "next_run": "2023-11-29T14:51:26.952Z" - running this at 14:51:25 - this took 20 seconds to return - the index was red during those 15 seconds
PUT test2/_doc/notmissedalertdoc
{
  "@timestamp" : "2023-11-29T14:51:26.000Z",
  "field" : "alert"
}
```

The outcome was that one execution failed, in this case with `no_shard_available_action_exception`:

Screenshot 2023-11-29 at 14.53.08.png

Document `notmissedalertdoc` is alerted on by the next rule execution because of the 1 minute additional look-back (at the next execution the document is less than 6 minutes old). Document `willmissalertdoc`, however, is missed by the next rule execution because its timestamp is 9 minutes before the next successful execution. The look-back would need to be 6 minutes instead of 1 minute to ensure that a single execution failure does not cause missed alerts.
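To make the request concrete, here is a sketch of how the query window could be extended back to the last successful execution, replayed against the timestamps from this reproduction. It is purely illustrative and not how the Detection Engine is implemented: `window_start` is a hypothetical function, and the last successful run and the next run are assumed to fall exactly 5 minutes before and after the failed run at 14:51:26.952.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=5)
LOOK_BACK = timedelta(minutes=1)


def window_start(now, last_success=None):
    """Start of the query window for the execution at `now`.

    Without `last_success` this is the regular schedule-based window.
    With it, the window is extended back to the last successful run
    (minus the look-back) whenever that is earlier.
    """
    scheduled = now - INTERVAL - LOOK_BACK
    if last_success is None:
        return scheduled
    return min(scheduled, last_success - LOOK_BACK)


failed_run = datetime(2023, 11, 29, 14, 51, 26, 952000)  # next_run from the repro
last_success = failed_run - INTERVAL  # assumption: previous run succeeded on schedule
next_run = failed_run + INTERVAL      # assumption: next run fires on schedule

docs = {
    "willmissalertdoc": datetime(2023, 11, 29, 14, 47, 26),
    "notmissedalertdoc": datetime(2023, 11, 29, 14, 51, 26),
}


def found(start):
    """Names of documents whose timestamp falls in [start, next_run]."""
    return {name for name, ts in docs.items() if start <= ts <= next_run}


print(sorted(found(window_start(next_run))))                # current behaviour
print(sorted(found(window_start(next_run, last_success))))  # extended window
```

With the regular window only `notmissedalertdoc` is found; with the extended window both documents are.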

elasticmachine commented 11 months ago

Pinging @elastic/security-solution (Team: SecuritySolution)

elasticmachine commented 9 months ago

Pinging @elastic/security-detections-response (Team:Detections and Resp)

elasticmachine commented 9 months ago

Pinging @elastic/security-detection-engine (Team:Detection Engine)

yctercero commented 9 months ago

@jguay thanks for writing up this use case. We've been discussing the ability to backfill runs and this is certainly a use case we can take into consideration. Once we have a public issue to follow, I will post an update.

cc @approksiu @paulewing

elastic / kibana

[Security Solution] Failed detection rule execution to increment additional look-back of next execution #172192
