Open jguay opened 11 months ago
Pinging @elastic/security-solution (Team: SecuritySolution)
Pinging @elastic/security-detections-response (Team:Detections and Resp)
Pinging @elastic/security-detection-engine (Team:Detection Engine)
@jguay thanks for writing up this use case. We've been discussing the ability to backfill runs and this is certainly a use case we can take into consideration. Once we have a public issue to follow, I will post an update.
cc @approksiu @paulewing
Describe the feature:
Currently, if a rule does not run on schedule, the Detection Engine automatically attempts to cover the missed time the next time the rule runs. However, a failure during a rule execution does not cause the next execution to re-search the window that the failed execution should have covered.
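The requested behavior can be illustrated with a minimal sketch, assuming the rule keeps track of when it last completed successfully. The function `search_window_start` and its parameters are hypothetical and are not part of the Detection Engine. The sketch only shows the idea of extending the search window back to the last successful execution.

```python
from datetime import datetime, timedelta
from typing import Optional


def search_window_start(
    now: datetime,
    interval: timedelta,
    lookback: timedelta,
    last_success: Optional[datetime],
) -> datetime:
    """Return the start of the time range the next execution should search.

    Hypothetical helper, not an existing API. The default window is
    interval + additional look-back. If the last successful execution is
    older than one interval (for example because the previous execution
    failed), the window is extended back to that execution, keeping the
    usual look-back overlap.
    """
    default_start = now - interval - lookback
    if last_success is None:
        # No successful execution recorded yet: use the default window.
        return default_start
    return min(default_start, last_success - lookback)
```

With a 5-minute interval and a 1-minute look-back, an execution at 00:11:00 whose last success was at 00:01:00 would search from 00:00:00 rather than from 00:05:00.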
Describe a specific use case for the feature:
If a rule execution fails because of an Elasticsearch exception, the next execution should use an additional look-back time, extended to cover the time since the last successful execution.
A common cause of rule execution failure is `no_shard_available_action_exception`, which is more likely on Elasticsearch clusters where index creation takes longer, because every index is first created in red health until its primary shards become active. A search on a data stream will cause the detection rule execution to fail if one index is red because the execution runs an instant after the data stream rolled over.
If a rule runs every 5 minutes with 1 minute of additional look-back, a single execution failure causes a blind spot of 4 minutes. If the look-back were set to 6 minutes on the next execution, no data would be missed.
Workaround
Scenario description
The execution at `00:06:00` will query data from `00:00:00`.
Two documents, with `@timestamp` values of `00:03:00` and `00:05:30`, would trigger alerts, but the execution at `00:06:00` failed because of an Elasticsearch exception.
The next execution runs at `00:11:00` and looks at data from `00:05:00`. It triggers an alert for only one of the two documents, so one alert is missed.
Scenario reproduction
Here is a reproducible scenario on a one-node cluster (without data streams, for simplicity):
Create a template for `test*` indices with 900 primary shards, so that index creation takes over 10 seconds, during which searches will throw an exception:
```
PUT _index_template/template_900shards
{
  "index_patterns": [
    "test*"
  ],
  "template": {
    "settings": {
      "number_of_shards": 900,
      "number_of_replicas": 0
    }
  }
}
```
Create a detection rule that triggers when `field` equals `alert`:
```
POST kbn:/api/alerting/rule
{
  "name": "alert",
  "tags": [],
  "consumer": "siem",
  "schedule": {
    "interval": "5m"
  },
  "params": {
    "author": [],
    "description": "alert",
    "ruleId": "fa96b731-f8a0-4e7e-91b4-69b3f9f24198",
    "falsePositives": [],
    "from": "now-360s",
    "immutable": false,
    "license": "",
    "outputIndex": "",
    "meta": {
      "from": "1m",
      "kibana_siem_app_url": "https://localhost:5601/app/security"
    },
    "maxSignals": 100,
    "riskScore": 21,
    "riskScoreMapping": [],
    "severity": "low",
    "severityMapping": [],
    "threat": [],
    "to": "now",
    "references": [],
    "version": 1,
    "exceptionsList": [],
    "relatedIntegrations": [],
    "requiredFields": [],
    "setup": "",
    "type": "query",
    "language": "kuery",
    "index": [
      "test*"
    ],
    "query": "field.keyword : \"alert\"",
    "filters": []
  },
  "rule_type_id": "siem.queryRule",
  "notify_when": null,
  "actions": []
}
```
Add a document that should trigger an alert on the next execution (the document is ingested 4 minutes before the execution that we will make fail):
```
# note: the document is ingested 4 minutes before the next rule execution, which is more than the 1 minute additional look-back setting
PUT test2/_doc/willmissalertdoc
{
  "@timestamp" : "2023-11-29T14:47:26.000Z",
  "field" : "alert"
}
```
Using the ID of the detection rule, check when `next_run` is:
```
GET kbn:/api/alerting/rule/
```
Prepare and run the next index request, for a document with a timestamp 2 seconds before the rule execution:
```
# previous step: "next_run": "2023-11-29T14:51:26.952Z" - running this at 14:51:25 - this took 20 seconds to return - the index was red during those 15 seconds
PUT test2/_doc/notmissedalertdoc
{
  "@timestamp" : "2023-11-29T14:51:26.000Z",
  "field" : "alert"
}
```
The outcome was that one execution failed in this case with `no_shard_available_action_exception`.
The document `notmissedalertdoc` is alerted on by the next rule execution because of the 1 minute additional look-back (on the next execution the document is less than 6 minutes old).
But the document `willmissalertdoc` is missed by the next rule execution because it was indexed 9 minutes before the next successful execution. The look-back would therefore need to be 6 minutes instead of 1 minute to ensure that a single execution failure does not cause missed alerts.
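The timing in this reproduction can be checked with a short sketch. It uses the `@timestamp` values and the `next_run` value from the steps above, and assumes that the next successful execution runs exactly one 5-minute interval after the failed one. The helper `covered` is hypothetical and only models the search window as `[run - interval - look-back, run]`.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=5)

# @timestamp values from the reproduction steps
docs = {
    "willmissalertdoc": datetime(2023, 11, 29, 14, 47, 26),
    "notmissedalertdoc": datetime(2023, 11, 29, 14, 51, 26),
}

# The failed execution was scheduled for the reported `next_run` value.
# Assumption: the next successful execution runs exactly one interval later.
failed_run = datetime(2023, 11, 29, 14, 51, 26, 952000)
next_run = failed_run + INTERVAL


def covered(lookback: timedelta) -> list[str]:
    """Names of documents inside [next_run - INTERVAL - lookback, next_run]."""
    start = next_run - INTERVAL - lookback
    return sorted(name for name, ts in docs.items() if start <= ts <= next_run)


print(covered(timedelta(minutes=1)))  # 1 minute additional look-back
print(covered(timedelta(minutes=6)))  # 6 minutes additional look-back
```

Under these assumptions, the 1-minute look-back covers only `notmissedalertdoc`, while the 6-minute look-back covers both documents.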