Open pmuellr opened 1 month ago
Pinging @elastic/response-ops (Team:ResponseOps)
Thinking about this for a minute, we should have seen some kind of indication of rules NOT running on time, since these presumably weren't. I didn't investigate that. Would be interesting to see what we're capturing for that, and if it did capture the rule delay.
Slack thread with some additional specific info: https://elastic.slack.com/archives/C05MZ28B0BA/p1723057045465409
Some other errors are labeled as framework errors but should be user errors. Note that these are the exact strings I searched for, and I found them tagged "framework-error":
circuit_breaking_exception
missing shards
no_shard_available_action_exception
unable to authenticate
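For illustration, a rough sketch of how messages containing these strings could be re-tagged as user errors. This is not the actual executor or alerting framework code: `ErrorSource`, `USER_ERROR_SUBSTRINGS`, and `classifyErrorSource` are hypothetical names, and plain substring matching is an assumption about how the classification might be done.

```typescript
// Hypothetical sketch: names and matching strategy are assumptions,
// not the real Kibana alerting framework API.
type ErrorSource = "user" | "framework";

// Substrings copied verbatim from the list above.
const USER_ERROR_SUBSTRINGS: readonly string[] = [
  "circuit_breaking_exception",
  "missing shards",
  "no_shard_available_action_exception",
  "unable to authenticate",
];

// Returns "user" when the message contains any known user-error substring,
// otherwise falls back to "framework".
function classifyErrorSource(message: string): ErrorSource {
  const lower = message.toLowerCase();
  return USER_ERROR_SUBSTRINGS.some((s) => lower.includes(s))
    ? "user"
    : "framework";
}
```

Substring matching is brittle if the upstream wording changes, so matching on a structured error type would be preferable wherever one is available.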
Pinging @elastic/security-detections-response (Team:Detections and Resp)
Moving to security as this will require changes to the executor.
We recently saw a number of messages from SIEM rules, presumably from some deployments/projects under stress:
Executing Rule siem.eqlRule:{{rule-id}} has resulted in the following error(s): 21 minutes (1256420ms) were not queried between this rule execution and the last execution, so signals may have been missed. Consider increasing your look behind time or adding more Kibana instances,21 minutes (1256420ms) were not queried between this rule execution and the last execution, so signals may have been missed. Consider increasing your look behind time or adding more Kibana instances
These feel like user errors, but are counted as framework errors. If the rule is hitting this for performance reasons, we'll be getting signals elsewhere indicating the problem, so I don't think these need to be counted as framework errors.
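As a rough sketch, the gap message quoted above could be recognized by its fixed wording and the millisecond value pulled out, so the executor could tag it as a user error. `GAP_MESSAGE_PATTERN` and `parseGapMs` are hypothetical names, and matching on message text is an assumption; the real executor may have the gap duration available in structured form.

```typescript
// Hypothetical sketch: matches the gap warning text quoted in this issue
// and captures the millisecond value inside the parentheses.
const GAP_MESSAGE_PATTERN =
  /\((\d+)ms\) were not queried between this rule execution and the last execution/;

// Returns the gap in milliseconds if the message is a gap warning, else null.
function parseGapMs(message: string): number | null {
  const match = GAP_MESSAGE_PATTERN.exec(message);
  return match ? Number(match[1]) : null;
}

// A non-null result would mark the error as a user error rather than a framework error.
function isGapError(message: string): boolean {
  return parseGapMs(message) !== null;
}
```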