We had a report of a customer hitting the max alerts per rule execution limit (default/max: 1000). It looked like alert documents were not having their kibana.alert.status field updated from active to recovered.
I did a repro of this by setting the limit to 2 alerts and creating a rule that would create up to 4 alerts per run. Then I manually made the alerts go active/recovered, going over the max, back under it, etc.
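For reference, lowering the limit for the repro looks roughly like this in kibana.yml (assuming the xpack.alerting.rules.run.alerts.max setting is the per-run alert cap):

```yaml
# kibana.dev.yml - lower the per-run alert limit for the repro
# (default and maximum are normally 1000)
xpack.alerting.rules.run.alerts.max: 2
```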
One strange thing is that the recovery action message didn't seem to print the {{context.group}} field, meaning it wasn't set, when an alert recovers while the execution still hits the max limit. It does seem to be set when the execution does not go over the max limit.
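For context, the recovery action message was along these lines (a rough reconstruction, not the exact text; {{rule.name}} here is just an example variable); it's the {{context.group}} reference that came through empty at the max limit:

```
{{rule.name}} recovered: group "{{context.group}}" is no longer over the threshold
```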
Then at some point, it started running recovery actions on every execution, even though the conditions hadn't changed. Perhaps it was "auto-recovering" some of the alerts when at the max limit? That might be OK, but it doesn't seem like it should happen over and over again when the data being alerted on hasn't changed. However, it could be that all four alerts were active, and it kept recovering two of them and then creating two new ones each time. I can't really tell, since the logs didn't print the group :-(
I can see 5 new-instance events and 5 recovered-instance events; however, there are 10 alert docs.
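For anyone checking those counts, a query along these lines against the event log should show them (the .kibana-event-log-* pattern and event.action values are assumed from the standard event log layout):

```
GET .kibana-event-log-*/_search
{
  "size": 0,
  "query": {
    "terms": { "event.action": ["new-instance", "recovered-instance"] }
  },
  "aggs": {
    "by_action": { "terms": { "field": "event.action" } }
  }
}
```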
The alert docs all had kibana.alert.status: recovered at the end when everything recovered, and active while active, so they seemed to be consistent. This was an index threshold rule.
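Similarly, the alert doc statuses can be eyeballed with something like this (the .alerts-stack.alerts-default alias is an assumption for where index threshold alert docs land):

```
GET .alerts-stack.alerts-default/_search
{
  "size": 0,
  "aggs": {
    "by_status": { "terms": { "field": "kibana.alert.status" } }
  }
}
```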
Three files here: the contents of the server log, the alerts generated, and the event log docs.
local-console.log.txt
local-alerts.hits.ndjson.json
local-events.hits.ndjson.json