pmuellr commented 5 months ago

Recently, Istvan opened PR https://github.com/elastic/kibana/pull/184416 to add a kibana.alert.muted: boolean field to alert documents. The field is intended to hold the mute state of the alert at the time the alert is created.

The PR itself is sound and the idea is a great one, but we realized it has some complications, noted below. Check the PR for at least a sense of the sorts of changes that we'd be making.

NAMES, amirite?

Rather than use "mute" or "snooze" here, I think we want to work in the words "action" and "suppression" (or similar). Mute and snooze have meanings already within alerts, but this is really more centered on action execution (or suppression of it) than the alert itself.

Is it just for alert muting / rule muting? Does it matter which one? Or should it include other "action suppression" capabilities like flapping, maintenance windows, conditions associated with actions, etc?

I think the original goal in the PR was to do analysis of alerts that did NOT cause actions to run, so I think we'd want all of these to behave the same, in terms of indicating that an action has been suppressed.

Do we want to track the actual reason why the action was suppressed: alert muting vs rule muting vs flapping, etc?

If we do want to do this, you can imagine a new field action_suppression: keyword whose value would contain the reason why the action was suppressed. Eg, alert-muted, rule-muted, flapping, etc.

One problem/confusion with this is conditional actions, using:

"If alert matches a query"
"If alert is generated during timeframe"

These are per action customizations, not per rule, so for a rule that fires an alert with 3 actions, you can imagine one is not suppressed, one is suppressed because it matches a query, and one is suppressed because it was generated during the chosen timeframe. What would we put in the field then?

We could consider the field to be an array (in source), but then it is a little clumsy to access the other suppression types, which are per rule and so there will only be a single value. But that seems like the safest route.

This is not terribly precise, since we can't tell which action goes with which suppression method. But it obviously provides some useful information, and only becomes imprecise when using multiple actions with an alert.

I also just happened to think there are probably some other reasons why we wouldn't run actions; for instance, if a connector is disabled, we obviously won't run an action for it, and I think we'd want this noted as well. This one would be a "per action" suppression (vs "per rule").

I also also happened to think - what about the "only run on status changes" sort of settings? I think these should come into play as well - actions would have been run but due to the rule / system settings (rule setting of "run when status changes"), the actions did NOT run. I believe the "on custom action intervals" would be the same.

Do we want the action suppression state at the time the alert is created, or when the alert is later updated, or both?

The original request was to add the value determined when the alert is created, so there's obviously a desire to have that value available. But it also seems like it would be very useful to have the current values. So we would likely have an initial_action_suppression and a current_action_suppression. We'd only update the latter field in the alert doc on subsequent rule runs.
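To make the initial/current split concrete, here is a minimal sketch of the update rule. The field names are the ones proposed in this comment; the SuppressionFields type and the nextSuppressionFields function are hypothetical and not part of any existing Kibana code.

```typescript
// Hypothetical document fragment; the field names come from the proposal in
// this thread and are not part of any existing alert schema.
interface SuppressionFields {
  initial_action_suppression: string[];
  current_action_suppression: string[];
}

// `existing` is undefined on the rule run that creates the alert.
function nextSuppressionFields(
  existing: SuppressionFields | undefined,
  reasonsThisRun: string[]
): SuppressionFields {
  return {
    // Captured once at creation and carried forward unchanged afterwards.
    initial_action_suppression:
      existing === undefined ? reasonsThisRun : existing.initial_action_suppression,
    // Overwritten on every rule run.
    current_action_suppression: reasonsThisRun,
  };
}
```

With this rule, an alert that was muted when created and later unmuted keeps the mute reason in the initial field while the current field becomes empty.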

Another interesting way to look at this is to think about "# of actions run from this alert". An alert without any action suppression would have the number of actions run, but an alert with all actions suppressed would have 0 (zero).

That would be a field like actions_suppressed: long. Again, we'd probably want an "initial" and "current" value. We may also want a field like actions_run: long, where the sum of actions_suppressed and actions_run would be the number of actions for the alert, suppressed or not. Also, a field actions_total: long which would be that same number (number of actions for the alert). Then you could write queries looking for rules that have actions, but didn't run them. Vs rules that have no actions, and thus actions_run would be zero anyway.

We could of course add these fields AND the action_suppression field, but if we can get by with just one flavor, it would save some $$$.
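As a rough illustration of how the three counters relate, here is a sketch. The ActionCounts type and both functions are hypothetical, and the input is assumed to be one boolean per action configured for the alert.

```typescript
// Hypothetical counters, named after the fields floated in this comment.
interface ActionCounts {
  actions_total: number;
  actions_run: number;
  actions_suppressed: number;
}

// `suppressedFlags` has one entry per action configured for the alert:
// true if that action was suppressed, false if it ran.
function countActions(suppressedFlags: boolean[]): ActionCounts {
  const suppressed = suppressedFlags.filter((flag) => flag).length;
  return {
    actions_total: suppressedFlags.length,
    actions_run: suppressedFlags.length - suppressed,
    actions_suppressed: suppressed,
  };
}

// Distinguishes "had actions, ran none" from "had no actions at all".
function hadActionsButRanNone(counts: ActionCounts): boolean {
  return counts.actions_total > 0 && counts.actions_run === 0;
}
```

An alert with no actions gets zero in all three fields, so hadActionsButRanNone is false for it, while an alert whose actions were all suppressed has actions_run of zero but a non-zero actions_total.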

Do we also want this in the event log?

I think so, but probably we can defer this.

The event log generates a document for every alert generated for every rule run, so adding an action_suppression field would work for these documents - event.provider: alerting AND event.action: (new-instance, active-instance, recovered-instance). We wouldn't necessarily need separate current and initial values, as the initial value would be in the oldest active-instance document for the alert, as well as in the single new-instance document for the alert.

Note that the event log's rule execution documents contain two fields that will be of interest here:

kibana.alert.rule.execution.metrics.number_of_generated_actions
kibana.alert.rule.execution.metrics.number_of_triggered_actions

It's not completely clear if these values are in sync with the thoughts and proposals here. For instance, it looks like disabled connectors aren't counted in either, but it feels to me like they should be. But the basic idea is "generated" are the possible actions to run and "triggered" are the ones that were actually selected to run.

Proposal

Add initial_action_suppression and current_action_suppression fields to alert docs as keyword fields. The source will be an array of values, where only "per-action" suppression methods like "if alert matches a query" could have multiple values. The values would be at least the following, with the field value being the union of all the "per-action" suppression methods (when multiple actions are not triggered).

rule-muted
alert-muted
no-status-change
custom-action-interval
flapping
maintenance-window
matches-query
matches-timeframe
connector-disabled

No values would indicate no action suppression took place, but there is unfortunately no current field indicating how many actions were run. So the value for an alert with no suppression of its actions would be the same as for an alert with no actions at all - an empty array.

I think this is fine for the original PR use case; they seem to just want to find alerts that had suppressed actions. And we could consider adding a field for actions run later, if needed.
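A sketch of how the field value could be assembled as a de-duplicated union, assuming the rule-level reasons and the per-action reasons are already known. The SuppressionReason type and collectSuppressionReasons function are hypothetical; the string values are the ones listed in the proposal.

```typescript
// The reason strings listed in the proposal above.
type SuppressionReason =
  | "rule-muted"
  | "alert-muted"
  | "no-status-change"
  | "custom-action-interval"
  | "flapping"
  | "maintenance-window"
  | "matches-query"
  | "matches-timeframe"
  | "connector-disabled";

// `ruleLevel` holds reasons that apply to the whole rule or alert.
// `perAction` has one entry per action: a reason, or undefined if it ran.
function collectSuppressionReasons(
  ruleLevel: SuppressionReason[],
  perAction: Array<SuppressionReason | undefined>
): SuppressionReason[] {
  const seen = new Set<SuppressionReason>(ruleLevel);
  for (const reason of perAction) {
    if (reason !== undefined) {
      seen.add(reason);
    }
  }
  // A Set keeps insertion order, so rule-level reasons come first.
  return Array.from(seen);
}
```

Both collectSuppressionReasons([], []) (an alert with no actions) and collectSuppressionReasons([], [undefined, undefined]) (two actions, both run) return an empty array, which is the ambiguity described here.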

elasticmachine commented 5 months ago

Pinging @elastic/response-ops (Team:ResponseOps)

pmuellr commented 4 months ago

cc: @shanisagiv1 @joana-cps

rhr323 commented 4 months ago

For our use case of conducting alert analytics (e.g., team health, alert noise, etc.), either alternative would be suitable. The version with initial_action_suppression: keyword appears slightly cleaner to me compared to using a counter like actions_suppressed: long. However, I don’t have a strong preference for either.

Thanks for considering this enhancement; it would be highly valuable for us!

shanisagiv1 commented 4 months ago

Thanks @pmuellr for putting all of this together, very detailed! Here are a few thoughts:

To your terminology question - I agree that if this PR covers all scenarios where an action can be suppressed (even when the alert was created, like flapping or MW), the better naming would be "action.suppressed" rather than "alert.muted".

I also agree with your point related to multiple actions per alert. How do you imagine this array being structured? My concern is that we end up with an array of multiple reasons (one per action), without the ability to correlate which reason relates to which action. wdyt?

Related to your point about initial_action_suppression vs current - why not just have the latest? If I understand their need correctly, they want to filter the view to get all alerts that did fire so they can reverse engineer the noise.

And a last note - I agree the number of actions might be useful, but I guess it's more interesting whether it was zero or 1+, so starting without it makes sense to me. (I don't think users will try to compare 5 actions with 3 emails and 2 Slack messages to make sure it's aligned. At least not the typical user :))

I'll check with Sec folks if they're looking for such a mechanism and if there are any thoughts.

approksiu commented 4 months ago

Thanks @shanisagiv1 and @pmuellr! This information would be useful for security users, for example if they are investigating why some alerts did not trigger a response in SOAR. How would this look for multiple actions set for a rule? Also, I think it is useful to know if the action did not work because its connector failed (in addition to connector-disabled) cc @paulewing

pmuellr commented 4 months ago

How would this look for multiple actions set for a rule?

Yeah, that's where it gets complicated. I'd like to avoid nested objects here, which we'd need to handle the precise description of what happened. What I'm thinking is we would just add values that were appropriate, without precision, so for instance if the alert fired and one connector was disabled and one was not run because it didn't match a query, you'd have ["connector-disabled", "matches-query"], with no information on which connector is associated with which value.

I wonder if it would make sense to change these so the current strings I have are prefixes, with a suffix of the relevant connector id. So it might be: ["connector-disabled:8917230981", "matches-query:17290812"] instead. I'm not sure how much this would affect the expected usage here, in dashboards, for instance. Would someone want to do aggs on just the "types" of these (the prefix)? Or is this mainly informational?
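To get a feel for what consumers of the prefixed form would have to do, here is a small sketch that splits each entry on the first ":" and counts by reason type. The parseEntry and countByReason helpers are hypothetical, and the sketch assumes neither the reason strings nor the ids contain a ":" themselves.

```typescript
interface ParsedEntry {
  reason: string;
  connectorId?: string;
}

// Splits "reason:connectorId" on the first colon; entries without a colon
// (rule-level reasons such as "flapping") have no connector id.
function parseEntry(entry: string): ParsedEntry {
  const idx = entry.indexOf(":");
  if (idx === -1) {
    return { reason: entry };
  }
  return { reason: entry.slice(0, idx), connectorId: entry.slice(idx + 1) };
}

// Counts entries by reason type, ignoring the connector id suffix.
function countByReason(entries: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const entry of entries) {
    const { reason } = parseEntry(entry);
    counts[reason] = (counts[reason] ?? 0) + 1;
  }
  return counts;
}
```

One trade-off to keep in mind: a terms aggregation on a keyword field buckets on the whole string, so with the suffix included, aggs on just the type would need this kind of splitting somewhere (client side, or in a separate field that holds only the prefix).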