elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana
Other
19.65k stars 8.23k forks source link

Ability to throttle alert instances until action group changes #50077

Closed mikecote closed 3 years ago

mikecote commented 5 years ago

This feature request would be to stop being reminded of an alert after it fires. Currently users can disable throttling (throttle: null) which causes alerts to fire actions every time or they can throttle alerts for a certain period of time (ex: throttle: 5m).

When throttling for a time window, it is considered as "remind me every X minutes" style feature. This request would be to "stop reminding me about this until the action group changes".

For example an alert that constantly fires default action group would stop firing actions until the group changes to something other than default. This would be the same as only firing actions when the action group changes. This would be useful for alerts that execute actions that are non-idempotent.

This feature could instead be configured at the alert type level instead? This would mean when defining an alert type, you could configure the throttling behaviour.

cc @mdefazio to work on mock ups for this.

Some early thought; adding a checkbox next to "Notify every" in the UI. When unchecked, alert instances are throttled until the state changes.

elasticmachine commented 5 years ago

Pinging @elastic/kibana-stack-services (Team:Stack Services)

mikecote commented 4 years ago

Instead of using "throttle" as a terminology, this could be a hook for firing actions only when entering an action group. This could also be used by maps for "entering" and "leaving" a geofenced area instead of notifying (w/ a throttle) when someone is "within" or "outside" a geofenced area.

thomasneirynck commented 4 years ago

mentioned here (https://github.com/elastic/kibana/issues/81631) as well, but this would be very useful for a planned "containment"-option of the geo-threshold alert (https://github.com/elastic/kibana/issues/80749).

cc @kmartastic @aaronjcaldwell

ymao1 commented 4 years ago

@mdefazio Are there mockups for this?

ymao1 commented 4 years ago

This feature could instead be configured at the alert type level instead? This would mean when defining an alert type, you could configure the throttling behaviour.

Some early thought; adding a checkbox next to "Notify every" in the UI. When unchecked, alert instances are throttled until the state changes.

If I'm reading this correctly, there seems to be two different suggestions:

  1. Configuring at the alert type level, so each alert type would determine whether to notify on state change
  2. Adding a checkbox next to "Notify every" implies to me this is a configuration across all alerts (since this configuration exists outside of the alert type specific settings).

I think #2 makes the most sense to me, but I wanted to make sure I've read this correctly.

mikecote commented 4 years ago

I'm thinking option 2 as well. At least we can add option 1 on top of the solution in the future if we prefer both. I would be curious of what @arisonl thinks as well.

pmuellr commented 4 years ago

For some reason, I thought we were just going to change to always do this. Throttle would "break" once the action group changes. If you want "throttle even if alert group changes", I think we were leaving that to the not-yet-implemented "snooze" setting (a timed mute).

Probably worth putting a doc together on all the things we're thinking about here, which would include current behavior, things like "snooze", and scheduled mutes. I'm a little worried about the combinatorial explosion of all these potential throttle/mute settings ...

ymao1 commented 4 years ago

Good idea @pmuellr

This is how I understand the current behavior (please correct me if I'm wrong).

With https://github.com/elastic/kibana/issues/82274, we can now assign actions to different action groups, but that doesn't change any of the above behavior.

With https://github.com/elastic/kibana/issues/82412, we will be able to specify conditions per action group, but it's TBD whether changes to the above behavior will be included with that issue.

mikecote commented 4 years ago

With #82274, we can now assign actions to different action groups, but that doesn't change any of the above behavior.

I believe changing action groups will reset the throttle window and fire actions as soon as the action group changes.

ymao1 commented 4 years ago

Gotcha. So if we add a switch or checkbox with this issue, the behavior would be:

mikecote commented 4 years ago

That's correct based on how I recall / envisioned it.

@pmuellr

Probably worth putting a doc together on all the things we're thinking about here, which would include current behavior, things like "snooze", and scheduled mutes. I'm a little worried about the combinatorial explosion of all these potential throttle/mute settings ...

I wonder if we can use this meta issue: https://github.com/elastic/kibana/issues/67597. You are right, there will be lots of combinations in the future.

mdefazio commented 4 years ago

I do not have mocks for this yet. But the behavior you describe seems to make sense to me.

arisonl commented 4 years ago

@ymao1 The following two comments capture the user requirements on when to trigger (assuming unmuted) and the perceived state changes from a user perspective:

Your latest comment (https://github.com/elastic/kibana/issues/50077#issuecomment-723193366) is consistent with these requirements, assuming we have the required set of transitions (action groups) in place.

In addition:

if throttle is defined, action(s) will fire when the alert instance meets the alert criteria, action group has changed and the throttle time has elapsed since the last fired action

If alert-on-state-change flag is false and schedule-based throttling is defined, shouldn't this fallback to the existing schedule-based throttling? Why is the "action group has changed" condition required here? Shouldn't state-change-based throttling and schedule-based throttling be mutually exclusive? Isn't the combination making the mental model very complicated? What value is the combination adding? @pmuellr @mikecote @mdefazio

ymao1 commented 4 years ago

If alert-on-state-change flag is false and schedule-based throttling is defined, shouldn't this fallback to the existing schedule-based throttling? Why is the "action group has changed" condition required here? Shouldn't state-change-based throttling and schedule-based throttling be mutually exclusive? Isn't the combination making the mental model very complicated? What value is the combination adding?

@arisonl You are correct, if alert-on-state-change is false, it should default to the existing schedule-based throttling. I think I muddled my boolean logic there. It should read

if throttle is defined, action(s) will fire when (the alert instance meets the alert criteria AND the throttle time has elapsed since the last fired action) OR (the action group has changed)

mikecote commented 4 years ago

I'm wondering if the terminology should be alert-only-on-state-change otherwise it can seem like an opt-in thing to also get notifications when the state changes which already happens with throttled alert instances.

I also wanted to clarify, "state" === "action group" so when mentioning state change, it means changing action groups. I'm ok if the technical terms change to "action group" to avoid further confusion.

ymao1 commented 4 years ago

I also wanted to clarify, "state" === "action group" so when mentioning state change, it means changing action groups. I'm ok if the technical terms change to "action group" to avoid further confusion.

@mikecote Thank you. I think that clears up a point of possible confusion that @pmuellr brought up: what is the state that change that we're notifying on. Answer action group

pmuellr commented 4 years ago

@arisonl: If alert-on-state-change flag is false and schedule-based throttling is defined, shouldn't this fallback to the existing schedule-based throttling?

Just a note that my thoughts on the schedule-based (ie, cron) silencing of alerts is that it's all muting, not throttling. Feels like schedule-based throttling is going to become a bit of a mind-bender to customers, probably confusing if they do or don't see notifications they weren't expecting, then have to reason backwards as to why. Muting is more of a boolean, seems like a less complex thing to reason about. Select an "calendar entry" of some kind, mute it, now you know - you will NOT get notifications during that period.

It's a simplification anyway :-)

ymao1 commented 4 years ago

A diagram to illustrate notification behavior with the different throttling conditions:

Screen Shot 2020-11-12 at 1 23 39 PM
mikecote commented 3 years ago

Just wanted to drop some mock ups of how I saw the UI for this work:

Maybe by default is we no longer "notify every" and we could display the form like this:

Screen Shot 2020-11-23 at 4 42 24 PM

If ever the user wishes to be notified every x interval, they check "notify every" and enter a value:

Screen Shot 2020-11-23 at 4 41 45 PM
mikecote commented 3 years ago

cc @arisonl it would be worth figuring out the default behaviour. Do users want to be "notified every" by default or just when the action group changes? I can see where the throttle is opt-in but there's also that weird use case of not throttling and always notify..

pmuellr commented 3 years ago

So the unchecked case in the mockup ^^^ is to throttle only on action group changes, I guess? And there's no way to get the old behavior of throttle: null where the actions run for every alert execution? That's what "that weird use case of not throttling and always notify.." is? That feels to me like a pretty important case, since it's the default case for every alert today.

It seems like this is going to have to end up being a three-state selector, somehow:

  1. no throttle at all (throttle: null, notifyOnlyOnActionGroupChange: false)
  2. no throttle on interval (throttle: interval, notifyOnlyOnActionGroupChange: false)
  3. throttle till action group changes (throttle: ignored, notifyOnlyOnActionGroupChange: true)

notifyOnlyOnActionGroupChange is from PR https://github.com/elastic/kibana/pull/82969

Interesting that if we do it like that, you could "toggle" notifyOnlyOnActionGroupChange without having to destroy the existing throttle value. Seems like this would be useful, if a customer wants to temporarily switch from case 2 to case 3, and then back to case 2 later.

Alternatively, throttle could become a more complex value - I guess it's an "interval/duration string" now? So we could special case a string like onlyOnActionGroupChange. But then the previous throttle value WOULD be destroyed. Feels kinda messy as well, "magic string values".

mikecote commented 3 years ago

So the unchecked case in the mockup ^^^ is to throttle only on action group changes, I guess?

Correct, if it was labeled "Re-notify every", I think it would be more clear. Unchecked = don't re-notify.

And there's no way to get the old behavior of throttle: null where the actions run for every alert execution?

That's what "that weird use case of not throttling and always notify.." is?

Yeah, it inherits the previous problem where it's not clear null means notify all the time (not sure if others share the same). In theory null is the same as having the same value as the interval, maybe more clear to change it to that or something?

That feels to me like a pretty important case, since it's the default case for every alert today.

This is where I wished we had telemetry or maybe we could compare default behaviour of other alerting systems to see if they re-notify by default or just the first time. It's mostly to make sure we have the proper default between these 3 options and the user understand / expects the default behaviour.

Alternatively, throttle could become a more complex value - I guess it's an "interval/duration string" now? So we could special case a string like onlyOnActionGroupChange. But then the previous throttle value WOULD be destroyed. Feels kinda messy as well, "magic string values".

Another option could be to present 3 radio buttons and come up with clear labels of the 3 options the user has for how to want to be notified.

ymao1 commented 3 years ago

I like the idea of explicitly presenting the 3 options somehow (with some additional descriptive words) instead of relying on the user to figure out what combination of checked/unchecked state and throttle input gets the behavior they want.

The 3 radio buttons with clear labels would work. Or maybe a dropdown with 3 values:

and when the third option is selected, the throttle inputs show up.

mdefazio commented 3 years ago

I have started some designs for this, but I will try these new approaches as well. Knowing the right defaults is definitely an important piece of this. And how often users will want to modify it (Is the check interval more important and throttling can remain at our defaults most of the time? In which case, is interval exposed and throttling is behind an additional click?)

pmuellr commented 3 years ago

I'm happy changing the default, probably to "notify on action group changes". Both no throttle and throttle on a different interval always seemed kinda clunky to me, probably not what you want most of the time.

mdefazio commented 3 years ago

I've updated the mockups after some additional help from @ymao1 : The custom interval view I think still needs work–—it looks like it's possible to set it so it conflicts with the alert schedule.
I also think the copy would greatly benefit from @gchaps . Curious to get the team's thoughts on it at this state though.

I think we should also move this to the bottom of the create flyout since this now takes up quite a bit of space. (Hopefully the typical defaults would be good enough for most and they won't have to mess with this).

image

pmuellr commented 3 years ago

Those views look good to me.

The custom interval view I think still needs work–—it looks like it's possible to set it so it conflicts with the alert schedule. I also think the copy would greatly benefit from @gchaps . Curious to get the team's thoughts on it at this state though.

That's not new :-). It's always been possible to set illogical throttle values. It's not clear how much validation we should be doing on these, I sort of feel like we should have some visualization or better "explainer" for all the time-related params/config things. That would be the alert interval, throttle, the mythical "snooze", and any alert-specific values like index threshold's window parameter. Like showing some kind of time-line with the various intervals on tic lines with relative time values on the axis. Kinda thing. I was thinking at one point of trying to overlay some of that on the "preview" graphs, but I think they'll get too busy, especially for non-trivial alerts.

Perhaps what we should do in lieu of "validation" is present a warning/hint for known illogical values, like if the throttle is less than the alert interval.

In any case, worthy of another issue, we don't need to "solve" this problem here.

I think we should also move this to the bottom of the create flyout since this now takes up quite a bit of space. (Hopefully the typical defaults would be good enough for most and they won't have to mess with this).

I'm leery of separating the alert interval from the throttle - so if the suggestion is we move both, then +1. Otherwise, if interval is at the top, and threshold is at the bottom, doesn't feel right.

mdefazio commented 3 years ago

I'm leery of separating the alert interval from the throttle - so if the suggestion is we move both, then +1. Otherwise, if interval is at the top, and threshold is at the bottom, doesn't feel right.

Yes, apologies for the confusion. My thought was that the whole block shown in the screenshots would move to the bottom

pmuellr commented 3 years ago

My thought was that the whole block shown in the screenshots would move to the bottom

Perfect - I think we discussed that in a previous design meeting, and seemed good to me.