Closed mikecote closed 3 years ago
Pinging @elastic/kibana-stack-services (Team:Stack Services)
Instead of using "throttle" as a terminology, this could be a hook for firing actions only when entering an action group. This could also be used by maps for "entering" and "leaving" a geofenced area instead of notifying (w/ a throttle) when someone is "within" or "outside" a geofenced area.
mentioned here (https://github.com/elastic/kibana/issues/81631) as well, but this would be very useful for a planned "containment"-option of the geo-threshold alert (https://github.com/elastic/kibana/issues/80749).
cc @kmartastic @aaronjcaldwell
@mdefazio Are there mockups for this?
This feature could instead be configured at the alert type level instead? This would mean when defining an alert type, you could configure the throttling behaviour.
Some early thought; adding a checkbox next to "Notify every" in the UI. When unchecked, alert instances are throttled until the state changes.
If I'm reading this correctly, there seems to be two different suggestions:
I think #2 makes the most sense to me, but I wanted to make sure I've read this correctly.
I'm thinking option 2 as well. At least we can add option 1 on top of the solution in the future if we prefer both. I would be curious of what @arisonl thinks as well.
For some reason, I thought we were just going to change to always do this. Throttle would "break" once the action group changes. If you want "throttle even if alert group changes", I think we were leaving that to the not-yet-implemented "snooze" setting (a timed mute).
Probably worth putting a doc together on all the things we're thinking about here, which would include current behavior, things like "snooze", and scheduled mutes. I'm a little worried about the combinatorial explosion of all these potential throttle/mute settings ...
Good idea @pmuellr
This is how I understand the current behavior (please correct me if I'm wrong).
throttle=null
, action(s) will fire each time an alert instance meets the alert criteriathrottle
is defined, action(s) will fire when the alert instance meets the alert criteria and the throttle time has elapsed since the last fired actionWith https://github.com/elastic/kibana/issues/82274, we can now assign actions to different action groups, but that doesn't change any of the above behavior.
With https://github.com/elastic/kibana/issues/82412, we will be able to specify conditions per action group, but it's TBD whether changes to the above behavior will be included with that issue.
With #82274, we can now assign actions to different action groups, but that doesn't change any of the above behavior.
I believe changing action groups will reset the throttle window and fire actions as soon as the action group changes.
Gotcha. So if we add a switch or checkbox with this issue, the behavior would be:
That's correct based on how I recall / envisioned it.
@pmuellr
Probably worth putting a doc together on all the things we're thinking about here, which would include current behavior, things like "snooze", and scheduled mutes. I'm a little worried about the combinatorial explosion of all these potential throttle/mute settings ...
I wonder if we can use this meta issue: https://github.com/elastic/kibana/issues/67597. You are right, there will be lots of combinations in the future.
I do not have mocks for this yet. But the behavior you describe seems to make sense to me.
@ymao1 The following two comments capture the user requirements on when to trigger (assuming unmuted) and the perceived state changes from a user perspective:
Your latest comment (https://github.com/elastic/kibana/issues/50077#issuecomment-723193366) is consistent with these requirements, assuming we have the required set of transitions (action groups) in place.
In addition:
if throttle is defined, action(s) will fire when the alert instance meets the alert criteria, action group has changed and the throttle time has elapsed since the last fired action
If alert-on-state-change
flag is false
and schedule-based throttling is defined, shouldn't this fallback to the existing schedule-based throttling? Why is the "action group has changed" condition required here? Shouldn't state-change-based throttling and schedule-based throttling be mutually exclusive? Isn't the combination making the mental model very complicated? What value is the combination adding? @pmuellr @mikecote @mdefazio
If alert-on-state-change flag is false and schedule-based throttling is defined, shouldn't this fallback to the existing schedule-based throttling? Why is the "action group has changed" condition required here? Shouldn't state-change-based throttling and schedule-based throttling be mutually exclusive? Isn't the combination making the mental model very complicated? What value is the combination adding?
@arisonl You are correct, if alert-on-state-change
is false, it should default to the existing schedule-based throttling. I think I muddled my boolean logic there. It should read
if throttle is defined, action(s) will fire when (the alert instance meets the alert criteria AND the throttle time has elapsed since the last fired action) OR (the action group has changed)
I'm wondering if the terminology should be alert-only-on-state-change
otherwise it can seem like an opt-in thing to also get notifications when the state changes which already happens with throttled alert instances.
I also wanted to clarify, "state" === "action group" so when mentioning state change, it means changing action groups. I'm ok if the technical terms change to "action group" to avoid further confusion.
I also wanted to clarify, "state" === "action group" so when mentioning state change, it means changing action groups. I'm ok if the technical terms change to "action group" to avoid further confusion.
@mikecote Thank you. I think that clears up a point of possible confusion that @pmuellr brought up: what is the state
that change that we're notifying on. Answer action group
@arisonl: If alert-on-state-change flag is false and schedule-based throttling is defined, shouldn't this fallback to the existing schedule-based throttling?
Just a note that my thoughts on the schedule-based (ie, cron) silencing of alerts is that it's all muting, not throttling. Feels like schedule-based throttling is going to become a bit of a mind-bender to customers, probably confusing if they do or don't see notifications they weren't expecting, then have to reason backwards as to why. Muting is more of a boolean, seems like a less complex thing to reason about. Select an "calendar entry" of some kind, mute it, now you know - you will NOT get notifications during that period.
It's a simplification anyway :-)
A diagram to illustrate notification behavior with the different throttling conditions:
Just wanted to drop some mock ups of how I saw the UI for this work:
Maybe by default is we no longer "notify every" and we could display the form like this:
If ever the user wishes to be notified every x interval, they check "notify every" and enter a value:
cc @arisonl it would be worth figuring out the default behaviour. Do users want to be "notified every" by default or just when the action group changes? I can see where the throttle is opt-in but there's also that weird use case of not throttling and always notify..
So the unchecked case in the mockup ^^^ is to throttle only on action group changes, I guess? And there's no way to get the old behavior of throttle: null
where the actions run for every alert execution? That's what "that weird use case of not throttling and always notify.." is? That feels to me like a pretty important case, since it's the default case for every alert today.
It seems like this is going to have to end up being a three-state selector, somehow:
throttle: null, notifyOnlyOnActionGroupChange: false
)throttle: interval, notifyOnlyOnActionGroupChange: false
)throttle: ignored, notifyOnlyOnActionGroupChange: true
)notifyOnlyOnActionGroupChange
is from PR https://github.com/elastic/kibana/pull/82969
Interesting that if we do it like that, you could "toggle" notifyOnlyOnActionGroupChange
without having to destroy the existing throttle
value. Seems like this would be useful, if a customer wants to temporarily switch from case 2 to case 3, and then back to case 2 later.
Alternatively, throttle
could become a more complex value - I guess it's an "interval/duration string" now? So we could special case a string like onlyOnActionGroupChange
. But then the previous throttle
value WOULD be destroyed. Feels kinda messy as well, "magic string values".
So the unchecked case in the mockup ^^^ is to throttle only on action group changes, I guess?
Correct, if it was labeled "Re-notify every", I think it would be more clear. Unchecked = don't re-notify.
And there's no way to get the old behavior of
throttle: null
where the actions run for every alert execution?That's what "that weird use case of not throttling and always notify.." is?
Yeah, it inherits the previous problem where it's not clear null means notify all the time (not sure if others share the same). In theory null is the same as having the same value as the interval, maybe more clear to change it to that or something?
That feels to me like a pretty important case, since it's the default case for every alert today.
This is where I wished we had telemetry or maybe we could compare default behaviour of other alerting systems to see if they re-notify by default or just the first time. It's mostly to make sure we have the proper default between these 3 options and the user understand / expects the default behaviour.
Alternatively, throttle could become a more complex value - I guess it's an "interval/duration string" now? So we could special case a string like onlyOnActionGroupChange. But then the previous throttle value WOULD be destroyed. Feels kinda messy as well, "magic string values".
Another option could be to present 3 radio buttons and come up with clear labels of the 3 options the user has for how to want to be notified.
I like the idea of explicitly presenting the 3 options somehow (with some additional descriptive words) instead of relying on the user to figure out what combination of checked/unchecked state and throttle input gets the behavior they want.
The 3 radio buttons with clear labels would work. Or maybe a dropdown with 3 values:
and when the third option is selected, the throttle inputs show up.
I have started some designs for this, but I will try these new approaches as well. Knowing the right defaults is definitely an important piece of this. And how often users will want to modify it (Is the check interval more important and throttling can remain at our defaults most of the time? In which case, is interval exposed and throttling is behind an additional click?)
I'm happy changing the default, probably to "notify on action group changes". Both no throttle and throttle on a different interval always seemed kinda clunky to me, probably not what you want most of the time.
I've updated the mockups after some additional help from @ymao1 :
The custom interval view I think still needs work–—it looks like it's possible to set it so it conflicts with the alert schedule.
I also think the copy would greatly benefit from @gchaps . Curious to get the team's thoughts on it at this state though.
I think we should also move this to the bottom of the create flyout since this now takes up quite a bit of space. (Hopefully the typical defaults would be good enough for most and they won't have to mess with this).
Those views look good to me.
The custom interval view I think still needs work–—it looks like it's possible to set it so it conflicts with the alert schedule. I also think the copy would greatly benefit from @gchaps . Curious to get the team's thoughts on it at this state though.
That's not new :-). It's always been possible to set illogical throttle values. It's not clear how much validation we should be doing on these, I sort of feel like we should have some visualization or better "explainer" for all the time-related params/config things. That would be the alert interval, throttle, the mythical "snooze", and any alert-specific values like index threshold's window
parameter. Like showing some kind of time-line with the various intervals on tic lines with relative time values on the axis. Kinda thing. I was thinking at one point of trying to overlay some of that on the "preview" graphs, but I think they'll get too busy, especially for non-trivial alerts.
Perhaps what we should do in lieu of "validation" is present a warning/hint for known illogical values, like if the throttle is less than the alert interval.
In any case, worthy of another issue, we don't need to "solve" this problem here.
I think we should also move this to the bottom of the create flyout since this now takes up quite a bit of space. (Hopefully the typical defaults would be good enough for most and they won't have to mess with this).
I'm leery of separating the alert interval from the throttle - so if the suggestion is we move both, then +1. Otherwise, if interval is at the top, and threshold is at the bottom, doesn't feel right.
I'm leery of separating the alert interval from the throttle - so if the suggestion is we move both, then +1. Otherwise, if interval is at the top, and threshold is at the bottom, doesn't feel right.
Yes, apologies for the confusion. My thought was that the whole block shown in the screenshots would move to the bottom
My thought was that the whole block shown in the screenshots would move to the bottom
Perfect - I think we discussed that in a previous design meeting, and seemed good to me.
This feature request would be to stop being reminded of an alert after it fires. Currently users can disable throttling (throttle:
null
) which causes alerts to fire actions every time or they can throttle alerts for a certain period of time (ex: throttle:5m
).When throttling for a time window, it is considered as "remind me every X minutes" style feature. This request would be to "stop reminding me about this until the action group changes".
For example an alert that constantly fires
default
action group would stop firing actions until the group changes to something other thandefault
. This would be the same as only firing actions when the action group changes. This would be useful for alerts that execute actions that are non-idempotent.This feature could instead be configured at the alert type level instead? This would mean when defining an alert type, you could configure the throttling behaviour.
cc @mdefazio to work on mock ups for this.
Some early thought; adding a checkbox next to "Notify every" in the UI. When unchecked, alert instances are throttled until the state changes.