Icinga / icinga-notifications

Icinga Notifications: new and improved notifications and incident management for Icinga (work in progress, not ready for production yet)
GNU General Public License v2.0
9 stars 0 forks source link

Incident muting (for downtimes, flapping, acknowledgements) #190

Closed julianbrost closed 1 week ago

julianbrost commented 2 months ago

Reviewing the code of #146 and testing it showed some issues that where the proper solution requires some kind of incident muting mechanism.

  1. As we don't track any kind of in-downtime state, downtime notifications get lost if we're not connected to Icinga 2 while the event happens as the event stream catch-up code has no last known state that it could compare the current state from the Icinga 2 API with.
  2. The current implementation does not properly open incidents after a downtime ended as it just completely dismisses the state change event. Doing this properly requires processing the event but not sending notifications, which is what this issue asks for: muting.
  3. State change events from the event stream API don't include whether the checkable is flapping so this is currently not handled. That could be solved by flapping just muting the incident (or the object, as it might flap between OK and something else, yet another question).
nilmerg commented 1 month ago

Yesterday, @bobapple, @lippserd, @julianbrost and me discussed this and agreed on the following:

Extended State Events

Sources will be able to include additional details while transmitting state events.

These details are as follows:

mute: A boolean that indicates whether notification transmission for the associated object should be suppressed. mute_reason: A string with text that provides a reason in case mute: true. This is only used to inform a user.

Repeated Transmission

Two successive state events with mute: true don't influence each other. Even if mute_reason differs. Only the first is processed with regards to notification suppression.

Handling During No Active Incident

None. Yet.

Handling During An Active Incident

Once notification suppression is switched on, the incident history should indicate this with an appropriate entry. This should also be the case if the incident has just been opened. (i.e. as the second entry)

Escalations are triggered as usual. The transmission state of resulting notifications should remain in a new state called suppressed.

In case notification suppression is disabled and the incident closes, previously suppressed notifications are not transmitted.

On the other hand, if the incident is still open, all notifications that were previously suppressed must now be transmitted at once reflecting the current severity.

julianbrost commented 1 month ago

Handling During No Active Incident

None. Yet.

The object has to be marked as muted so that if an incident is created, it is handled correctly.

The transmission state of resulting notifications should remain in a new state called suppressed.

I'd just call it muted for consistent naming within that feature.

On the other hand, if the incident is still open, all notifications that were previously suppressed must now be transmitted at once reflecting the current severity.

No, that should simply notify about the state from the unmuting event (similar like if it was a severity change). Otherwise, it might send multiple notifications and also to no longer active contacts in a schedule.

nilmerg commented 1 month ago

No, that should simply notify about the state from the unmuting event (similar like if it was a severity change)

I've meant exactly that!

julianbrost commented 1 month ago

The transmission state of resulting notifications should remain in a new state called suppressed.

I'd just call it muted for consistent naming within that feature.

As both @nilmerg and @yhabteab have expressed doubts about this suggestion now, let me explain why I came up with it.

To me, notification_state='suppressed' sounds very generic, and notification_state='muted' was my first idea of a more compact representation that would be something like notification_state='suppressed' suppression_reason='object-muted-by-source' in a super verbose form, i.e. to present a more detailed reason in a machine-readable format. So that the history event can provide more information on its own, without having to search for the actual reason in preceding events.

nilmerg commented 1 month ago

I can't follow you, honestly. Please elaborate this, if you like, in person.

julianbrost commented 1 month ago

If more causes of a notification being suppressed are added, how would you represent them in the history? Would they all get notification_state='suppressed'?

Other reasons could be something like "incident was muted manually in the web interface". And we already have another situation where you could argue such a history entry should be added as well: "incident has a manager and regular recipients weren't notified". I'd differentiate these, the question is whether to do this within the notification_state column (notification_state={'source-muted', 'incident-muted', 'incident-managed'}) or with a new column (notification_state='suppressed' and suppression_reason={'source-muted', 'incident-muted', 'incident-managed'}).

Yes, muted was just my first idea, looks like I'd be even more specific with source-muted after thinking about it more.

yhabteab commented 1 month ago

FINE! I'll move on to source-muted then! I've one more question though, the title of the issue also contains Acknowledgements, how do you plan to proceed with this? After all, Ack events aren't handled like any other events, so I've no idea what I'm supposed to do with them.

nilmerg commented 1 month ago

Once notification suppression is switched on, the incident history should indicate this with an appropriate entry. This should also be the case if the incident has just been opened. (i.e. as the second entry)

To me, this should be obvious based on the preceding events. There's is no reason why a specific notification has been suppressed, just one since when any of them are.

the title of the issue also contains Acknowledgements, how do you plan to proceed with this?

This is one of the reasons why an object can be muted. Basically very similar to a native way to mute an object, or incident, explicitly. Though, in this case issued by the source itself. Whether we'll keep the functionality that such can declare a manager, hasn't been discussed...

julianbrost commented 1 month ago

Once notification suppression is switched on, the incident history should indicate this with an appropriate entry. This should also be the case if the incident has just been opened. (i.e. as the second entry)

To me, this should be obvious based on the preceding events. There's is no reason why a specific notification has been suppressed, just one since when any of them are.

I don't really understand what you're saying here on a language level. Do you want to say that there multiple reasons for suppressing a notification may be active at the same time so there's not necessarily a single reason why the notification wasn't sent?

nilmerg commented 1 month ago

No. That's what you're implying by suggesting to define multiple suppression types. I'm referring to this:

Two successive state events with mute: true don't influence each other. Even if mute_reason differs. Only the first is processed with regards to notification suppression.

There is only a single reason at a time. If suppression is enabled, all and any notifications are suppressed. The reason is visible in the history of the incident.

That is why I don't see your argument against suppressed and instead for muted (Or any of the later variances)

julianbrost commented 1 month ago

Oh, I don't want to split the reason why the source muted the object up into multiple reasons. All of the reasons specific to Icinga 2 should still be considered one reason here.

However, when other suppression reasons are added within Notifications, it should be possible to distinguish them. For different reasons, you'd have to look for different previous events for more details ("Icinga 2 muted the object: in downtime", "John Doe muted the incident: [comment written by the user]", ...). There's no plan to add such other reasons as part of this issue, the question is simply whether some other naming would be a better starting point if we want to do this in the future.

As of https://github.com/Icinga/icinga-notifications/issues/190#issuecomment-2126647046, i.e. before I started a discussion about this, how would you have presented a notification event with notification_state='suppressed' in the history view? More like "Notification to John Doe was suppressed" or "Notification to John Doe was suppressed because the source muted the object" or something completely different?

nilmerg commented 1 month ago

e.g. (top to bottom)