bosun-monitor / bosun

Time Series Alerting Framework
http://bosun.org
MIT License

Add Recovery Emails #333

Closed: kylebrandt closed this issue 3 years ago

kylebrandt commented 9 years ago

When an alert instance goes from an abnormal state (Unknown, Warning, or Critical) to Normal, a recovery email should be sent.

Considerations:

My main reservation about this feature is that users are more likely not to investigate an alert that has recovered, which is dangerous because the alert could point to a latent issue. However, it is better to provide a frictionless workflow than a roadblock. Bosun aims to provide all the tools needed for very informative notifications so that good judgements can be made without needing to go to a console. Furthermore, we should also add acknowledgement notifications. These would be a way to inform all recipients of an alert that someone has made a decision about it and, hopefully, committed to an action (fixing the actual problem, or tuning the alert).

Ack emails will be described in another issue.

This feature needs discussion and review prior to implementation.

dinoshauer commented 9 years ago

Is this still under consideration? It'd be nice to have bosun auto-resolve with pagerduty :)

acesaro commented 8 years ago

+1

nhproject commented 8 years ago

+10

kylebrandt commented 8 years ago

I still have the same concern: an alert going back to "normal" does not necessarily mean that the issue has been resolved. We have implemented ack emails. Ideas for the future:

I'm more okay with this if we have support for these sorts of recovery conditions, but that would involve a new factor in an already complicated state machine. So I don't think this is currently on Stack's roadmap.

bbelchak commented 8 years ago

+1 for this!

kristianpaul commented 8 years ago

:+1:

basigabri commented 8 years ago

+1 Had to wake up several times for alerts that had already recovered.

djohnsen commented 8 years ago

+1 for PagerDuty integration; we too would like to tell it that we crossed back over the threshold to normal for a particular incident ID.

We're investigating Bosun to mediate multi-site monitors (New Relic Synthetics checking externally-facing pages from multiple sources); the initial goal is to require multiple source checks to fail for a period of time before alerting.

Regarding the philosophy of this RFE:

I agree the ideal solution is to tune the alert rule and thresholds correctly so it only alerts when it matters, and to use groups / notification timeouts / nexts to set up a notification chain.

A Bosun admin can do that, but we're trying to cure an alert-deafness problem across a larger team. Enabling recipients to individually tune their alerts based on a return to normal is needed (think management escalation paths).

The concern about "not investigating" is a choice the recipient should be allowed to make on their own; simply provide the information when there's a state change from warn/crit to normal.

Re: the closure issue: leave the incident in the open/normal state on the Bosun dashboard. If one chooses to investigate, all the information is there, and future alerts trigger on the same incident and keep the history available.

As near as I can tell, someone addressing the underlying condition causing an alert is not required to touch Bosun in any way; they can just fix the problem. This would be a boon in our situation: a small Ops squad in a large DevOps team should not have to create a large Bosun user base that needs regular custom Bosun configs, and most people outside our squad wouldn't know how, or care, to fuss with Bosun.

I might suggest an alert implementation similar to this:

alert is_normal {
    template = norm_pagerduty_tmp
    norm = <any_warn_or_crit_legal_condition>
    normNotification = pagerduty
}

The {{.Alert.Prior.<whatever>}} variables would be available in the template, allowing one to include in the notification any information from its last alert state.
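
For illustration, a rough sketch of what such a template might look like in Bosun's config syntax. The template name comes from the proposal above, and the .Alert.Prior fields are hypothetical (they do not exist in Bosun today); only this is a sketch of the idea, not an existing feature:

# Hypothetical sketch: .Alert.Prior.* is proposed above, not an existing Bosun template field.
template norm_pagerduty_tmp {
    subject = {{.Alert.Name}} on {{.Group.host}} returned to normal
    body = `State changed from {{.Alert.Prior.Status}} (last value {{.Alert.Prior.Value}}) back to normal.`
}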

It might be worth adding the state change from crit->warn as a notification, though this isn't a feature we're presently concerned with.

I hope you'll reconsider the value of this feature.

kylebrandt commented 8 years ago

@djohnsen Can you expand on how your team uses pagerduty? It might help me understand the integration story there, and in particular why the lack of state change notifications hurts.

Perhaps a notification flag could just send all events (state changes) to the endpoint for any alerts associated with that notification.

Bosun's state machine is already pretty complex. We have log alerts, which sort of bypass it but don't do anything in regard to normal notifications. We have ignoreUnknown, dependencies (unevaluated states), squelching, and silences. All of this is meant to behave in a certain way:

Bosun creates issues, and issues are resolved by humans. A new issue is created when an alert goes from normal to a warn, crit, or unknown state (a.k.a. abnormal states). If notification chains exist, notifications will be sent until a human acknowledges the incident.
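
For context, a minimal sketch of the kind of notification chain meant here (the names, address, endpoint, and timeouts are made up); an alert would reference the first notification via warnNotification or critNotification, and the chain keeps escalating until the incident is acknowledged:

notification oncall_email {
    email = oncall@example.com   # placeholder address
    next = oncall_pager          # escalate if not acked...
    timeout = 10m                # ...after 10 minutes
}

notification oncall_pager {
    post = https://example.com/pager-webhook   # placeholder endpoint
    next = oncall_pager                        # keep re-notifying until acked
    timeout = 30m
}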

Currently, notifications are triggered when the abnormal state increases (warn->crit, crit->unknown, warn->unknown) within the lifetime of an incident (until a person closes it). A property of this model is flap suppression. If we start sending all state changes, the next thing I think people will want is flapping detection.

So adding more options to the possible incident/alert states isn't trivial, since it increases the number of combined states an alert key can be in (and can then result in bugs).

But I wonder if maybe people just want to bypass our state machine completely, pipe events to pagerduty, and let it take it from there.
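
Sketching that "bypass the state machine" idea purely hypothetically (the flag name and endpoint below are made up and are not existing Bosun options):

notification pagerduty_relay {
    post = https://example.com/pagerduty-relay   # placeholder relay endpoint
    sendAllEvents = true   # hypothetical flag: forward every state change instead of only escalations
}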

cncook001 commented 8 years ago

Use case: New Relic synthetic monitoring of www.example.com. Every minute, 5 New Relic locations (Amazon data centers) do a "url ping" of www.example.com.

3 New Relic url ping check locations fail, New Relic alert is sent to bosun.

Bosun alert rule "if >50% of locations testing www.example.com fail, then send alert to pager duty".

Bosun sends an alert to pager duty since the >50% condition was met. The pager duty policy is set to delay waking someone up for 5 minutes.

1 minute later, the 5 New Relic locations (Amazon data centers) do their "url ping" of www.example.com again. All locations pass. An "all clear" notice is sent to bosun for this incident. Bosun sends the "all clear" to pager duty for this incident, the pager duty issue resolves, and nobody is woken up.

This "fail, then 1 minute later recovered" incident is displayed in bosun's "history of events" for a human to review and see patterns later.

Bypassing bosun to send alerts directly to pager duty is not the answer. We would like bosun to make rule decisions based on its data; pager duty can't do that.

Flap detection built into bosun would be nice, but that is an extra feature request unrelated to this RFE.
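
For reference, a sketch of how the ">50% of locations failing" rule could be expressed in Bosun, assuming the New Relic results are pushed into Bosun as a metric; the metric name, tags, template, and notification name below are all assumptions:

alert www_example_com_down {
    template = synthetics_tmpl   # hypothetical template
    # latest result per location; 1 = failed, 0 = passed (assumed encoding)
    $failed = last(q("sum:newrelic.synthetics.failed{site=www.example.com,location=*}", "10m", ""))
    # fraction of locations currently failing
    $ratio = sum(t($failed, "site")) / len(t($failed, "site"))
    crit = $ratio > 0.5
    critNotification = pagerduty   # assumes a pagerduty notification is defined
}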

kylebrandt commented 8 years ago

@cncook001 In this case I don't understand why an alert is sent to pager duty if people might not be woken up.

kylebrandt commented 8 years ago

Also relevant to this, a blog post on some of my thinking: http://kbrandt.com/post/alert_status/

cncook001 commented 8 years ago

I agree with your blog post. Don't send an alert unless you really have a problem. Also, only send an alert to someone who can do something about it.

In this case I don't understand why an alert is sent to pager duty if people might not be woken up.

New Relic alerting is "very poor". It is a known issue for them, and we don't know when New Relic alerting will get any sort of "intelligence". I don't want to wake people up due to their poor alerting. We should throw New Relic out, but this is one of the monitoring systems we need to work with.

If bosun can handle recovery messages nicely, we will initially implement it to utilize its alerting engine.

You could change the alert to not trigger unless 50% of locations have been down for longer than X.

Agreed. But what happens when the rule fires because the X-minute condition is true, and 1 minute later the condition is resolved? I may have received the alert and seen it on my phone, and before I got out of bed the "all clear" came through. I don't have to get out of bed, fire up the laptop, VPN into the site, find the dashboard, and turn off the alert.

As a general rule (for some components) I prefer not to send an alert until it has failed for two test cycles, e.g. SNMP or ping. SNMP may not respond since it is low on the device's priority list, but the device may be working fine.

You can also have a warn that is more sensitive, but won't notify pagerduty, and a crit that does

With my first example, this would not matter. New Relic does not have reliable location monitoring. We don't control the Amazon data centers nor do we control the quality of New Relic synthetics.

Another option is to use escalation chains, and only send to pagerduty when a human hasn't acked the alert within a certain amount of time.

We only want a human involved when it really is a problem. Something that goes bad and then quickly good again is not an offense we want to wake someone up for. If bosun generates an alert that starts the escalation chain, and then the issue is resolved before a human is woken up, how would that work?

kylebrandt commented 8 years ago

Want to make sure I understand some things:

  1. Somehow you use New Relic as a data source for bosun. That data source has some built-in availability issues, which leads to some alerts triggering falsely based on observation data missing (and not the site being down).
  2. When this triggers, bosun notifies pagerduty
  3. Pagerduty calls someone

Between 2 and 3, there is some lag, where the pagerduty escalation could have been stopped, but it wasn't.

Is this a correct characterization of the scenario?

kylebrandt commented 8 years ago

3 New Relic url ping check locations fail, New Relic alert is sent to bosun.

Bosun alert rule "if >50% of locations testing www.example.com fail, then send alert to pager duty

That confuses me. How does New Relic send an alert to bosun? I'm not clear whether the fail logic is in bosun or New Relic.

cncook001 commented 8 years ago
  1. Correct. We have an app that calls the New Relic API ("give me the results from all locations that tested www.example.com") and sends those results directly to bosun.
  2. If bosun alert rule == true (for this instance), immediately notify pagerduty.
  3. Pagerduty is also "not smart" with escalation delays. We have a dummy user that receives the first alert. After X minutes, if Pagerduty has not received an "all clear" for this incident, it escalates to a human who gets woken up.

New Relic's alert logic has absolutely failed us. I should be able to say in New Relic "only send an alert out if >50% of locations failed", but I can't. I am trying to use Bosun to implement that logic. Bosun certainly can send the alert out, but it can't send the "great, we are working again" message.

gbrayut commented 8 years ago

We don't use Bosun for very many short-term, self-healing, or availability-style alerts. We use Pingdom or RainTank for that, which already have recovery alerts and integrate with PagerDuty.

Most of our Bosun alerts are things like disk space, high cpu, puppet errors, etc, which require a human to investigate and verify that the root cause was fixed.

Bosun can create log=true alerts, which are designed to skip the entire alert workflow, for things like spikes in exceptions. In that case we don't want a human involved; it is more of a raw realtime notification ("this event happened at this time"), and the alert defines how often it triggers (once when > 1000 exceptions in the last 5m, runEvery = 1 for a 60s trigger time, maxLogFrequency = 10m so it only fires a max of once every 10m).
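
A minimal sketch of that kind of log-style alert, using the knobs mentioned above; the metric, template, and notification names are assumptions:

alert exception_spike {
    template = exception_tmpl     # hypothetical template
    log = true                    # log alert: no incident, no ack/close workflow
    runEvery = 1                  # evaluate every scheduler cycle (60s here)
    maxLogFrequency = 10m         # notify at most once every 10 minutes
    # exception count over the last 5 minutes; metric name is assumed
    crit = sum(q("sum:app.exceptions{app=*}", "5m", "")) > 1000
    critNotification = chat       # assumes a chat/webhook notification is defined
}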

That way we still don't need a "recovery" alert, since we watch the graphs and chat notifications and know we are in the clear if it hasn't triggered in the last 10m.

efficks commented 7 years ago

+1 Also, it would be useful to be able to notify an external system via HTTP that the system is recovering.

crandles commented 7 years ago

We are also using bosun as a way to relay alerts to various systems operated by different NOC teams; the external systems are used for incident management instead of Bosun, so in this instance we end up ignoring Bosun's state machine.

We are using a fork that adds an "autoClose" flag per alert; if an incident is open and an event with a status of normal is received, the incident is closed. This doesn't support custom recovery templates/messages, though. Would this be useful/accepted upstream?
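
To make the fork's behavior concrete, a sketch of how such a flag might be set; the flag is fork-specific (not an upstream Bosun option), and the metric, template, and notification names are made up:

alert noc_relay {
    template = noc_tmpl             # hypothetical template
    autoClose = true                # fork-only flag: an incoming normal event closes any open incident
    crit = avg(q("sum:app.errors{service=*}", "5m", "")) > 100
    critNotification = noc_webhook  # external NOC system receiving the initial alert
}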

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.