AcalephStorage / consul-alerts

A simple daemon to send notifications based on Consul health checks
GNU General Public License v2.0

Smart routing alerts #108

Open panda87 opened 8 years ago

panda87 commented 8 years ago

Hi

I have an idea / question. I'm trying to understand whether consul-alerts supports routing alerts to different handlers based on alert severity. For example, if I have a cluster with 5 nodes and one node goes down, for me that's just a "warn" and I want a Slack notification, since the other 4 instances are still taking the traffic; but if only one node is left (or any other threshold I set), then it's critical and I want the alert sent as "crit" through the OpsGenie handler.

In other words, can the "warn" and "crit" levels be affected by the number of "live" services?

Does this scenario already exist in consul-alerts, or maybe part of it?

Thanks D.

nhproject commented 8 years ago

+1 !

mfischer-zd commented 8 years ago

+1

I'd like to add hysteresis as well; a noisy service (frequent fail/good cycling within a small time window) should not result in an alert storm. Alerts should be windowed. (I can add another issue to cover this if you prefer.)

fusiondog commented 8 years ago

@panda87 It sounds like you are interested in what Sensu calls Aggregate Checks. https://sensuapp.org/docs/0.16/api_aggregates

I don't know that this is something consul-alerts should manage. You could write a check that hits the Consul API and counts the statuses for a service, and then alert on that check instead.
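Something like this (an untested Go sketch) is what I have in mind — the service name "web", the warn/crit thresholds, and the local agent address are just for illustration; exit codes follow Consul's script-check convention (0 = passing, 1 = warning, anything else = critical):

```go
// aggregate_check.go — sketch of an "aggregate" check that counts healthy
// instances of a service via the local Consul agent's health API.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

type healthEntry struct {
	Checks []struct {
		Status string `json:"Status"`
	} `json:"Checks"`
}

func main() {
	const service = "web" // assumption: the service to aggregate
	const warnBelow = 4   // warn when fewer than 4 healthy instances remain
	const critBelow = 2   // critical when fewer than 2 remain

	resp, err := http.Get("http://127.0.0.1:8500/v1/health/service/" + service)
	if err != nil {
		fmt.Println("cannot reach consul:", err)
		os.Exit(2)
	}
	defer resp.Body.Close()

	var entries []healthEntry
	if err := json.NewDecoder(resp.Body).Decode(&entries); err != nil {
		fmt.Println("cannot decode response:", err)
		os.Exit(2)
	}

	// An instance counts as healthy only if all of its checks are passing.
	healthy := 0
	for _, e := range entries {
		ok := true
		for _, c := range e.Checks {
			if c.Status != "passing" {
				ok = false
				break
			}
		}
		if ok {
			healthy++
		}
	}

	fmt.Printf("%d/%d healthy instances of %s\n", healthy, len(entries), service)
	switch {
	case healthy < critBelow:
		os.Exit(2) // critical
	case healthy < warnBelow:
		os.Exit(1) // warning
	default:
		os.Exit(0) // passing
	}
}
```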

@mfischer-zd Similar to the aggregate, it may be best to write checks that report warn and crit based on hysteresis. The checks are responsible for determining the state. http://planet.nagios.org/archives/85-nagios-ideas/2503-hysteresis-support
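As a rough sketch of what a hysteresis-aware check could look like (the thresholds, state-file path, and metric source below are illustrative placeholders, not anything consul-alerts provides):

```go
// hysteresis_check.go — sketch of a check that applies hysteresis itself:
// it enters "critical" above one threshold and only clears below a lower one,
// remembering its previous state in a small state file between runs.
package main

import (
	"fmt"
	"os"
	"strings"
)

const (
	stateFile  = "/tmp/mycheck.state" // assumption: where the last state is kept
	critAbove  = 90.0                 // enter critical above 90
	clearBelow = 75.0                 // only clear once back below 75
)

// readMetric is a placeholder for however the real check measures its value.
func readMetric() float64 {
	return 80.0
}

func main() {
	prev := "passing"
	if b, err := os.ReadFile(stateFile); err == nil {
		prev = strings.TrimSpace(string(b))
	}

	value := readMetric()
	state := prev
	switch {
	case value > critAbove:
		state = "critical"
	case value < clearBelow:
		state = "passing"
		// between clearBelow and critAbove: keep the previous state (hysteresis band)
	}

	_ = os.WriteFile(stateFile, []byte(state), 0o644)
	fmt.Printf("value=%.1f state=%s\n", value, state)
	if state == "critical" {
		os.Exit(2)
	}
	os.Exit(0)
}
```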

The description in that link is a little different from what you describe regarding the cycling window, though. Right now that is handled with a global threshold value that defaults to 60, which prevents alerts from flapping too much, e.g. consul-alerts/config/checks/change-threshold = 30.

If that is more what you mean, then it could make sense to make that setting part of the profiles, so that a specific service, check, or host can have a different threshold than the default.
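For illustration, here is a sketch of overriding that global setting by writing the KV key mentioned above with the official Consul Go client (github.com/hashicorp/consul/api); the value 30 is just an example:

```go
// set_threshold.go — sketch that writes the consul-alerts change-threshold
// key via a local Consul agent.
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// DefaultConfig talks to the local agent (127.0.0.1:8500) unless overridden.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Lower the global change threshold from the default 60 to 30.
	pair := &api.KVPair{
		Key:   "consul-alerts/config/checks/change-threshold",
		Value: []byte("30"),
	}
	if _, err := client.KV().Put(pair, nil); err != nil {
		log.Fatal(err)
	}
	log.Println("change-threshold set to 30")
}
```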

panda87 commented 8 years ago

I agree with @mfischer-zd. For us, the ability to send each service's alerts to a different Slack channel would emphasize each team's responsibility for its own service. In addition, changing the interval time per service would also be useful. My point is that a "critical" alert in Consul is not really critical when you look at a full cluster of 10 nodes where one node has collapsed, which is fine for any normal cluster. So when one server is down, I'd like another query that checks x/y (the online servers out of the total cluster size) and then sends warn or crit according to a threshold.

@fusiondog Do you think this scenario will be implemented at some point? Thanks

babbottscott commented 6 years ago

Is this doable with notif-profiles' varoverrides now?