AcalephStorage / consul-alerts

A simple daemon to send notifications based on Consul health checks
GNU General Public License v2.0

Spurious Notifications #157

Open jgillard opened 7 years ago

jgillard commented 7 years ago

I recently created a Notification Profile to direct serfHealth alerts to email, and since then I have been getting a lot of "System is HEALTHY" emails without a corresponding "System is CRITICAL" email beforehand.

I enabled debug logging a few days ago and an excerpt is below. You can see that the node was critical at one point but never triggered an alert. Once it became stable, an email was sent after 90 seconds. We've never had this problem with PagerDuty, presumably because it wouldn't have an incident to resolve.

INFO[143330] Registering new health check: node=ip-10-0-204-253, service=, check=Serf Health Status, status=critical 
INFO[143350] ip-10-0-204-253::Serf Health Status is pending status change from  to critical for 20.343651277s. 
INFO[143360] ip-10-0-204-253::Serf Health Status is now pending status change from  to passing. 
INFO[143380] ip-10-0-204-253::Serf Health Status is now pending status change from  to critical. 
INFO[143400] ip-10-0-204-253::Serf Health Status is pending status change from  to critical for 19.755810602s. 
INFO[143419] ip-10-0-204-253::Serf Health Status is now pending status change from  to passing. 
INFO[143439] ip-10-0-204-253::Serf Health Status is pending status change from  to passing for 19.893641664s. 
INFO[143459] ip-10-0-204-253::Serf Health Status is pending status change from  to passing for 39.648083805s. 
INFO[143480] ip-10-0-204-253::Serf Health Status is pending status change from  to passing for 1m0.650671765s. 
INFO[143501] ip-10-0-204-253::Serf Health Status is pending status change from  to passing for 1m21.307562045s. 
INFO[143521] ip-10-0-204-253::Serf Health Status has changed status from  to passing. 
INFO[143543] Getting profile for node:  ip-10-0-204-253  service:    check:  serfHealth 
bdronneau commented 7 years ago

I see the same behaviour, but I think it comes from the change-threshold setting: https://github.com/AcalephStorage/consul-alerts#health-checks. Your node is critical for only about 20 sec at a time, so the critical state never reaches the threshold. The passing state, on the other hand, lasts longer than 60 sec, so a notification is sent.
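
To illustrate the explanation above, here is a minimal Python sketch of a change-threshold debounce. It is not the consul-alerts implementation: the function name `notifications`, the sample list, and the exact reset rules are assumptions made for illustration. The timestamps and statuses are taken from the log excerpt, and the 60-second threshold is the value mentioned in this comment.

```python
from typing import List, Tuple

# (timestamp in seconds, observed status), read off the log excerpt above.
SAMPLES: List[Tuple[int, str]] = [
    (143330, "critical"), (143350, "critical"),
    (143360, "passing"),
    (143380, "critical"), (143400, "critical"),
    (143419, "passing"), (143439, "passing"), (143459, "passing"),
    (143480, "passing"), (143501, "passing"), (143521, "passing"),
]


def notifications(samples: List[Tuple[int, str]],
                  threshold: int) -> List[Tuple[int, str, str]]:
    """Hypothetical debounce: a status is only confirmed (and notified)
    after it has been observed continuously for `threshold` seconds."""
    confirmed = ""          # no confirmed status yet, as in the "from  to ..." log lines
    pending = None          # status currently waiting to be confirmed
    pending_since = 0
    sent = []
    for t, status in samples:
        if status == confirmed:
            pending = None  # back to the confirmed status: nothing is pending
            continue
        if status != pending:
            # A different status showed up: restart the timer.
            pending, pending_since = status, t
            continue
        if t - pending_since >= threshold:
            sent.append((t, confirmed, status))
            confirmed, pending = status, None
    return sent


if __name__ == "__main__":
    for t, old, new in notifications(SAMPLES, threshold=60):
        print(f"{t}: status changed from {old!r} to {new!r}")
```

Under these assumptions, each critical spell restarts the timer before it reaches the threshold, so the only status ever confirmed is passing. That is why a "HEALTHY" email can arrive without a preceding "CRITICAL" one.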