arachnys / cabot

Self-hosted, easily-deployable monitoring and alerts service - like a lightweight PagerDuty

[question] For a service, one check goes into error, then another a little later #281

Open blysik opened 8 years ago

blysik commented 8 years ago

Hi,

Just a question on what the behavior is supposed to be.

  1. create a service, and assign it two checks: graphite, and http.
  2. http check goes into error, and then an alert is triggered. (Importance of 'Error'.)
  3. A few moments later, the graphite check fails (importance of 'Critical'); however, no alert appears to be triggered.

Shouldn't another alert be triggered for 3?

xinity commented 8 years ago

Nope,

Alerts are triggered per service, not per check. So if an alert has already been triggered for the service, Cabot will not trigger another one when a further check fails, until the alert notification interval is reached.

@dbuxton please fix me if I'm wrong ;-)
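
In other words, the gate is a single per-service timestamp. Here is a minimal, self-contained sketch of that behaviour, not Cabot's actual code: the `Service` dataclass, the `maybe_alert` helper and the 10-minute value are illustrative stand-ins, while `last_alert_sent` and `ALERT_INTERVAL` are the real names referenced later in this thread.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ALERT_INTERVAL = timedelta(minutes=10)  # stand-in for Cabot's ALERT_INTERVAL setting


@dataclass
class Service:
    name: str
    overall_status: str = "PASSING"            # e.g. PASSING / WARNING / ERROR / CRITICAL
    last_alert_sent: Optional[datetime] = None


def maybe_alert(service: Service) -> bool:
    """Send at most one alert per service per ALERT_INTERVAL, no matter how
    many of the service's checks fail inside that window."""
    if service.overall_status == "PASSING":
        return False
    now = datetime.utcnow()
    if service.last_alert_sent and now - service.last_alert_sent < ALERT_INTERVAL:
        # An alert already went out recently for this service, so a second
        # failing check inside the window is silenced.
        return False
    service.last_alert_sent = now
    return True  # the caller would actually send the notification here


# The scenario from this issue: http check fails -> alert is sent;
# graphite check fails a minute later at higher severity -> no second alert.
svc = Service("my-service", overall_status="ERROR")
assert maybe_alert(svc) is True
svc.overall_status = "CRITICAL"
assert maybe_alert(svc) is False
```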

blysik commented 8 years ago

Wouldn't that be a problem if the first check was just a warning, and the next check was critical?

xinity commented 8 years ago

It depends. IMHO it might be interesting to have an escalation of a service's failure state, like DEFCON levels in war movies ;-)

but I don't think this is easy to implement, nor do I think it would be widely used.

blysik commented 8 years ago

I think, as currently designed, critical errors might go unnoticed.

  1. Service has a low priority check fail, which generates a warning.
  2. Alert goes out to Ops team.
  3. Ops team sees it's a low priority check, ignores it until morning.
  4. A more severe check fails, but no alert gets sent.
  5. Ops team doesn't know about it.

dbuxton commented 8 years ago

At the moment we just track the last alert sent as a timestamp (Service.last_alert_sent - see https://github.com/arachnys/cabot/blob/fc33c9859a6c249f8821c88eb8506ebcad645a50/cabot/cabotapp/models.py#L180) - we don't track what kind of alert it was.

It would be easy to also track a Service.last_alert_sent_overall_status and compare it to the current overall status to ensure that this issue doesn't occur.

Happy to merge anything that does this; I too think this is a big potential problem. However, it won't silence alerts until morning @blysik, just for ALERT_INTERVAL.
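
A rough sketch of the fix described above, building on the simplified `Service` / `maybe_alert` sketch earlier in this thread rather than on Cabot's real models; the severity ordering and the exact re-alert rule are assumptions, not the project's implementation.

```python
SEVERITY = {"PASSING": 0, "WARNING": 1, "ERROR": 2, "CRITICAL": 3}  # assumed ordering


def maybe_alert_with_escalation(service) -> bool:
    """Like maybe_alert above, but also re-alerts inside ALERT_INTERVAL when
    the service's overall status has become more severe since the last alert."""
    if service.overall_status == "PASSING":
        return False
    now = datetime.utcnow()
    recently_alerted = (
        service.last_alert_sent is not None
        and now - service.last_alert_sent < ALERT_INTERVAL
    )
    last_status = getattr(service, "last_alert_sent_overall_status", "PASSING")
    escalated = SEVERITY[service.overall_status] > SEVERITY.get(last_status, 0)
    if recently_alerted and not escalated:
        # Already alerted recently and the situation has not got worse.
        return False
    service.last_alert_sent = now
    service.last_alert_sent_overall_status = service.overall_status
    return True


# Same scenario: the later, more severe graphite failure now alerts instead
# of being silenced.
svc = Service("my-service", overall_status="ERROR")
assert maybe_alert_with_escalation(svc) is True
svc.overall_status = "CRITICAL"
assert maybe_alert_with_escalation(svc) is True
```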

blysik commented 8 years ago

Aha! ALERT_INTERVAL. Okay, so not as bad.