canonical / charm-prometheus-juju-exporter

Charm that deploys an exporter publishing statistics about Juju-deployed machines
Apache License 2.0

Add alert rule for juju machines being down #26

Closed sudeephb closed 1 year ago

sudeephb commented 1 year ago

Alert when the number of machines has decreased compared to 1 hour ago. The alert starts firing only if the condition holds continuously for 15 minutes; this avoids alerts when machines are flapping (e.g., go down and come back up within a few minutes). The alert can be tested by removing one or more Juju machines; it will start firing after 15 minutes.
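The rule described above can be sketched as a Prometheus alerting rule. This is only an illustration: the metric name `juju_machine_count` and the labels are assumptions for the sake of the example, not the exporter's actual metric names.

```yaml
groups:
  - name: juju-machines
    rules:
      - alert: JujuMachineCountDecreased
        # Hypothetical metric name; substitute the machine-count
        # metric actually exposed by the exporter.
        expr: juju_machine_count < (juju_machine_count offset 1h)
        # Fire only if the decrease persists for 15 minutes,
        # so short-lived flapping does not trigger the alert.
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Number of Juju machines has decreased compared to 1h ago"
```

Note that once the 1-hour offset window moves past the drop, both sides of the comparison see the reduced count, so the expression stops matching and the alert resolves on its own.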

sudeephb commented 1 year ago

> It's an interesting way to detect the warning condition. Traditionally, we would set up a fixed threshold for the number of machines, so the warning would persist until the number returned to normal. With this logic, though, the warning may disappear after 1 hour, is that correct? Because after 1 hour, the comparison result will be the same even if the number is wrong.

Yes, the alert will disappear in an hour. The idea behind this alert is to detect machines going down during normal operation of the cloud. The alert firing for 1 hour should be enough to inform the operator that this has occurred. In cases where this is expected (planned downscaling, maintenance, etc.), the operator can mute/ignore the alerts on the receiver end (e.g., PagerDuty). Also, we don't know what the 'correct' number is, so if we kept the alert on for too long, it would keep firing even when the machines are scaled down intentionally.

Pjack commented 1 year ago

> It's an interesting way to detect the warning condition. Traditionally, we would set up a fixed threshold for the number of machines, so the warning would persist until the number returned to normal. With this logic, though, the warning may disappear after 1 hour, is that correct? Because after 1 hour, the comparison result will be the same even if the number is wrong.

> Yes, the alert will disappear in an hour. The idea behind this alert is to detect machines going down during normal operation of the cloud. The alert firing for 1 hour should be enough to inform the operator that this has occurred. In cases where this is expected (planned downscaling, maintenance, etc.), the operator can mute/ignore the alerts on the receiver end (e.g., PagerDuty). Also, we don't know what the 'correct' number is, so if we kept the alert on for too long, it would keep firing even when the machines are scaled down intentionally.

Could we ask one of the COEs to comment on this behavior? If we designed this alert around a fixed number, they could change the threshold when they want to scale down intentionally; that is how the Nagios solution works. If the alert disappears in an hour, I'm not sure that matches everyone's expectation. When the alert disappears, no one may remember it, and the broken state could continue for a couple of days until someone notices it again.

mkalcok commented 1 year ago

This one racked up enough approvals, and since it matches requests from Bootstack, I'm gonna merge it.