dolittle-platform / Home


Revise alerts granularity and what gets alerted on #115

Open einari opened 5 years ago

einari commented 5 years ago

Today we're getting a lot of alerts that aren't really problems. We need to refine them so that when an alert fires, we can trust that there is a real reason for it.

The alert manager itself seems to be a rather unstable piece of software, since it crashes all the time. Kubernetes brings it back up immediately, so it's not really down for long.

Ideally, the alert manager shouldn't go down at all. Failing that, it would be more useful to alert only when it, or any other pod, has not come back up within a defined threshold.
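
To make the threshold idea concrete, here is a minimal sketch of the decision logic in Python. It is not how our alerting is implemented today; `PodStatus`, `pods_needing_alert`, and the pod names are hypothetical and only illustrate "alert when a pod has been down longer than a defined threshold". In Prometheus, the `for` clause on an alerting rule serves a similar purpose.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class PodStatus:
    """Hypothetical snapshot of a pod; not a Kubernetes API type."""
    name: str
    ready: bool
    # When the pod was first observed as not ready; None while it is ready.
    not_ready_since: Optional[datetime] = None


def pods_needing_alert(
    pods: List[PodStatus], now: datetime, threshold: timedelta
) -> List[str]:
    """Return names of pods that have been not ready for at least `threshold`.

    A pod that crashes and is restarted within the threshold never shows up
    here, so short restarts do not produce alerts.
    """
    return [
        pod.name
        for pod in pods
        if not pod.ready
        and pod.not_ready_since is not None
        and now - pod.not_ready_since >= threshold
    ]
```

With a threshold of, say, five minutes, an alert manager that crashes and is back within seconds would stay silent, while a pod stuck in a crash loop for ten minutes would be reported.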
