feat: deployment health notifications

dannykopping commented 2 months ago

Relates to https://github.com/coder/coder/issues/13328

Create notification_template instances for the following events, and enqueue notifications with those templates:

[ ] access URL unhealthy
[ ] database connection unhealthy
[ ] DERP unhealthy
[ ] provisioner daemons unhealthy
[ ] websockets unhealthy
[ ] wsproxy unhealthy

Admins should receive these notifications.

johnstcn commented 3 weeks ago

@bpmct @stirby We discussed this in weekly standup:

We believe large organizations are the most likely to benefit from this feature.
However, large organizations are more likely to instead monitor their Coder deployment using Prometheus/Grafana, or hit up the healthcheck deployment directly and integrate it into their existing monitoring solutions.
Deployment administrators will most likely need ways to control consecutive failure thresholds etc. before reporting to avoid spamming their ops-on-call@acme.corp email with useless notifications.

Based on the above, we're not sure how much value this would actually provide. Are you aware of any customers requesting this feature? We can most likely support this instead with a custom dashboard in our coder/observability chart.
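To make the "consecutive failure thresholds" point concrete, here is a minimal sketch of what such a gate might look like. `ThresholdGate`, `NewThresholdGate`, and `Observe` are hypothetical names for illustration only, not existing Coder code, and the sketch assumes a threshold of at least 1.

```go
package main

// ThresholdGate is a hypothetical helper (not part of Coder) that suppresses
// a notification until a check has failed a configured number of times in a row.
type ThresholdGate struct {
	threshold   int            // assumed to be >= 1
	consecutive map[string]int // check name -> current run of failures
}

// NewThresholdGate returns a gate that fires after `threshold` consecutive failures.
func NewThresholdGate(threshold int) *ThresholdGate {
	return &ThresholdGate{threshold: threshold, consecutive: make(map[string]int)}
}

// Observe records one healthcheck result. It returns true exactly once per
// failure streak: at the moment the streak length reaches the threshold.
func (g *ThresholdGate) Observe(check string, healthy bool) bool {
	if healthy {
		g.consecutive[check] = 0 // a success ends the streak
		return false
	}
	g.consecutive[check]++
	return g.consecutive[check] == g.threshold
}
```

With a threshold of 3, two failed DERP checks followed by a success would never produce a notification; only the third failure in a row would.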

bpmct commented 3 weeks ago

Are you aware of any customers requesting this feature?

There are several customers who have set up their own alerting/monitoring for the health notifications as they've requested limited API token scopes or improved views that are more machine-readable. Because of that, I think there's value in adding these as a customer can use a webhook to send these notifications to their system.

However, large organizations are more likely to instead monitor their Coder deployment using Prometheus/Grafana, or hit up the healthcheck deployment directly and integrate it into their existing monitoring solutions.

Honestly, this argument can be made for all deployment-wide notifications (user created, user deleted, build failed, etc) as something like alert-manager can be used to send these to the proper channel. If anything, I think the deployment health notification is the most important and actionable alert we could do deployment-wide.

bpmct commented 3 weeks ago

It seems like we need to take a stance on the scope for deployment/org wide notifications. There's one argument that the customer's observability stack should handle all of these, and another that many large users do not have a mature observability stack, or that building one is a lot of effort, so this reduces the effort necessary.

dannykopping commented 3 weeks ago

It seems like we need to take a stance on the scope for deployment/org wide notifications.

My stance on this is: customers aren't asking for this, so let's defer. Communication is cognitive overhead, and we'll be venturing into the alerting arena here rather than notifications. Alerting is far more subtle because it has to take into account severity, flapping, etc - and this will complicate the design. If the DERP check fails, is it really causing a problem? If it fails and succeeds regularly (flapping), do we notify each time or limit to a certain number of messages a day? Cause-based alerting (which this would be) is generally unhelpful.

I created this issue but the more I think about it, it's a bad idea.

Admins already have the web UI, the API, and - if all else fails - their users, to find out when these healthchecks are failing.

stirby commented 2 days ago

This makes more sense as a summary report. Weekly, admins would receive a report of any downtime on Coder services in their deployment. I get these emails from services we use internally, and it's another place for less mature platform teams to check consistency in the Coder service.

@dannykopping makes a good point about us not becoming a high-frequency alerting system. Our notifications should be slow, async monitoring tools. What do you all think about converting this into a report?

coder / internal

feat: deployment health notifications #18