BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (this includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3)

Codifying Long Term Silences for AlertManager #4727

Open wmhutchison opened 7 months ago

wmhutchison commented 7 months ago

Describe the issue
As a result of delegating PodDisruptionBudget notices to end users, we needed to tell the cluster AlertManager instance to stop informing Platform Operations about them. To that end, we are pursuing codifying such a silence so it can be injected into the cluster AlertManager configuration, meaning we no longer need to continually maintain silences by hand in every cluster. This EPIC will track all related tickets for silences currently entered via the web console which should really be codified instead.
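As a rough sketch of what one codified silence could look like, the commands below create it from the command line with amtool instead of the web console. The alert name, Alertmanager URL, and duration are assumptions for illustration only and would need to match the actual PodDisruptionBudget alert rule and in-cluster route.

# Sketch only: create a long-lived silence for PodDisruptionBudget notices.
# The alert name and Alertmanager URL are assumptions; adjust them to the
# real rule name and in-cluster service before use.
amtool silence add \
  --alertmanager.url="https://alertmanager-main.openshift-monitoring.svc:9094" \
  --author="platform-ops" \
  --comment="Codified: PDB notices are delegated to end users" \
  --duration="8760h" \
  alertname=PodDisruptionBudgetAtLimit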


How does this benefit the users of our platform? Improved consistency in how Platform Operations manages OpenShift clusters.

Definition of done

wmhutchison commented 3 months ago

While "long term" for this would mean for the duration of a daily maintenance window, here's what has been getting created manually for supporting scenarios where Openshift nodes need to be taken down for an extended period of time (firmware updates for physicals, supporting ESXi host upgrades for VMs).

Silence #1

alertname=ClusterOperatorDown
name=machine-config

Silence #2

alertname=KubeNodeUnreachable
node=~Node1|Node2|... (regex)

The first silence is applied once at the start of the maintenance day and set to expire near EOD, when all nodes are expected to be back up and in use again.

The second silence is applied and then adjusted depending on which batch of nodes is currently out of commission or about to be taken out of commission.
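For reference, both silences could be created from the command line rather than the web console. A minimal sketch with amtool, assuming an 8h maintenance window, a 2h per-batch window, and placeholder node names:

# Silence #1: applied once at the start of the maintenance day,
# expiring near EOD when all nodes should be back in service.
amtool silence add \
  --comment="Planned maintenance: machine-config CO expected down" \
  --duration="8h" \
  alertname=ClusterOperatorDown name=machine-config

# Silence #2: recreated or adjusted per batch; node names are placeholders.
amtool silence add \
  --comment="Planned maintenance: current batch of nodes down" \
  --duration="2h" \
  alertname=KubeNodeUnreachable 'node=~"node1|node2"'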

wmhutchison commented 3 months ago

Looking at current emails, the following categories of AlertManager alerts that do not go to on-call could theoretically be codified, either easily or with a bit of mental elbow grease.

Alerts which will not be automated for silencing due to being too generalized.

It also helps that, unlike the web console, emails from AlertManager aggregate similar alerts into a single email.

For automating silences, priority would be given to what on-call receives, with the rest of the list above to follow as time permits. The latter frankly just reduces email output slightly, but it is still viable to pursue if it can be done reliably without issue.
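One possible shape for that automation, sketched under assumptions: each codified silence lives in git as a small script (or data file) that a scheduled job runs at the start of the window, POSTing to the Alertmanager v2 silences API. The URL, window length, and credential handling below are placeholders.

# Sketch: create a silence via the Alertmanager v2 API from a scheduled job.
# Assumes GNU date; the URL and the 8-hour window are placeholders.
START=$(date -u +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u -d '+8 hours' +%Y-%m-%dT%H:%M:%SZ)
curl -s -X POST "https://alertmanager.example.com/api/v2/silences" \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "ClusterOperatorDown", "isRegex": false},
      {"name": "name", "value": "machine-config", "isRegex": false}
    ],
    "startsAt": "'"$START"'",
    "endsAt": "'"$END"'",
    "createdBy": "platform-ops-automation",
    "comment": "Codified silence: daily maintenance window"
  }'

The response includes a silenceID, which the job could log so the silence can be expired early if the maintenance window finishes ahead of schedule.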

wmhutchison commented 3 months ago

For all of the above, nothing would be pursued for automation during a regular OpenShift upgrade. The reason is that we have no easy or reliable way of knowing which nodes are being worked on and/or planned next, versus firmware/ESXi patching, where we control and know which nodes are in play. OpenShift maintenance also goes pretty fast on a per-node basis, and we likely want all of this unsilenced so we know something is amiss if alerts get spammier over time, as that might indicate a stuck node upgrade.

About the only silence I have used in the past during an OpenShift upgrade is the one involving the machine-config CO, but that is just a single silence which is easily applied manually as well.