The fact that you're observing this only in GKE Autopilot clusters is likely related to the timing of our release rollouts. We introduced a change in #691 that scales alertmanager to zero when no rules are configured using our Rules, ClusterRules, or GlobalRules resources.
Can you provide a bit more information about how you are using Alertmanager? Do you have any of those Rules configured?
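In case it helps, a quick way to check is to list all three resource types across the cluster; this is a sketch, using the fully-qualified names to avoid clashing with any other CRD that also happens to be called "rules":
❯ kubectl get rules.monitoring.googleapis.com,clusterrules.monitoring.googleapis.com,globalrules.monitoring.googleapis.com -A
If all three come back with "No resources found", the operator will scale alertmanager down.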
We manage alert rules through Grafana, so we don't have any of the Rules objects configured in those clusters. I can try to create a Rules object and see if the alertmanager pod comes back.
It may also be worth checking if the StatefulSet that manages the alertmanager pods still exists: kubectl get -n gke-gmp-system statefulset/alertmanager
The StatefulSet is indeed still there:
❯ kubectl get -n gke-gmp-system statefulset/alertmanager
NAME READY AGE
alertmanager 0/0 370d
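So it looks like it was scaled down to zero rather than deleted. Double-checking the desired replica count directly (a sketch), which should print 0 here:
❯ kubectl get -n gke-gmp-system statefulset/alertmanager -o jsonpath='{.spec.replicas}'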
I just applied the example-rule from the repo and, sure enough, alertmanager is back:
❯ kubectl get -n gke-gmp-system statefulset/alertmanager
NAME READY AGE
alertmanager 1/1 555d
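For reference, the apply was along these lines (assuming the example still lives at examples/rules.yaml in the prometheus-engine repo; adjust the path/namespace for your setup):
❯ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/main/examples/rules.yaml  # path assumed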
I'll try to craft a rule that never actually alerts us, just so something is configured and alertmanager doesn't scale to 0, unless you have another option I can try.
Something like our example rule should be a good starting point.
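If you want something that can never page anyone, a minimal placeholder Rules object along these lines should work; this is a sketch, the names are just examples, and the expression can never be true, so the alert never fires:
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: alertmanager-keepalive   # example name
  namespace: default             # example namespace
spec:
  groups:
  - name: keepalive
    interval: 30s
    rules:
    - alert: NeverFires
      expr: vector(0) > 1        # always false, so it never fires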
That will be the best workaround for now. I'll discuss with the team whether it makes sense for us to implement another solution in future releases.
Not sure if this is the right place to report such a bug, but we've been using managed alertmanager for a year now, and it seems that in the last few days, the pod has disappeared from a few of our clusters, all GKE Autopilot clusters (if that matters).
The config Secret is still there and unchanged, but the alertmanager pod is gone.
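(For reference, the checks were roughly the following; a sketch, since secret names and namespaces vary by setup.)
❯ kubectl get secrets -A | grep alertmanager
❯ kubectl get pods -n gke-gmp-system | grep alertmanager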
On another non-autopilot cluster:
I also see that the namespace for the GMP pods is not the same on an Autopilot vs. a regular cluster; not sure if that has anything to do with it.
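A quick way to see which namespace the GMP components landed in on each cluster (a sketch):
❯ kubectl get namespaces | grep gmp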