GoogleCloudPlatform / prometheus-engine

Google Cloud Managed Service for Prometheus libraries and manifests.
https://g.co/cloud/managedprometheus
Apache License 2.0

Managed alertmanager no longer running in clusters #990

Closed: gravelg closed this issue 5 months ago

gravelg commented 5 months ago

Not sure if this is the right place to report such a bug, but we've been using managed alertmanager for a year now, and it seems that in the last few days, the pod has disappeared from a few of our clusters, all GKE Autopilot clusters (if that matters).

The config Secret is still there and unchanged:

❯ k -n gmp-public get secrets
NAME                                         TYPE                                  DATA   AGE
alertmanager                                 Opaque                                1      364d

The alertmanager pod is gone:

❯ k -n gke-gmp-system get pods
NAME                              READY   STATUS    RESTARTS   AGE
collector-s4wjs                   2/2     Running   0          123m
collector-tdnkl                   2/2     Running   0          9d
collector-vkvhj                   2/2     Running   0          9d
gmp-operator-68988c87ff-7m5q5     1/1     Running   0          9d
rule-evaluator-664c866849-64qks   2/2     Running   0          9d

On another non-autopilot cluster:

❯ k -n gmp-system get pods
NAME                              READY   STATUS    RESTARTS      AGE
alertmanager-0                    2/2     Running   0             9d
collector-bt69f                   2/2     Running   0             9d
collector-fdpdm                   2/2     Running   0             9d
collector-kfhdz                   2/2     Running   0             9d
gmp-operator-6b4cf8fcc4-b6n5t     1/1     Running   0             9d
rule-evaluator-659bf557cf-gmcvt   2/2     Running   2 (20h ago)   9d

I also see that the namespace for the gmp pods is not the same on an Autopilot vs. a regular cluster; I'm not sure if that has anything to do with it.

bernot-dev commented 5 months ago

The fact that you're observing this only in GKE Autopilot clusters is likely related to the timing of our release rollouts. We introduced a change in #691 that scales alertmanager to zero when rules are not configured using our Rules, ClusterRules, or GlobalRules resources.

Can you provide a bit more information about how you are using Alertmanager? Do you have any of those Rules configured?
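A quick way to check is to list all three resource types across the cluster (a sketch; it assumes the GMP CRDs are installed, as they are by default on GKE):

```shell
# List all Rules, ClusterRules, and GlobalRules in the cluster.
# If all three come back empty, the operator scales alertmanager to zero.
kubectl get rules --all-namespaces
kubectl get clusterrules
kubectl get globalrules
```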

gravelg commented 5 months ago

We manage alert rules through Grafana, so we don't have any of the Rules objects configured in those clusters. I can try to create a Rules object and see if the alertmanager pod comes back.

bernot-dev commented 5 months ago

It may also be worth checking if the StatefulSet that manages the alertmanager pods still exists: kubectl get -n gke-gmp-system statefulset/alertmanager

gravelg commented 5 months ago

The StatefulSet is indeed still there:

❯ kubectl get -n gke-gmp-system statefulset/alertmanager
NAME           READY   AGE
alertmanager   0/0     370d

I just applied the example rule from the repo and, sure enough, alertmanager is back:

❯ kubectl get -n gke-gmp-system statefulset/alertmanager
NAME           READY   AGE
alertmanager   1/1     555d

I'll try to craft a rule that doesn't actually alert us, just to hang around and keep alertmanager from scaling to 0, unless you have another option I can try.

bernot-dev commented 5 months ago

Something like our example rule should be a good starting point.
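A recording-only variant avoids generating any notifications while still keeping a Rules resource present. A minimal sketch (the metadata and group names here are hypothetical, and the expression just aggregates the built-in up metric):

```yaml
# Hypothetical minimal Rules object: a single recording rule keeps at least
# one Rules resource present so the operator does not scale alertmanager to 0.
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: keepalive-rules
  namespace: default
spec:
  groups:
  - name: keepalive
    interval: 60s
    rules:
    - record: job:up:sum
      expr: sum without (instance) (up)
```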

That will be the best workaround for now. I'll discuss with the team whether it makes sense for us to implement another solution for future releases.