GoogleCloudPlatform / prometheus-engine

Google Cloud Managed Service for Prometheus libraries and manifests.
https://g.co/cloud/managedprometheus
Apache License 2.0

rule-evaluator doesn't get updated alertmanager pod ipv4s #866

Open parkedwards opened 8 months ago

parkedwards commented 8 months ago

hello - we're currently using Managed Prometheus with a self-hosted Alertmanager deployment. This has been functioning properly for over a year. We're currently on this version of rule-evaluator:

gke.gcr.io/prometheus-engine/rule-evaluator:v0.8.1-gke.9

our rule-evaluator sends alerting events to a self-managed Alertmanager StatefulSet, which lives in a separate namespace. We configure this via the OperatorConfig CRD:

---
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config

# https://github.com/GoogleCloudPlatform/prometheus-engine/blob/main/doc/api.md#ruleevaluatorspec
rules:
  alerting:
    alertmanagers:
      # configures where the rule-evaluator will send alerting events to
      - name: alertmanager
        namespace: monitoring
        port: 9093

# https://github.com/GoogleCloudPlatform/prometheus-engine/blob/main/doc/api.md#managedalertmanagerspec
# NOTE: this section is unused, as it points to an empty default Alertmanager configuration file
# since we are using a self-deployed Alertmanager instead of the one provided by GMP
managedAlertmanager:
  configSecret:
    name: alertmanager
    key: alertmanager.yaml
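
For reference, this endpoint configuration assumes a Service named alertmanager in the monitoring namespace whose Endpoints carry the Alertmanager pod IPs on port 9093 - roughly along these lines (the selector label here is illustrative):

apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  # headless is typical for a StatefulSet; any Service whose Endpoints list
  # the Alertmanager pod IPs works for endpoint-based discovery
  clusterIP: None
  selector:
    app.kubernetes.io/name: alertmanager   # must match the Alertmanager pod labels
  ports:
    - name: web
      port: 9093
      targetPort: 9093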

in the last month or so, we've noticed that the rule-evaluator becomes unable to resolve the downstream Alertmanager address after the Alertmanager pod is rescheduled.

From there, we'll see the rule-evaluator log this out:

{
alertmanager: "http://10.34.25.18:9093/api/v2/alerts"
caller: "notifier.go:532"
component: "notifier"
count: 1
err: "Post "http://10.34.25.18:9093/api/v2/alerts": context deadline exceeded"
level: "error"
msg: "Error sending alert"
ts: "2024-02-12T18:17:33.124842335Z"
}

this can go on for an hour - we have pages set up to notify us when the rule-evaluator stops pinging Alertmanager, via a custom heartbeat rule (sketched below). The only way to resolve this is by restarting the rule-evaluator deployment.

this suggests that the rule-evaluator is not reconciling the downstream IP addresses after startup, since we only reference the Alertmanager receiver by Kubernetes name/namespace/port in the OperatorConfig rather than by IP.
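
For reference, the heartbeat rule is along these lines - a rough sketch using the GMP Rules CRD, with illustrative names and namespace rather than our exact rule:

apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  namespace: monitoring
  name: heartbeat
spec:
  groups:
    - name: heartbeat
      interval: 30s
      rules:
        - alert: Heartbeat
          expr: vector(1)   # always firing; if it stops arriving downstream, the rule-evaluator -> Alertmanager path is broken
          labels:
            severity: heartbeat
          annotations:
            description: Always-firing heartbeat used as a dead man's switch.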

pintohutch commented 6 months ago

Hey @parkedwards - thanks for reaching out and apologies for the delayed response.

The rule-evaluator binary uses the same evaluation mechanism, configuration surface, and libraries as Prometheus does; i.e. it's using alertmanager_config verbatim.

So when it comes to BYO Alertmanagers, the rule-evaluator uses the same underlying Kubernetes service discovery as Prometheus does. Specifically, in our stack we take the same approach as prometheus-operator and use endpoint-based service discovery to find the targets to post alerts to.

Now if the alertmanager pod is rescheduled, presumably its Endpoints object would be updated with the new IP address. The Prometheus libraries use conventional client-go watch clients to ensure they're getting the most up-to-date state of resources from the K8s API server.

Would you be able to check the state of the corresponding Endpoints object for your alertmanager? Does it match the IP of what the rule-evaluator is trying to send to?
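
For example, something like this (Service name and namespace taken from the OperatorConfig above; the pod label is whatever your Alertmanager pods carry):

# Endpoints the API server currently reports for the alertmanager Service
kubectl get endpoints -n monitoring alertmanager -o wide

# pod IPs for comparison
kubectl get pods -n monitoring -o wide -l app.kubernetes.io/name=alertmanager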

volkanakcora commented 6 months ago

Hi @pintohutch,

I just restarted the Alertmanager, and the current endpoints are:

kubectl get endpoints -n monitoring-alertmanager
NAME           ENDPOINTS                                                     AGE
alertmanager   100.65.113.198:9093,100.65.113.199:9093,100.65.113.200:9093   8d

Here are the logs from the rule evaluator:

2024-05-10 11:03:47.572
evaluator {"caller":"manager.go:359", "component":"discovery manager notify", "level":"debug", "msg":"Discovery receiver's channel was full so will retry the next cycle", "ts":"2024-05-10T09:03:47.572593388Z"}

2024-05-10 11:03:52.572
evaluator {"caller":"manager.go:359", "component":"discovery manager notify", "level":"debug", "msg":"Discovery receiver's channel was full so will retry the next cycle", "ts":"2024-05-10T09:03:52.572576038Z"}

2024-05-10 11:03:55.856
evaluator {"alertmanager":"http://100.65.113.196:9093/api/v2/alerts", "caller":"notifier.go:532", "component":"notifier", "count":8, "err":"Post "http://100.65.113.196:9093/api/v2/alerts": context deadline exceeded", "level":"error", "msg":"Error sending alert", "ts":"2024-05-10T09:03:55.856348962Z"}

2024-05-10 11:03:55.856
evaluator {"alertmanager":"http://100.65.113.197:9093/api/v2/alerts", "caller":"notifier.go:532", "component":"notifier", "count":8, "err":"Post "http://100.65.113.197:9093/api/v2/alerts": context deadline exceeded", "level":"error", "msg":"Error sending alert", "ts":"2024-05-10T09:03:55.856512159Z"}

2024-05-10 11:03:57.573
evaluator {"caller":"manager.go:359", "component":"discovery manager notify", "level":"debug", "msg":"Discovery receiver's channel was full so will retry the next cycle", "ts":"2024-05-10T09:03:57.572745544Z"}

2024-05-10 11:04:02.572
evaluator {"caller":"manager.go:359", "component":"discovery manager notify", "level":"debug", "msg":"Discovery receiver's channel was full so will retry the next cycle", "ts":"2024-05-10T09:04:02.572195445Z"}

2024-05-10 11:04:05.858
evaluator {"alertmanager":"http://100.65.113.196:9093/api/v2/alerts", "caller":"notifier.go:532", "component":"notifier", "count":7, "err":"Post "http://100.65.113.196:9093/api/v2/alerts": context deadline exceeded", "level":"error", "msg":"Error sending alert", "ts":"2024-05-10T09:04:05.858036106Z"}

2024-05-10 11:04:05.858
evaluator {"alertmanager":"http://100.65.113.197:9093/api/v2/alerts", "caller":"notifier.go:532", "component":"notifier", "count":7, "err":"Post "http://100.65.113.197:9093/api/v2/alerts": context deadline exceeded", "level":"error", "msg":"Error sending alert", "ts":"2024-05-10T09:04:05.858132095Z"}

It's still pointing at the previous endpoints (from before the restart). The issue persists until we restart the rule-evaluator.

pintohutch commented 6 months ago

Interesting.

Hey @volkanakcora - could you provide the rule-evaluator config so we can sanity check this on our side?

kubectl get cm -ngmp-system rule-evaluator -oyaml

volkanakcora commented 6 months ago

Hi @pintohutch,

Please find our configs:

kubectl get cm -ngmp-system rule-evaluator -oyaml
apiVersion: v1
data:
  config.yaml: |
    global: {}
    alerting:
        alertmanagers:
            - follow_redirects: true
              enable_http2: true
              scheme: http
              timeout: 10s
              api_version: v2
              static_configs:
                - targets:
                    - alertmanager.gmp-system:9093
            - follow_redirects: true
              enable_http2: true
              scheme: http
              timeout: 10s
              api_version: v2
              relabel_configs:
                - source_labels: [__meta_kubernetes_endpoints_name]
                  regex: alertmanager
                  action: keep
                - source_labels: [__address__]
                  regex: (.+):\d+
                  target_label: __address__
                  replacement: $1:9093
                  action: replace
              kubernetes_sd_configs:
                - role: endpoints
                  kubeconfig_file: ""
                  follow_redirects: true
                  enable_http2: true
                  namespaces:
                    own_namespace: false
                    names:
                        - monitoring-alertmanager
    rule_files:
        - /etc/rules/*.yaml
kind: ConfigMap
metadata:
  creationTimestamp: "2023-12-19T14:31:49Z"
  name: rule-evaluator
  namespace: gmp-system
  resourceVersion: "191689827"
  uid: 32306108-88c5-4a30-b7b1-b6d45d553686

And here's our Alertmanager StatefulSet config:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: alertmanager
spec:
  replicas: 3
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app.kubernetes.io/name: alertmanager
  serviceName: alertmanager
  template:
    metadata:
      labels:
        app.kubernetes.io/name: alertmanager
    spec:
      topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              name: alert-manager
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      containers:
        - name: alertmanager
          image: remote-docker.artifactory.dbgcloud.io/prom/alertmanager
          args:
            - "--config.file=/etc/alertmanager/config.yml"
            - "--storage.path=/alertmanager"
            - "--web.listen-address=0.0.0.0:9093"
            - "--web.external-url=$(webExternalUrl)"
            - "--cluster.listen-address=0.0.0.0:9094"
            - "--cluster.peer=alertmanager-0.alertmanager.monitoring-alertmanager.svc.cluster.local:9094"
            - "--cluster.peer=alertmanager-1.alertmanager.monitoring-alertmanager.svc.cluster.local:9094"
            - "--cluster.peer=alertmanager-2.alertmanager.monitoring-alertmanager.svc.cluster.local:9094"
            - "--cluster.peer-timeout=15s"
            - "--cluster.gossip-interval=200ms"
            - "--cluster.pushpull-interval=1m0s"
            - "--cluster.settle-timeout=5s"
            - "--cluster.tcp-timeout=10s"
            - "--cluster.probe-timeout=500ms"
            - "--cluster.probe-interval=1s"
            - "--cluster.reconnect-interval=10s"
            - "--cluster.reconnect-timeout=6h0m0s"
            - "--cluster.label=alertmanager"
          ports:
            - name: web
              containerPort: 9093
            - name: cluster
              containerPort: 9094
          envFrom:
            - configMapRef:
                name: alert-manager-args
            - configMapRef:
                name: http-proxy-env
          resources:
            requests:
              cpu: 100m
              memory: 100M
            limits:
              cpu: 250m
              memory: 250M
          volumeMounts:
            - name: config-volume
              mountPath: /etc/alertmanager
            - name: templates-volume
              mountPath: /etc/alertmanager-templates
            - name: alertmanager
              mountPath: /alertmanager
            - name: slack-secrets
              mountPath: /secrets/slack
            - name: opsgenie-secrets
              mountPath: /secrets/opsgenie
          readinessProbe:
            httpGet:
              path: /-/ready
              port: web
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 1
            failureThreshold: 5
            successThreshold: 1
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: web
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 1
            failureThreshold: 5
            successThreshold: 1
          securityContext:
            allowPrivilegeEscalation: false
            seccompProfile:
              type: RuntimeDefault
            capabilities:
              drop:
                - ALL
      volumes:
        - name: config-volume
          configMap:
            name: alertmanager-config
        - name: templates-volume
          configMap:
            name: alertmanager-templates
        - name: alertmanager
          emptyDir: {}
        - name: slack-secrets
          secret:
            secretName: slack-secrets
        - name: opsgenie-secrets
          secret:
            secretName: opsgenie-secrets

Let me know if you need further info.

pintohutch commented 6 months ago

Hey @volkanakcora - do you have self-monitoring enabled? I'm curious what the value of prometheus_notifications_alertmanagers_discovered{job="rule-evaluator"} is, particularly before, during, and after an Alertmanager restart.
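
i.e. graphing something like this from your self-monitoring metrics:

# number of Alertmanager endpoints the notifier currently knows about, per rule-evaluator pod
prometheus_notifications_alertmanagers_discovered{job="rule-evaluator"}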

volkanakcora commented 5 months ago

Before Alert Manager Restart:

{
  __name__="prometheus_notifications_alertmanagers_discovered",
  cluster="cluster-dev",
  container="evaluator",
  instance="rule-evaluator-799bf54847-pdkkf:r-eval-metrics",
  job="rule-evaluator",
  location="europe-west3",
  namespace="gmp-system",
  pod="rule-evaluator-799bf54847-pdkkf",
  project_id="dbg-energy-dev-659ba550"
}

Restarting the Alert Manager:

kubectl -n monitoring-alertmanager rollout restart statefulset alertmanager

After Alert Manager Restart (no change during or after the restart):

{
  __name__="prometheus_notifications_alertmanagers_discovered",
  cluster="cluster-dev",
  container="evaluator",
  instance="rule-evaluator-799bf54847-pdkkf:r-eval-metrics",
  job="rule-evaluator",
  location="europe-west3",
  namespace="gmp-system",
  pod="rule-evaluator-799bf54847-pdkkf",
  project_id="dbg-energy-dev-659ba550"
}

Restarting the Rule Evaluator:

kubectl rollout restart -n gmp-system deployment rule-evaluator

After Rule Evaluator Restart (New Metrics):

{
  __name__="prometheus_notifications_alertmanagers_discovered",
  cluster="cluster-dev",
  container="evaluator",
  instance="rule-evaluator-675998548d-nrkmd:r-eval-metrics",
  job="rule-evaluator",
  location="europe-west3",
  namespace="gmp-system",
  pod="rule-evaluator-675998548d-nrkmd",
  project_id="dbg-energy-dev-659ba550"
}

volkanakcora commented 5 months ago

I just synced them again via cmd line: argocd app sync monitoring-applications/alert-manager --local kubernetes-config/envs/development/alert-manager/ --prune

this time it did not cause the same issue, and I also checked the rule-evaluator discovery - it seems to work as well.

[graph: restart pods]

pintohutch commented 5 months ago

Hey @volkanakcora - thanks for posting the metric details above. However, I'm more interested in the graph, which you posted in the subsequent comment. I would suspect that in the broken case the line stays at 4 (i.e. the discovery isn't toggled for some reason).

> I just synced them again via cmd line: argocd app sync monitoring-applications/alert-manager --local kubernetes-config/envs/development/alert-manager/ --prune

I haven't personally used argo before. What does this do and why do you think it may fix the issue?

volkanakcora commented 5 months ago

Hi @pintohutch,

That's actually the expected behavior, right? When the pod count drops to 3 and recovers to 4, it seems like ArgoCD's intervention functioned as intended (similar to a statefulset restart, but through ArgoCD's GitOps workflow).

In my initial comment, I mentioned rescheduling the pods (via the command line) while the rule-evaluator itself remained unchanged, so I think we should stick with that problem, if I understood it right.

pintohutch commented 5 months ago

Oh wait - I actually think your debug logs from your second comment hold the clue. Specifically:

evaluator {"caller":"manager.go:359", "component":"discovery manager notify", "level":"debug", "msg":"Discovery receiver's channel was full so will retry the next cycle", "ts":"2024-05-10T09:04:02.572195445Z"}

The discovery manager is not able to keep up with service discovery events (changes) in this case. Is your rule-evaluator resource-starved in any way?

We would actually be able to track the delay through discovery metrics, but we don't enable those in our rule-evaluator (we should!) - filed https://github.com/GoogleCloudPlatform/prometheus-engine/issues/973 to do that.
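
A couple of quick ways to check for resource starvation in the meantime (rough sketch - adjust names as needed):

# current usage vs. requests/limits (requires metrics-server, which GKE ships by default)
kubectl top pod -n gmp-system

# restart counts, OOMKills, and configured requests/limits on the rule-evaluator pods
kubectl describe pod -n gmp-system $(kubectl get pods -n gmp-system -o name | grep rule-evaluator)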

volkanakcora commented 5 months ago

Hi @pintohutch, these are the logs/errors we get after every Alertmanager rescheduling.

Not sure if the rule-evaluator has a resource issue. I had thought about it as well, but did not change anything on the rule-evaluator side.

Please see the last 7 days of metrics for the rule-evaluator: [graph: Rule-evaluator]

Do you suggest we increase the rule-evaluator resources and test again?

Volkan.

bwplotka commented 5 months ago

Yes please, but it also depends on what limits you currently have.

volkanakcora commented 5 months ago

Hi @bwplotka, I'm going to configure a VPA for the rule-evaluator and test it.
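
Roughly along these lines - a minimal sketch in recommendation-only mode, assuming the VPA components are installed in the cluster:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: rule-evaluator
  namespace: gmp-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rule-evaluator
  updatePolicy:
    updateMode: "Off"   # recommendation-only; apply the suggested requests/limits manually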

volkanakcora commented 5 months ago

Hi @bwplotka, @pintohutch,

I have boosted the resources for the rule-evaluator, Alertmanager, and gmp-operator. However, the result is still the same:

[graph: discovery]

volkanakcora commented 5 months ago

Restarting the entire Alertmanager application by deleting and recreating the StatefulSet seems to resolve the issue, but it's not a guaranteed fix - sometimes even this approach fails.

pintohutch commented 5 months ago

Ok thanks for trying and letting us know @volkanakcora. It looks like you're running managed collection on GKE. Are you ok if we take a look from our side to help debug? We can open an internal ticket to track the support work.

I wonder if it's related to https://github.com/prometheus/prometheus/issues/13676...

volkanakcora commented 5 months ago

Hi @pintohutch, it's OK for us, thank you.

It could be related - I'm checking it.