I suppose this could also be a problem with the probes, as opposed to the PDB. Both the clusterchecks and cluster-agent deployments define liveness, readiness, and startup probes, each with a unique endpoint per probe, which is good. All of them had the same probe settings (with the exception of the port). Perhaps the clusterchecks probes are too simplistic and don't accurately reflect when the pods are live/ready/started?
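For context, the probe shape on the clusterchecks pods looks roughly like this. The paths, port, and timing values below are my reading of the chart defaults rather than a copy of our rendered manifests, so treat them as assumptions:

```yaml
# Sketch of the three probes on the clusterchecks pods; per the above, the
# cluster-agent pods use the same settings on a different port.
livenessProbe:
  httpGet:
    path: /live          # unique endpoint per probe
    port: 5555
  initialDelaySeconds: 15
  periodSeconds: 15
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 6
readinessProbe:
  httpGet:
    path: /ready
    port: 5555
  initialDelaySeconds: 15
  periodSeconds: 15
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 6
startupProbe:
  httpGet:
    path: /startup
    port: 5555
  initialDelaySeconds: 15
  periodSeconds: 15
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 6
```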
Another possibility is that the KSMCore check is just slow on the uptake. In any case, the issue is that we lose KSM metrics for a short, unpredictable window. The impact is that this can trip false alerts due to inaccurate data or `No data` conditions.
Changing the agent-clusterchecks PDB from `maxUnavailable: 1` to `minAvailable: 1` had no effect. In hindsight that makes sense: a PDB only constrains voluntary evictions (node drains and the like), not rollout restarts, which are governed by the Deployment's own `strategy.rollingUpdate` settings. Note the two settings also aren't strictly equivalent for evictions: with 3 replicas, `maxUnavailable: 1` allows 1 concurrent disruption while `minAvailable: 1` allows 2 (see the manifest sketch after this paragraph).

Other things I observed while testing: when doing a rollout restart of the deployment's 3 pods (on existing nodes), I see gaps in kubeStateMetrics between 40 and 60 seconds long during the restart timeframe. The containers in the pods consistently become `Ready 1/1` after about 32 seconds on startup. Perhaps there's more to dig into with regard to leader election and `Unhealthy` probe events; I'm not sure.
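For anyone comparing the two, here's the difference as a minimal PDB manifest (the name and selector labels are illustrative, not copied from the chart):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: agent-clusterchecks        # hypothetical name for illustration
spec:
  maxUnavailable: 1                # with 3 healthy replicas: 1 allowed disruption
  # minAvailable: 1                # with 3 healthy replicas: 2 allowed disruptions
  selector:
    matchLabels:
      app.kubernetes.io/component: clusterchecks-agent   # illustrative label
```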
On a side note, in only one restart out of 3 did I see a new pod's liveness and readiness probes fail right after startup with a 500 error, and they recovered immediately afterwards. This had no effect on the gaps in metrics or on the startup time, so it's probably an innocuous race condition.
Vendor support had us move the kubeStateMetrics check from the clusterchecks deployment to the cluster-agent deployment as a workaround. This improved things considerably: it eliminated the gap in some metrics and shrank the gap in others.
We accomplished this by simply removing one line from our helm-chart values:

```diff
 cluster-critical:
   datadog:
     kubeStateMetricsCore:
       enabled: true
-      useClusterCheckRunners: true
```
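To double-check where the check landed after the change, the cluster agent's `clusterchecks` subcommand lists the checks it dispatches to runners (the namespace and deployment name below are from our setup; adjust for yours):

```sh
# Checks dispatched to the clusterchecks runners; after the change,
# kubernetes_state_core should no longer appear in this list.
kubectl exec -ndatadog deploy/datadog-cluster-critical-cluster-agent -- \
  datadog-cluster-agent clusterchecks

# The check should instead show up in the cluster agent's own status output.
kubectl exec -ndatadog deploy/datadog-cluster-critical-cluster-agent -- \
  datadog-cluster-agent status
```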
As an aside, since the workaround results in an improvement, we also suggested a change to the relevant Datadog helm chart docs. The guidance currently says something like:

> If `clusterChecksRunner.enabled` is true, it's recommended to set this flag to true as well to better utilize dedicated workers and reduce load on the Cluster Agent.

That's what we originally followed, and it led to the gap in KSM metrics. It obviously wasn't the right choice for us, so we wanted to let others know until the docs are updated.
TL;DR
Improve the agent-clusterchecks pod-disruption-budget or probes to make its restart behavior less aggressive, in hopes of retaining the `kubernetes_state.*` and `kubernetes.*` metric data.

Impact
Gaps in kubeStateMetrics during agent-clusterchecks restarts can cause false alerts or misleading dashboards due to inaccurate metrics or missing data.
Background
We run the agent-clusterchecks as a deployment in our EKS clusters (via this official helm chart) and we enable its pod disruption budget. Additionally, to help with resiliency, we spread the deployment across 3 different nodes in 3 different AZs in our AWS environment, along the lines of the sketch below:
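A minimal sketch of what we mean by that spreading; whether the chart passes `topologySpreadConstraints` through under `clusterChecksRunner` exactly like this is my assumption, modeled on the cluster-agent config shown further down:

```yaml
clusterChecksRunner:
  enabled: true
  replicas: 3
  createPodDisruptionBudget: true
  topologySpreadConstraints:
    # keep the 3 replicas on distinct nodes
    - labelSelector:
        matchLabels:
          app.kubernetes.io/component: clusterchecks-agent   # illustrative label
      topologyKey: kubernetes.io/hostname
      maxSkew: 1
      whenUnsatisfiable: DoNotSchedule
    # and spread them across AZs
    - labelSelector:
        matchLabels:
          app.kubernetes.io/component: clusterchecks-agent
      topologyKey: topology.kubernetes.io/zone
      maxSkew: 1
      whenUnsatisfiable: ScheduleAnyway
```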
Occasionally, when we upgrade our Kubernetes nodes or just restart the agent-clusterchecks deployment, we notice that the `kubernetes_state.*` and `kubernetes.*` metrics completely disappear for a minute or a few. We have correlated this with the restart behavior of the `agent-clusterchecks` deployment: all 3 pods restart within about a minute of each other. Perhaps this doesn't allow enough time to send the metrics back to Datadog before the pods are stopped.

Not sure if this is related, but here are our agent-clusterchecks configs.
We also run the cluster-agent as a deployment in a similar way with a PDB, but with only 2 replicas, and we see less aggressive restart behavior from its pods:

```yaml
clusterAgent:
  enabled: true
  # Datadog recommends 2 replicas and a PDB for HA mode
  replicas: 2
  createPodDisruptionBudget: true
  # make sure we don't surge a new Node
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: function
                operator: In
                values:
                  - cluster-critical
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: "app.kubernetes.io/name"
                operator: In
                values:
                  - datadog-cluster-critical
              - key: "app.kubernetes.io/component"
                operator: In
                values:
                  - cluster-agent
          topologyKey: "kubernetes.io/hostname"
  topologySpreadConstraints:
    - labelSelector:
        matchLabels:
          app.kubernetes.io/name: datadog-cluster-critical
          app.kubernetes.io/component: cluster-agent
      topologyKey: topology.kubernetes.io/zone
      maxSkew: 1
      whenUnsatisfiable: ScheduleAnyway
  confd:
    kube_apiserver_metrics.yaml: |-
      # Force this to run in the clusterchecks-agent
      cluster_check: true
      instances:
        - prometheus_url: "https://kubernetes.default.svc.cluster.local:443/metrics"
```

```sh
$ kubectl get pdb/datadog-cluster-critical-cluster-agent -ndatadog
NAME                                      MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
datadog-cluster-critical-cluster-agent    1               N/A               1                     79d
$ kubectl rollout restart -ndatadog deploy/datadog-cluster-critical-cluster-agent
deployment.apps/datadog-cluster-critical-cluster-agent restarted
$ kubectl get po -ndatadog -lapp.kubernetes.io/component=cluster-agent -ojsonpath='{range .items[*]}{@.metadata.name}{" "}{@.metadata.creationTimestamp}{"\n"}{end}' | column -t
datadog-cluster-critical-cluster-agent-555f965d8c-jt5fr  2024-08-15T20:12:02Z
datadog-cluster-critical-cluster-agent-555f965d8c-mnwbq  2024-08-15T20:11:32Z
```

You can see that the [cluster-agent PDB](https://github.com/DataDog/helm-charts/blob/main/charts/datadog/templates/cluster-agent-pdb.yaml) has `minAvailable: 1` instead of the `maxUnavailable: 1` that the [agent-clusterchecks pod-disruption-budget](https://github.com/DataDog/helm-charts/blob/main/charts/datadog/templates/agent-clusterchecks-pdb.yaml) has. Not sure if this is the root cause or not.

We would expect the agent-clusterchecks PDB to ensure that metrics aren't lost during a restart of its deployment, or at least to minimize the chance of them being lost.
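One Deployment-level knob that could make the rollout less aggressive by staggering the pod replacements, shown here as plain Kubernetes spec rather than chart values (a sketch of the general mechanism, not something we've tried):

```yaml
# minReadySeconds makes the rollout wait until each replacement pod has
# been Ready for the given time before taking down the next old pod, so
# all 3 pods no longer restart within the same minute.
spec:
  minReadySeconds: 60
  strategy:
    rollingUpdate:
      maxSurge: 0        # matches our "don't surge a new Node" requirement
      maxUnavailable: 1  # replace one pod at a time
```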
Troubleshooting
I tried increasing the agent-clusterchecks deployment's `startupProbe.initialDelaySeconds` from `15` to `90` to stagger the startup times, but I was still able to reproduce the same problem.
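For reference, the change looked roughly like this in values form; whether the chart exposes `startupProbe` overrides under `clusterChecksRunner` in every version is an assumption on my part:

```yaml
clusterChecksRunner:
  startupProbe:
    initialDelaySeconds: 90   # up from the default 15; did not eliminate the gaps
```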