I suppose this could also be a problem with the probes, as opposed to the PDB. Both the clusterchecks and cluster-agent deployments define liveness, readiness, and startup probes, each with a unique endpoint per probe, which is good. All of them had the same probe settings (with the exception of the port). Perhaps the clusterchecks probes are too simplistic and don't accurately reflect when the pods are live/ready/started?
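For context, the probe shape on the clusterchecks pods looks roughly like this. The paths, port, and timing values below are my reading of the chart defaults rather than a copy of our rendered manifests, so treat them as assumptions:

```yaml
# Sketch of the three probes on the clusterchecks pods; per the above, the
# cluster-agent pods use the same settings on a different port.
livenessProbe:
  httpGet:
    path: /live          # unique endpoint per probe
    port: 5555
  initialDelaySeconds: 15
  periodSeconds: 15
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 6
readinessProbe:
  httpGet:
    path: /ready
    port: 5555
  initialDelaySeconds: 15
  periodSeconds: 15
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 6
startupProbe:
  httpGet:
    path: /startup
    port: 5555
  initialDelaySeconds: 15
  periodSeconds: 15
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 6
```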
Another possibility is that the KSMCore check is just slow on the uptake. In any case, the issue is that we lose KSM metrics for a short, unpredictable window. The impact is that this can trip false alerts due to inaccurate data or `No data` conditions.
Changing the agent-clusterchecks PDB from `maxUnavailable: 1` to `minAvailable: 1` had no effect. In hindsight that makes sense: a PDB only constrains voluntary evictions (node drains and the like), not rollout restarts, which are governed by the Deployment's own `strategy.rollingUpdate` settings. Note the two settings also aren't strictly equivalent for evictions: with 3 replicas, `maxUnavailable: 1` allows 1 concurrent disruption while `minAvailable: 1` allows 2 (see the manifest sketch after this paragraph).

Other things I observed while testing: when doing a rollout restart of the deployment's 3 pods (on existing nodes), I see gaps in kubeStateMetrics between 40 and 60 seconds long during the restart timeframe. The containers in the pods consistently become `Ready 1/1` after about 32 seconds on startup. Perhaps there's more to dig into with regard to leader election and `Unhealthy` probe events; I'm not sure.
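For anyone comparing the two, here's the difference as a minimal PDB manifest (the name and selector labels are illustrative, not copied from the chart):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: agent-clusterchecks        # hypothetical name for illustration
spec:
  maxUnavailable: 1                # with 3 healthy replicas: 1 allowed disruption
  # minAvailable: 1                # with 3 healthy replicas: 2 allowed disruptions
  selector:
    matchLabels:
      app.kubernetes.io/component: clusterchecks-agent   # illustrative label
```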
On a side note, in only one restart out of 3 did I see a new pod's liveness and readiness probes fail right after startup with a 500 error, and they recovered immediately afterwards. This had no effect on the gaps in metrics or on the startup time, so it's probably an innocuous race condition.
Vendor support had us move the kubeStateMetrics check from the clusterchecks deployment to the cluster-agent deployment as a workaround. This improved things considerably: it eliminated the gap in some metrics and shrank the gap in others.
We accomplished this by simply removing one line from our helm-chart values:

```diff
 cluster-critical:
   datadog:
     kubeStateMetricsCore:
       enabled: true
-      useClusterCheckRunners: true
```
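To double-check where the check landed after the change, the cluster agent's `clusterchecks` subcommand lists the checks it dispatches to runners (the namespace and deployment name below are from our setup; adjust for yours):

```sh
# Checks dispatched to the clusterchecks runners; after the change,
# kubernetes_state_core should no longer appear in this list.
kubectl exec -ndatadog deploy/datadog-cluster-critical-cluster-agent -- \
  datadog-cluster-agent clusterchecks

# The check should instead show up in the cluster agent's own status output.
kubectl exec -ndatadog deploy/datadog-cluster-critical-cluster-agent -- \
  datadog-cluster-agent status
```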
As an aside, since the workaround results in an improvement, we also suggested a change to the relevant Datadog helm chart docs. The guidance currently says something like:

> If `clusterChecksRunner.enabled` is true, it's recommended to set this flag to true as well to better utilize dedicated workers and reduce load on the Cluster Agent.

That's what we originally followed, and it led to the gap in KSM metrics. It obviously wasn't the right choice for us, so we wanted to let others know until the docs are updated.
TL;DR
Improve the agent-clusterchecks pod-disruption-budget or probes to make its restart behavior less aggressive, in hopes of retaining the `kubernetes_state.*` and `kubernetes.*` metric data.

Impact
Gaps in kubeStateMetrics during agent-clusterchecks restarts can cause false alerts or misleading dashboards due to inaccurate metrics or missing data.
Background
We run the agent-clusterchecks as a deployment in our EKS clusters (via this official helm chart) and we enable its pod disruption budget. Additionally, to help with resiliency, we spread the deployment across 3 different nodes in 3 different AZs in our AWS environment, along the lines of the sketch below:
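A minimal sketch of what we mean by that spreading; whether the chart passes `topologySpreadConstraints` through under `clusterChecksRunner` exactly like this is my assumption, modeled on the cluster-agent config shown further down:

```yaml
clusterChecksRunner:
  enabled: true
  replicas: 3
  createPodDisruptionBudget: true
  topologySpreadConstraints:
    # keep the 3 replicas on distinct nodes
    - labelSelector:
        matchLabels:
          app.kubernetes.io/component: clusterchecks-agent   # illustrative label
      topologyKey: kubernetes.io/hostname
      maxSkew: 1
      whenUnsatisfiable: DoNotSchedule
    # and spread them across AZs
    - labelSelector:
        matchLabels:
          app.kubernetes.io/component: clusterchecks-agent
      topologyKey: topology.kubernetes.io/zone
      maxSkew: 1
      whenUnsatisfiable: ScheduleAnyway
```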
Occasionally, when we upgrade our Kubernetes nodes or just restart the agent-clusterchecks deployment, we notice that the `kubernetes_state.*` and `kubernetes.*` metrics completely disappear for a minute or a few. We have correlated this with the restart behavior of the `agent-clusterchecks` deployment: all 3 pods restart within about a minute of each other. Perhaps this doesn't allow enough time to send the metrics back to Datadog before the pods are stopped.

Not sure if this is related, but here are our agent-clusterchecks configs.
We also run the cluster-agent as a deployment in a similar way with a PDB, but with only 2 replicas, and we see less aggressive restart behavior from its pods:

```yaml
clusterAgent:
  enabled: true
  # Datadog recommends 2 replicas and a PDB for HA mode
  replicas: 2
  createPodDisruptionBudget: true
  # make sure we don't surge a new Node
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: function
                operator: In
                values:
                  - cluster-critical
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: "app.kubernetes.io/name"
                operator: In
                values:
                  - datadog-cluster-critical
              - key: "app.kubernetes.io/component"
                operator: In
                values:
                  - cluster-agent
          topologyKey: "kubernetes.io/hostname"
  topologySpreadConstraints:
    - labelSelector:
        matchLabels:
          app.kubernetes.io/name: datadog-cluster-critical
          app.kubernetes.io/component: cluster-agent
      topologyKey: topology.kubernetes.io/zone
      maxSkew: 1
      whenUnsatisfiable: ScheduleAnyway
  confd:
    kube_apiserver_metrics.yaml: |-
      # Force this to run in the clusterchecks-agent
      cluster_check: true
      instances:
        - prometheus_url: "https://kubernetes.default.svc.cluster.local:443/metrics"
```

```sh
$ kubectl get pdb/datadog-cluster-critical-cluster-agent -ndatadog
NAME                                      MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
datadog-cluster-critical-cluster-agent    1               N/A               1                     79d
$ kubectl rollout restart -ndatadog deploy/datadog-cluster-critical-cluster-agent
deployment.apps/datadog-cluster-critical-cluster-agent restarted
$ kubectl get po -ndatadog -lapp.kubernetes.io/component=cluster-agent -ojsonpath='{range .items[*]}{@.metadata.name}{" "}{@.metadata.creationTimestamp}{"\n"}{end}' | column -t
datadog-cluster-critical-cluster-agent-555f965d8c-jt5fr  2024-08-15T20:12:02Z
datadog-cluster-critical-cluster-agent-555f965d8c-mnwbq  2024-08-15T20:11:32Z
```

You can see that the [cluster-agent PDB](https://github.com/DataDog/helm-charts/blob/main/charts/datadog/templates/cluster-agent-pdb.yaml) has `minAvailable: 1` instead of the `maxUnavailable: 1` that the [agent-clusterchecks pod-disruption-budget](https://github.com/DataDog/helm-charts/blob/main/charts/datadog/templates/agent-clusterchecks-pdb.yaml) has. Not sure if this is the root cause or not.

We would expect the agent-clusterchecks PDB to ensure that metrics aren't lost during a restart of its deployment, or at least to minimize the chance of them being lost.
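One Deployment-level knob that could make the rollout less aggressive by staggering the pod replacements, shown here as plain Kubernetes spec rather than chart values (a sketch of the general mechanism, not something we've tried):

```yaml
# minReadySeconds makes the rollout wait until each replacement pod has
# been Ready for the given time before taking down the next old pod, so
# all 3 pods no longer restart within the same minute.
spec:
  minReadySeconds: 60
  strategy:
    rollingUpdate:
      maxSurge: 0        # matches our "don't surge a new Node" requirement
      maxUnavailable: 1  # replace one pod at a time
```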
Troubleshooting
I tried increasing the agent-clusterchecks deployment's `startupProbe.initialDelaySeconds` from `15` to `90` to stagger the startup times, but I was still able to reproduce the same problem.
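For reference, the change looked roughly like this in values form; whether the chart exposes `startupProbe` overrides under `clusterChecksRunner` in every version is an assumption on my part:

```yaml
clusterChecksRunner:
  startupProbe:
    initialDelaySeconds: 90   # up from the default 15; did not eliminate the gaps
```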