DataDog / helm-charts

Helm charts for Datadog products
Apache License 2.0

GKE autopilot taint toleration settings #1003

Open AndyMoreland opened 1 year ago

AndyMoreland commented 1 year ago

Hi folks -- I use the standard datadog helm chart on GKE autopilot. I ran into trouble with the ddog agent daemonset when I started using the "balanced" compute class nodes supported by GKE autopilot.

It looks like GKE autopilot adds a "cloud.google.com/compute-class" taint on non-default-compute-class nodes that it manages, which prevents the ddog agent daemonset pods from being scheduled on those nodes.

Here's what I did to fix this (terraform syntax) --

set {
  name  = "datadog.agents.tolerations"
  value = jsonencode([{"effect": "NoSchedule", "key": "cloud.google.com/compute-class", "operator": "Equal", "value": "Balanced"}])
}

Took me a while to figure out what was going on. Leaving this here in the hopes that it helps others. I'd advocate for making the helm chart default to tolerating the taints of non-default compute class nodes when autopilot: true is set in values, but I don't know helm chart syntax well enough to make that change.

More info on gke autopilot compute classes here: https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-compute-classes
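
A possible broader variant, as a sketch rather than something tested in this thread: a Kubernetes toleration with operator Exists and no value matches a taint with that key regardless of its value, so a single entry would cover every compute class. This assumes the values are passed to the Datadog chart directly, so the path is agents.tolerations (the terraform snippet above uses datadog.agents.tolerations instead).

```yaml
agents:
  tolerations:
    # Assumption: tolerate the compute-class taint whatever its value
    # (Balanced, Scale-Out, ...). "Exists" must not be given a value.
    - effect: NoSchedule
      key: cloud.google.com/compute-class
      operator: Exists
```

The trade-off is that the agent would then schedule onto any compute class, including ones added later, which may or may not be what you want.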

TobiasLierzer commented 1 year ago

For those not using terraform, you can set the toleration in the Helm chart values through .Values.agents.tolerations. In this case:

agents:
  tolerations:
    - effect: NoSchedule
      key: cloud.google.com/compute-class
      value: Balanced
      operator: Equal

TobiasLierzer commented 1 year ago

Similar effect for the Scale-Out compute class. In my case, Scale-Out nodes also carried a second taint for amd64, so the node taints look like this:

- effect: NoSchedule
  key: cloud.google.com/compute-class
  value: Scale-Out
- effect: NoSchedule
  key: kubernetes.io/arch
  value: amd64

According to docs, both Balanced and Scale-Out can also run on arm64.
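
For reference, tolerations matching the two Scale-Out taints listed above might look like the following in the chart values. This is a sketch that simply mirrors those taints one-to-one under agents.tolerations; it assumes amd64 nodes, so an arm64 node would presumably need its own kubernetes.io/arch entry.

```yaml
agents:
  tolerations:
    # Matches the compute-class taint on Scale-Out nodes.
    - effect: NoSchedule
      key: cloud.google.com/compute-class
      operator: Equal
      value: Scale-Out
    # Matches the architecture taint seen on the same nodes (amd64 here).
    - effect: NoSchedule
      key: kubernetes.io/arch
      operator: Equal
      value: amd64
```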