DataDog / helm-charts

Helm charts for Datadog products

GKE autopilot v1.25 in australia-southeast1 fails to deploy agent #1033

Closed: huib-coalesce closed this issue 1 year ago

huib-coalesce commented 1 year ago

Describe what happened: Given a GKE cluster running version 1.25.7-gke.1000 in Autopilot mode in the region australia-southeast1, when I deploy the datadog helm chart, the datadog-agent-cluster-agent gets deployed, but the datadog-agent DaemonSet does not.

Describe what you expected: A datadog-agent pod should be deployed on each node.

Steps to reproduce the issue: Create a cluster using Terraform

terraform init
terraform apply

Using gke.tf

# GKE cluster in Autopilot mode
resource "google_container_cluster" "primary" {
  name     = "test-k8s"
  location = "australia-southeast1"

  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.subnet.name
  ip_allocation_policy {}

  enable_autopilot = true

  private_cluster_config {
    enable_private_endpoint = false
    enable_private_nodes    = true
    master_ipv4_cidr_block  = "172.16.0.0/28"
  }

  master_authorized_networks_config {
    cidr_blocks {
      display_name = "public-ip"
      cidr_block = "<redacted>"
    }
  }

  release_channel {
    channel = "REGULAR"
  }

  resource_labels = {
    team = "dev"
    environment = "test"
    origin = "terraform"
  }
}

provider "google" {
  project = "test-project"
  region  = "australia-southeast1"
}

# VPC
resource "google_compute_network" "vpc" {
  name                    = "test-k8s-vpc"
  auto_create_subnetworks = "false"
  routing_mode            = "REGIONAL"
}

# Subnet
resource "google_compute_subnetwork" "subnet" {
  name                     = "test-k8s-subnet"
  region                   = "australia-southeast1"
  network                  = google_compute_network.vpc.name
  ip_cidr_range            = "10.10.0.0/24"
  private_ip_google_access = true
}

resource "google_compute_address" "web" {
  name   = "test-k8s-web"
  region = "australia-southeast1"
}

resource "google_compute_router" "web" {
  name    = "test-k8s-router"
  network = google_compute_network.vpc.id
}

resource "google_compute_router_nat" "web" {
  name                               = "test-k8s-nat"
  router                             = google_compute_router.web.name
  nat_ip_allocate_option             = "MANUAL_ONLY"
  nat_ips                            = [google_compute_address.web.self_link]
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"

  enable_dynamic_port_allocation      = "true"
  enable_endpoint_independent_mapping = "false"
}

This results in the following cluster being created: release channel Regular, version 1.25.7-gke.1000.

Create kube secrets in the default namespace

kubectl create secret generic datadog-api-key --from-literal api-key="<your api key>"
kubectl create secret generic datadog-app-key --from-literal app-key="<your app key>"

Deploy the helm chart for the agent

helm install datadog-agent datadog/datadog -f datadog-agent-values.yaml --version 3.28.1

with datadog-agent-values.yaml

datadog:
  apiKeyExistingSecret: datadog-api-key
  appKeyExistingSecret: datadog-app-key

  clusterName: test-k8s

  kubeStateMetricsCore:
    enabled: true
  kubeStateMetricsEnabled: false

  logs:
    enabled: false
    containerCollectAll: false

  apm:
    enabled: true
    port: 8126
    socketEnabled: false

  leaderElection: true
  collectEvents: true

  prometheusScrape:
    enabled: false

  processAgent:
    enabled: true
    processCollection: true

  dogstatsd:
    useHostPort: true
    port: 8125
    useSocketVolume: false

clusterAgent:
  enabled: true
  tokenExistingSecret: datadog-cluster-agent-token
  metricsProvider:
    enabled: false
  rbac:
    create: true
  admissionController:
    configMode: service

agents:
  enabled: true
  priorityClassCreate: true
  containers:
    agent:
      resources:
        requests:
          cpu: 20m
          memory: 256Mi
        limits:
          cpu: 200m
          memory: 256Mi

    traceAgent:
      resources:
        requests:
          cpu: 10m
          memory: 200Mi
        limits:
          cpu: 100m
          memory: 200Mi

    processAgent:
      resources:
        requests:
          cpu: 10m
          memory: 200Mi
        limits:
          cpu: 100m
          memory: 200Mi

providers:
  gke:
    autopilot: true

Results in

Error from server (GKE Warden constraints violations): error when creating "STDIN": admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-no-host-port]":["container agent specifies host ports [8125], which are disallowed in Autopilot.","container trace-agent specifies host ports [8126], which are disallowed in Autopilot."],"[denied by autogke-no-write-mode-hostpath]":["hostPath volume runtimesocketdir used in container agent uses path /var/run/containerd which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume procdir used in container agent uses path /proc which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume cgroups used in container agent uses path /sys/fs/cgroup which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume runtimesocketdir used in container trace-agent uses path /var/run/containerd which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume runtimesocketdir used in container process-agent uses path /var/run/containerd which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume cgroups used in container process-agent uses path /sys/fs/cgroup which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume passwd used in container process-agent uses path /etc/passwd which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume procdir used in container process-agent uses path /proc which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume procdir used in container init-config uses path /proc which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume runtimesocketdir used in container init-config uses path /var/run/containerd which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/]."]}
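
One way to check whether the chart's Autopilot overrides are actually being applied is to render the manifests locally and search for the fields that GKE Warden is rejecting. This is only a diagnostic sketch using the standard helm template and grep commands, with the same values file and chart version as above:

# Render the chart locally (no cluster access needed) and list any remaining
# host ports or hostPath volumes that Autopilot would reject.
helm template datadog-agent datadog/datadog -f datadog-agent-values.yaml --version 3.28.1 \
  | grep -nE 'hostPort|hostPath'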

Additional environment details (Operating System, Cloud provider, etc):

It's like the allow list is never populated:

kubectl get allowlistedworkload
No resources found

Even though the CRD is there:

kubectl get crd/allowlistedworkloads.auto.gke.io
NAME                               CREATED AT
allowlistedworkloads.auto.gke.io   2023-05-03T22:47:37Z

This was working fine for me in the past when creating a cluster with version 1.24.9-gke.3200 in us-central1 (https://github.com/DataDog/helm-charts/issues/947). For that cluster I get:

kubectl get allowlistedworkload
NAME             AGE
aqua             323d
cc               323d
datadog-agents   323d
istio-cni        323d
panw-twistlock   323d
splunk           323d
sysdig-agent     323d
ckeeney commented 1 year ago

I'm running into this when deploying in my own GKE Autopilot cluster.

I can see in the deployment logs for my gitlab-agent:

{"error":"failed to create typed patch object (monitoring/datadog-agent; apps/v1, Kind=DaemonSet): errors:
  .spec.template.spec.containers[name=\"agent\"].env: duplicate entries for key [name=\"DD_PROVIDER_KIND\"]
  .spec.template.spec.containers[name=\"trace-agent\"].env: duplicate entries for key [name=\"DD_PROVIDER_KIND\"]
  .spec.template.spec.containers[name=\"process-agent\"].env: duplicate entries for key [name=\"DD_PROVIDER_KIND\"]","group":"apps","kind":"DaemonSet","name":"datadog-agent","namespace":"monitoring","status":"Failed","timestamp":"2023-05-18T18:23:49Z","type":"apply"}

Here is my values.yaml

agents:
  containers:
    agent:
      resources:
        limits:  
          cpu: 100m
          memory: 128Mi
        requests:
          cpu: 100m
          memory: 128Mi
datadog:
  apiKeyExistingSecret: datadog-api-key
  apm:
    enabled: true
  logs:
    enabled: true
    containerCollectAll: true
  site: us3.datadoghq.com
providers:
  gke:
    autopilot: true
ckeeney commented 1 year ago

This seems like a bug in the helm chart, reproducible whenever you set providers.gke.autopilot=true. I can see the duplicate environment variable in the generated manifests.
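
A quick way to confirm the duplicate locally, without a cluster, is to render the chart and count the DD_PROVIDER_KIND entries. This is only a sketch, using the same values file and chart version as the Kustomize setup below; with three agent containers, a count above three indicates duplicates:

# Render the chart with the same values and count the DD_PROVIDER_KIND env entries.
helm template datadog-agent datadog/datadog --version 3.29.2 -f values.yaml \
  | grep -c 'name: DD_PROVIDER_KIND'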

I removed the duplicate env var with a Kustomize patch. The patch below is fragile and highly dependent on the order of the containers and environment variables generated by the helm chart, but it works fine for 3.29.2 and should fail to build if something changes, as opposed to generating broken manifests.

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: monitoring
helmCharts:
  - name: datadog
    repo: https://helm.datadoghq.com
    version: 3.29.2
    releaseName: datadog-agent
    namespace: monitoring
    valuesFile: values.yaml

patches:
  - path: patches/patch-agent-dd-provider-kind.yaml
    target:
      kind: DaemonSet
      name: datadog-agent
# patches/patch-agent-dd-provider-kind.yaml
# https://github.com/DataDog/helm-charts/issues/1033#issuecomment-1553454404

# patch agent
# test to ensure we are operating on the expected container
- op: test
  path: /spec/template/spec/containers/0/name
  value: agent
# test to ensure the DD_PROVIDER_KIND env var is where we expect
- op: test
  path: /spec/template/spec/containers/0/env/6/name
  value: DD_PROVIDER_KIND
# test to ensure the DD_PROVIDER_KIND env var is where we expect
- op: test
  path: /spec/template/spec/containers/0/env/13/name
  value:  DD_PROVIDER_KIND
# remove the duplicate DD_PROVIDER_KIND env var
- op: remove
  path: /spec/template/spec/containers/0/env/13

# patch trace-agent
# test to ensure we are operating on the expected container
- op: test
  path: /spec/template/spec/containers/1/name
  value: trace-agent
# test to ensure the DD_PROVIDER_KIND env var is where we expect
- op: test
  path: /spec/template/spec/containers/1/env/6/name
  value: DD_PROVIDER_KIND
# test to ensure the DD_PROVIDER_KIND env var is where we expect
- op: test
  path: /spec/template/spec/containers/1/env/10/name
  value:  DD_PROVIDER_KIND
# remove the duplicate DD_PROVIDER_KIND env var
- op: remove
  path: /spec/template/spec/containers/1/env/10

# patch process-agent
# test to ensure we are operating on the expected container
- op: test
  path: /spec/template/spec/containers/2/name
  value: process-agent
# test to ensure the DD_PROVIDER_KIND env var is where we expect
- op: test
  path: /spec/template/spec/containers/2/env/6/name
  value: DD_PROVIDER_KIND
# test to ensure the DD_PROVIDER_KIND env var is where we expect
- op: test
  path: /spec/template/spec/containers/2/env/10/name
  value:  DD_PROVIDER_KIND
# remove the duplicate DD_PROVIDER_KIND env var
- op: remove
  path: /spec/template/spec/containers/2/env/10
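
For reference, the helmCharts field above is only honored when Helm support is enabled at build time; a typical invocation (assuming a standalone kustomize v4.1+ binary) looks like this:

# Build the kustomization, letting kustomize pull and render the Helm chart,
# then apply the patched manifests.
kustomize build --enable-helm . | kubectl apply -f -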
soudaburger commented 1 year ago

Came here looking for a solution. We ran into the same problem. Our issue ended up being completely bizarre, and maybe this will help someone.

datadog:
    envFrom:
      - secretRef:
          name: datadog-custom-secrets

The config above is what caused our agents to be completely unable to deploy to Autopilot. It wasn't the existence (or absence) of the secret itself: the presence of this configuration in the helm chart caused other rendering issues in the YAML, which resulted in additional mounts or something else being misconfigured in a way Autopilot definitely didn't like.
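
One way to pin down what actually changes when that block is present is to diff the rendered manifests with and without it. This is just a sketch; the two values file names are hypothetical and stand for copies of your values with and without the datadog.envFrom block:

# Render the chart twice and diff the output to spot the rendering
# difference that Autopilot objects to (requires bash for the <() syntax).
diff \
  <(helm template datadog-agent datadog/datadog -f values-without-envfrom.yaml) \
  <(helm template datadog-agent datadog/datadog -f values-with-envfrom.yaml)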

fanny-jiang commented 1 year ago

Hi @huib-coalesce, apologies for the delay. Are you still experiencing the GKE Warden rejection errors when deploying the datadog helm chart in the australia-southeast1 region with Autopilot mode? Have you tried switching to the Rapid release channel or Stable release channel? I'm also unable to reproduce the GKE Warden rejection errors while testing in GKE Autopilot in the australia-southeast1 region.

In my experience, the GKE Warden rejection errors don't always correlate with actual misconfigurations. If you're still experiencing this problem, can you please open a ticket with Datadog Support and provide us with the helm install --dry-run output?

helm install datadog-agent datadog/datadog -f datadog-agent-values.yaml --version 3.28.1 --dry-run

Having the full dry-run output will help us narrow down any misconfigurations in the helm chart.
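
If it helps, the dry-run output can be captured to a file for the support ticket; adding --debug also includes the computed values. This is a sketch of the same command as above (the output file name is arbitrary):

# Capture the full dry-run render, including computed values, for support.
helm install datadog-agent datadog/datadog -f datadog-agent-values.yaml \
  --version 3.28.1 --dry-run --debug > datadog-dry-run-output.txt 2>&1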


Regarding "It's like the allow list is never populated": unfortunately, Google removed the ability to view the AllowListedV2Workload objects, so that is expected behavior.

fanny-jiang commented 1 year ago

Hi @ckeeney, the duplicate DD_PROVIDER_KIND env var issue has been fixed in chart version 3.33.10 thanks to this PR: https://github.com/DataDog/helm-charts/pull/1143.

fanny-jiang commented 1 year ago

Hi @soudaburger, I was able to reproduce your errors: indeed, GKE Autopilot is not happy when datadog.envFrom is used. The GKE Warden errors are confusing because none of the constraint violations seem applicable; the setting only adds an envFrom entry to each container spec, and I didn't spot anything out of place in the generated manifests. I couldn't find anything in the GKE Autopilot docs about whether envFrom is allowed, but I can check with Google about that field in the Datadog AllowListedV2Workload.

I tested a workaround that works for me:

datadog:
  env:
    - name: DD_FAKE
      valueFrom:
        secretKeyRef:
          name: datadog-custom-secrets
          key: DD_FAKE

I know it's not an ideal workaround since you'd have to specify each environment variable. In the meantime, I'll continue looking into why datadog.envFrom doesn't work in GKE Autopilot.

huib-coalesce commented 1 year ago

Hi @fanny-jiang, we've moved back to using Standard mode, so I have no idea whether the issue resolved itself or not. Interesting that Google decided to remove the ability to view the AllowListedV2Workload; it used to be quite helpful during debugging.

fanny-jiang commented 1 year ago

Hi @huib-coalesce, thanks for confirming. I agree, viewing the AllowListedWorkload was very helpful for debugging. I'll go ahead and close this issue.

@soudaburger I'll follow up on the envFrom discussion in the GH issue that you opened for this problem: https://github.com/DataDog/helm-charts/issues/1101