huib-coalesce closed this issue 1 year ago.
I'm running into this when deploying to my own GKE Autopilot cluster.
I can see the following in the deployment logs for my gitlab-agent:
{"error":"failed to create typed patch object (monitoring/datadog-agent; apps/v1, Kind=DaemonSet): errors:
.spec.template.spec.containers[name=\"agent\"].env: duplicate entries for key [name=\"DD_PROVIDER_KIND\"]
.spec.template.spec.containers[name=\"trace-agent\"].env: duplicate entries for key [name=\"DD_PROVIDER_KIND\"]
.spec.template.spec.containers[name=\"process-agent\"].env: duplicate entries for key [name=\"DD_PROVIDER_KIND\"]","group":"apps","kind":"DaemonSet","name":"datadog-agent","namespace":"monitoring","status":"Failed","timestamp":"2023-05-18T18:23:49Z","type":"apply"}
Here is my values.yaml
agents:
  containers:
    agent:
      resources:
        limits:
          cpu: 100m
          memory: 128Mi
        requests:
          cpu: 100m
          memory: 128Mi
datadog:
  apiKeyExistingSecret: datadog-api-key
  apm:
    enabled: true
  logs:
    enabled: true
    containerCollectAll: true
  site: us3.datadoghq.com
providers:
  gke:
    autopilot: true
This seems like a bug in the helm chart, reproducible whenever you set providers.gke.autopilot=true. I can see the duplicate environment variable in the generated manifests.
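If you want to confirm the duplication locally before applying anything, a minimal sketch is to render the chart and grep the output. This assumes the values.yaml above is in the current directory; the chart version and release name mirror the kustomization below.
# Render the chart with the same values and count how often the env var appears.
# On an affected chart version the count is higher than the number of agent containers.
helm repo add datadog https://helm.datadoghq.com
helm repo update
helm template datadog-agent datadog/datadog --version 3.29.2 --namespace monitoring -f values.yaml | grep -c DD_PROVIDER_KIND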
I removed the duplicate env var with a Kustomize patch. The patch below is fragile and highly dependent on the order of the containers and environment variables generated by the helm chart, but it works fine for 3.29.2 and should fail to build if something changes, as opposed to generating broken manifests.
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: monitoring
helmCharts:
  - name: datadog
    repo: https://helm.datadoghq.com
    version: 3.29.2
    releaseName: datadog-agent
    namespace: monitoring
    valuesFile: values.yaml
patches:
  - path: patches/patch-agent-dd-provider-kind.yaml
    target:
      kind: DaemonSet
      name: datadog-agent
# patches/patch-agent-dd-provider-kind.yaml
# https://github.com/DataDog/helm-charts/issues/1033#issuecomment-1553454404

# patch agent
# test to ensure we are operating on the expected container
- op: test
  path: /spec/template/spec/containers/0/name
  value: agent
# test to ensure the first DD_PROVIDER_KIND env var is where we expect
- op: test
  path: /spec/template/spec/containers/0/env/6/name
  value: DD_PROVIDER_KIND
# test to ensure the duplicate DD_PROVIDER_KIND env var is where we expect
- op: test
  path: /spec/template/spec/containers/0/env/13/name
  value: DD_PROVIDER_KIND
# remove the duplicate DD_PROVIDER_KIND env var
- op: remove
  path: /spec/template/spec/containers/0/env/13

# patch trace-agent
# test to ensure we are operating on the expected container
- op: test
  path: /spec/template/spec/containers/1/name
  value: trace-agent
# test to ensure the first DD_PROVIDER_KIND env var is where we expect
- op: test
  path: /spec/template/spec/containers/1/env/6/name
  value: DD_PROVIDER_KIND
# test to ensure the duplicate DD_PROVIDER_KIND env var is where we expect
- op: test
  path: /spec/template/spec/containers/1/env/10/name
  value: DD_PROVIDER_KIND
# remove the duplicate DD_PROVIDER_KIND env var
- op: remove
  path: /spec/template/spec/containers/1/env/10

# patch process-agent
# test to ensure we are operating on the expected container
- op: test
  path: /spec/template/spec/containers/2/name
  value: process-agent
# test to ensure the first DD_PROVIDER_KIND env var is where we expect
- op: test
  path: /spec/template/spec/containers/2/env/6/name
  value: DD_PROVIDER_KIND
# test to ensure the duplicate DD_PROVIDER_KIND env var is where we expect
- op: test
  path: /spec/template/spec/containers/2/env/10/name
  value: DD_PROVIDER_KIND
# remove the duplicate DD_PROVIDER_KIND env var
- op: remove
  path: /spec/template/spec/containers/2/env/10
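For completeness, this setup only works when kustomize is allowed to inflate the helm chart itself, so the build step I use looks roughly like this (assuming the helm binary is on the PATH):
# Render the chart via kustomize's helmCharts support, apply the JSON patch above,
# and pipe the result to kubectl.
kustomize build --enable-helm . | kubectl apply -f -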
Came here looking for a solution. We ran into the same problem. Our issue ended up being completely bizarre, and maybe this will help someone.
datadog:
  envFrom:
    - secretRef:
        name: datadog-custom-secrets
That config above is what caused our agents to be completely unable to deploy to Autopilot. It wasn't the existence (or lack thereof) of the secret: just having this configuration in the helm chart caused other rendering issues in the YAML, which apparently resulted in additional mounts or something being misconfigured that Autopilot definitely didn't like.
Hi @huib-coalesce, apologies for the delay. Are you still experiencing the GKE Warden rejection errors when deploying the datadog helm chart in the australia-southeast1 region with Autopilot mode? Have you tried switching to the Rapid release channel or the Stable release channel? I'm also unable to reproduce the GKE Warden rejection errors while testing in GKE Autopilot in the australia-southeast1 region.
In my experience, the GKE Warden rejection errors don't always correlate with actual misconfigurations. If you're still experiencing this problem, can you please open a ticket with Datadog Support and provide us with the helm install --dry-run output?
helm install datadog-agent datadog/datadog -f datadog-agent-values.yaml --version 3.28.1 --dry-run
Having the full dry-run output will help us narrow down any misconfigurations in the helm chart.
It's like the allow list is never populated:
Unfortunately, Google removed the ability to view the AllowListedV2Workload object, so that is expected behavior.
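For anyone who wants to see what is still exposed on their cluster, a rough sketch is below; the exact CRD and resource names vary by GKE version, so treat them as assumptions rather than a guaranteed interface.
# Look for the Autopilot allow-list CRDs, then try to read the objects themselves.
# On recent GKE versions the objects may no longer be viewable, matching the behavior described above.
kubectl get crd | grep -i allowlist
kubectl get allowlistedv2workloads || echo "allow-list objects are not viewable on this cluster"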
Hi @ckeeney, the duplicate DD_PROVIDER_KIND env var issue has been fixed in chart version 3.33.10 thanks to this PR: https://github.com/DataDog/helm-charts/pull/1143.
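For anyone still carrying the Kustomize patch above, picking up the fix should just be a matter of bumping the chart version (change version: 3.29.2 to 3.33.10 in kustomization.yaml and drop the patches entry). For a plain helm install, roughly the following should do it; the release name, namespace, and values file are assumed from the earlier comments.
# Move to a chart version that contains the DD_PROVIDER_KIND fix.
helm repo update
helm upgrade datadog-agent datadog/datadog --version 3.33.10 --namespace monitoring -f values.yaml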
Hi @soudaburger, I was able to reproduce your errors. Indeed, GKE Autopilot is not happy when datadog.envFrom is used. The GKE Warden errors are confusing because none of the constraint violations seem to be applicable: the setting only adds an envFrom entry to each container spec, and I didn't spot any misplaced fields in the generated manifests. I couldn't find anything in the GKE Autopilot docs about whether envFrom is allowed, but I can check with Google about the field in the Datadog AllowListedV2Workload.
I tested a workaround that works for me:
datadog:
  env:
    - name: DD_FAKE
      valueFrom:
        secretKeyRef:
          name: datadog-custom-secrets
          key: DD_FAKE
I know it's not the most ideal workaround since you'd have to specify each environment variable. In the meantime, I'll continue looking into why datadog.envFrom doesn't work in GKE Autopilot.
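If the secret already exists and the tedious part is writing one env entry per key, something like the line below can list the keys to map; the secret name is taken from the snippet above, while the namespace and the use of jq are assumptions.
# Print the keys stored in the existing secret so each one can become
# its own datadog.env entry with a secretKeyRef.
kubectl get secret datadog-custom-secrets --namespace monitoring -o json | jq -r '.data | keys[]'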
Hi @fanny-jiang, we've moved back to using Standard mode, so I have no idea if the issue resolved itself or not. It's also interesting that Google decided to remove viewing the AllowListedV2Workload; it used to be quite helpful during debugging.
Hi @huib-coalesce, thanks for confirming. I agree, viewing the AllowListedWorkload was very helpful for debugging. I'll go ahead and close this issue.
@soudaburger I'll follow up on the envFrom discussion in the GH issue that you opened for this problem: https://github.com/DataDog/helm-charts/issues/1101
Describe what happened: Given a GKE cluster version 1.25.7-gke.1000 in Autopilot mode, in the region australia-southeast1. When I deploy the datadog helm chart, the datadog-agent-cluster-agent gets deployed, but not the datadog-agent DaemonSet.

Describe what you expected: A datadog-agent should be deployed on each node.

Steps to reproduce the issue:
1. Create a cluster using Terraform, using gke.tf (not reproduced here; a rough shell equivalent follows this list). This results in the following cluster being created: Release channel: Regular channel, Version: 1.25.7-gke.1000
2. Create kube secrets in the default namespace
3. Deploy the helm chart for the agent with datadog-agent-values.yaml
Results in:
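The output referenced above and the original gke.tf / datadog-agent-values.yaml aren't reproduced in this thread view, so purely as an illustration, an equivalent cluster and secret could be created from the shell as sketched below. The gcloud flags mirror the cluster description above, and the secret name and key are assumptions based on the values.yaml earlier in this thread rather than the reporter's actual files.
# Create an Autopilot cluster matching the reported region, release channel, and version.
gcloud container clusters create-auto datadog-repro --region australia-southeast1 --release-channel regular --cluster-version 1.25.7-gke.1000
# Create the API key secret the chart points at via apiKeyExistingSecret (names assumed).
kubectl create secret generic datadog-api-key --from-literal=api-key="$DD_API_KEY"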
Additional environment details (Operating System, Cloud provider, etc):
It's like the allow list is never populated:
Even though the CRD is there:
This was working fine for me in the past when creating a cluster with Version 1.24.9-gke.3200 in us-central1: https://github.com/DataDog/helm-charts/issues/947 For that cluster I get: