Open mologie opened 3 years ago
The workaround can be automated by patching the DaemonSet via helm and a post-renderer hook using yq:
datadog-passwd-fix.sh:
#!/bin/sh
exec yq e 'del(.. | select(has("name")) | select(.name == "passwd"))' -
helm install datadog-agent datadog/datadog --post-renderer="$PWD/datadog-passwd-fix.sh" ...
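One way to check that the post-renderer did its job is to scan the rendered output for any remaining hostPath pointing at /etc/passwd. This is a minimal sketch, assuming POSIX sh and grep; the function name `references_host_passwd` is hypothetical and not part of Helm or the Datadog chart.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the manifest on stdin still
# contains a "path: /etc/passwd" line, as a hostPath volume would.
# "mountPath: /etc/passwd" lines are deliberately not matched.
references_host_passwd() {
  grep -Eq '^[[:space:]]*path:[[:space:]]*"?/etc/passwd"?[[:space:]]*$'
}

# Usage sketch (release and chart names taken from the command above):
#   helm template datadog-agent datadog/datadog \
#     --post-renderer="$PWD/datadog-passwd-fix.sh" | references_host_passwd \
#     && echo "passwd hostPath still present" \
#     || echo "passwd hostPath removed"
```

The check only looks at the text of the rendered manifest, so it can run on any machine that has helm, without access to the cluster.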
Hello @mologie,
Can we instead add type: FileOrCreate to the hostPath volume definition?
I do not believe so. Talos' root file system is readonly, thus creation of /etc/passwd on the host will fail. The options I see are either adding a flag indicating that /etc/passwd is unavailable (the agent works just fine without it anyway), or working with the folks from Talos Systems to have Talos provide a dummy passwd with just the root user. I believe that the flag would be the more portable solution.
About the read-only file system: that's what I thought too.
So I think adding an option in the values.yaml file to avoid mounting the /etc/passwd file is the best approach. The problem I see with the post-renderer approach is that it will not work for some users, since it depends on the OS where the helm command runs.
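The portability concern is that the post-renderer needs yq on whatever machine runs helm. A guard like the following would at least make the script fail early with a clear message. This is only a sketch, assuming POSIX sh; `require_tool` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical guard: returns 0 if the named tool is on PATH, otherwise
# prints a message to stderr and returns 1.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "post-renderer: required tool '$1' not found on PATH" >&2
  return 1
}

# In datadog-passwd-fix.sh this could precede the yq call, for example:
#   require_tool yq || exit 1
#   exec yq e 'del(.. | select(has("name")) | select(.name == "passwd"))' -
```

This does not remove the dependency on the local environment; it only turns a confusing render failure into an explicit error.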
Also, I see another issue with deploying the agent on Talos.dev: if you enable log support, the agent will try to mount a hostPath in read/write mode. This volume should be converted to a LocalVolume, if Talos.dev supports it.
The /var/log/pods path is a read-write xfs partition in Talos. The Datadog agent works correctly with it and maintains a read-only mount via its DaemonSet.
Agreed, /var/log/pods and /var/log/containers are mounted in read-only mode.
But the Datadog log agent needs a file (a simple JSON file) on the host to store a pointer for each file it is tailing. After a restart (DaemonSet update or agent process restart), this file lets the agent resume tailing where it stopped and so avoid sending the same logs twice.
this file is stored here
This feature works fine on Talos too: all of /var is just a standard writable partition. I see that the agent's DaemonSet creates /var/lib/datadog-agent for that purpose and can confirm that log tailing info is persisted properly there.
^^ I'm running into the same issue with Datadog + K3s. This thread, and the datadog-passwd-fix.sh post-render hack, may just be the way to work around this.
Going to test and report back...but just wanted to share that even outside of Talos, this is a problem. I wonder if it should be raised with DataDog (GitHub issue), or through their support.
I got this mostly working with Talos by removing the system-probe container altogether (because it won't work with Talos for me) and patching out the passwd volume:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: datadog
resources:
  - namespace.yaml
  - secrets.yaml #my sealed secret
patches:
  - path: patch.json
    target:
      group: apps
      version: v1
      kind: DaemonSet
      name: datadog
patchesStrategicMerge:
  - patch.yaml
helmCharts:
  - name: datadog
    releaseName: datadog
    version: 3.1.3
    includeCRDs: true
    repo: https://helm.datadoghq.com
    valuesInline:
      datadog:
        apiKeyExistingSecret: datadog
        logs:
          enabled: true
          containerCollectAll: true
        networkMonitoring:
          enabled: true
        serviceMonitoring:
          enabled: true
      clusterAgent:
        volumes:
          - hostPath:
              path: /run/containerd/containerd.sock
            name: containerdsocket
        volumeMounts:
          - name: containerdsocket
            mountPath: /var/run/containerd/containerd.sock
      securityAgent:
        compliance:
          enabled: true
        runtime:
          enabled: true
          syscallMonitor:
            enabled: false
      kube-state-metrics:
        nodeSelector:
          kubernetes.io/arch: amd64
      agents:
        podAnnotations:
          container.apparmor.security.beta.kubernetes.io/agent: unconfined
          container.apparmor.security.beta.kubernetes.io/process-agent: unconfined
patch.json
These values always exist; this chart has no way to turn off the system-probe container.
[
{"op": "remove",
"path": "/spec/template/metadata/annotations/container.seccomp.security.alpha.kubernetes.io~1system-probe",
"value": "localhost/system-probe"},
{"op": "remove",
"path": "/spec/template/metadata/annotations/container.apparmor.security.beta.kubernetes.io~1system-probe",
"value": "unconfined"}
]
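The `~1` in the paths above is JSON Pointer escaping (RFC 6901) for a `/` inside a key, which is needed because the annotation names themselves contain a slash. As a small sketch, assuming POSIX sh and sed, the escaped path can be generated rather than typed by hand; `json_pointer_escape` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: escape one JSON Pointer reference token (RFC 6901).
# '~' must become '~0' first, then '/' becomes '~1'.
json_pointer_escape() {
  printf '%s' "$1" | sed -e 's/~/~0/g' -e 's#/#~1#g'
}

# Build the path for one of the annotations removed in patch.json.
key='container.apparmor.security.beta.kubernetes.io/system-probe'
printf '/spec/template/metadata/annotations/%s\n' "$(json_pointer_escape "$key")"
```

The order of the two substitutions matters: escaping `/` first would introduce `~` characters that the second substitution would then escape again.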
patch.yaml
Same as above: these can't be turned off, so delete them.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: datadog
spec:
  template:
    spec:
      volumes:
        - $patch: delete
          name: passwd
      containers:
        - name: process-agent
          volumeMounts:
            - $patch: delete
              mountPath: /etc/passwd
        - $patch: delete
          name: system-probe
For reference here is how I deployed Datadog agent under Talos 1.3. It needs two more workarounds:
(Set your own API key, and adjust your Datadog agent features to match what you'd like to enable.)
$ cat datadog-hostmount-fix.sh
#!/bin/sh
exec yq e 'del(.. | select(has("name")) | select(.name == "passwd" or .name == "os-release-file"))' -
$ ddog_ignore_tags=(
    image:docker.io/calico/node
    image:gcr.io/datadoghq/agent
    image:gcr.io/datadoghq/cluster-agent
    kube_namespace:metallb-system
  )
$ kubectl create ns datadog
$ kubectl -n datadog create secret generic datadog-secret --from-literal=api-key=API_KEY
$ cluster_name=foo
$ cluster_availability_zone=bar
$ cluster_env=baz
$ helm upgrade --install datadog-agent datadog/datadog \
    --post-renderer="$PWD/datadog-hostmount-fix.sh" \
    --namespace datadog \
    --set 'datadog.apiKeyExistingSecret=datadog-secret' \
    --set 'datadog.apm.portEnabled=true' \
    --set "datadog.clusterName=$cluster_name" \
    --set "datadog.containerExclude=${ddog_ignore_tags[*]}" \
    --set 'datadog.criSocketPath=/system/run/containerd/containerd.sock' \
    --set 'datadog.dogstatsd.useHostPort=true' \
    --set 'datadog.env[0].name=DD_HOSTNAME' \
    --set 'datadog.env[0].valueFrom.fieldRef.fieldPath=spec.nodeName' \
    --set 'datadog.env[1].name=DD_INVENTORIES_CONFIGURATION_ENABLED' \
    --set 'datadog.env[1].value=true' \
    --set 'datadog.logs.containerCollectAll=true' \
    --set 'datadog.logs.enabled=true' \
    --set 'datadog.processAgent.processCollection=true' \
    --set 'datadog.site=datadoghq.eu' \
    --set "datadog.tags={\"availability-zone:$cluster_availability_zone\",\"env:$cluster_env\"}" \
    --set 'clusterAgent.replicas=2' \
    --set 'clusterAgent.createPodDisruptionBudget=true'
Describe what happened:
I attempted to deploy Datadog Agent with helm to a Talos.dev cluster with default values.yaml, except I enabled the process agent. The Datadog daemonset attempted to mount /etc/passwd from the host, which does not exist on Talos. The daemonset would thus not start.
Describe what you expected:
The Datadog agent would deploy normally.
Steps to reproduce the issue:
Workaround: The daemon starts after patching the daemonset to remove the /etc/passwd volume.
Additional environment details (Operating System, Cloud provider, etc):
Talos 0.10