DataDog / helm-charts

Helm charts for Datadog products
Apache License 2.0

Datadog on Talos.dev: Option for not mounting /etc/passwd or providing an empty dummy file #273

Open mologie opened 3 years ago

mologie commented 3 years ago

Describe what happened:

I attempted to deploy Datadog Agent with helm to a Talos.dev cluster with default values.yaml, except I enabled the process agent. The Datadog daemonset attempted to mount /etc/passwd from the host, which does not exist on Talos. The daemonset would thus not start.

Describe what you expected:

The Datadog agent would deploy normally.

Steps to reproduce the issue:

  1. Spawn a dummy talos.dev cluster
  2. helm install --set datadog.processAgent.enabled=true ...
  3. Observe that the daemonset pods fail to start (failed to create directory /etc/passwd)

Workaround: the pods start after patching the DaemonSet to remove the /etc/passwd volume.

Additional environment details (Operating System, Cloud provider, etc):

Talos 0.10

mologie commented 3 years ago

The workaround can be automated by patching the DaemonSet via helm and a post-renderer hook using yq:

datadog-passwd-fix.sh:

#!/bin/sh
exec yq e 'del(.. | select(has("name")) | select(.name == "passwd"))' -

Then invoke helm with the script as a post-renderer:

helm install datadog-agent datadog/datadog --post-renderer="$PWD/datadog-passwd-fix.sh" ...
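For readers unfamiliar with yq's recursive-descent syntax, here is a rough Python sketch of what the filter does (the helper function and sample data are illustrative, not the chart's actual rendered manifest): it walks the manifest and drops every mapping whose name is "passwd", which removes both the volume and the matching volumeMounts entry.

```python
def drop_named(node, name):
    """Recursively remove list entries that are mappings with a matching
    'name' key, roughly mirroring:
    yq e 'del(.. | select(has("name")) | select(.name == "passwd"))'"""
    if isinstance(node, dict):
        return {k: drop_named(v, name) for k, v in node.items()}
    if isinstance(node, list):
        return [drop_named(x, name) for x in node
                if not (isinstance(x, dict) and x.get("name") == name)]
    return node

# Illustrative, stripped-down DaemonSet pod spec
spec = {
    "volumes": [
        {"name": "passwd", "hostPath": {"path": "/etc/passwd"}},
        {"name": "logs", "hostPath": {"path": "/var/log/pods"}},
    ],
    "containers": [
        {"name": "agent",
         "volumeMounts": [
             {"name": "passwd", "mountPath": "/etc/passwd"},
             {"name": "logs", "mountPath": "/var/log/pods"},
         ]},
    ],
}
cleaned = drop_named(spec, "passwd")
```

After this runs, neither the "passwd" volume nor its mount remains, while everything else in the spec is untouched.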
clamoriniere commented 3 years ago

hello @mologie

Could we instead add type: FileOrCreate to the hostPath volume definition?
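For reference, the suggestion would look roughly like this in the volume definition (an illustrative snippet, not the chart's actual template):

```yaml
volumes:
  - name: passwd
    hostPath:
      path: /etc/passwd
      type: FileOrCreate  # mounts the file if present, else creates an empty one
```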

mologie commented 3 years ago

I do not believe so. Talos' root file system is read-only, so creating /etc/passwd on the host would fail. The options I see are either adding a flag indicating that /etc/passwd is unavailable (the agent works just fine without it anyway), or working with the folks at Talos Systems to have Talos ship a dummy passwd with just the root user. I believe the flag would be the more portable solution.
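Such a flag might look something like this in values.yaml (the option name here is hypothetical; no such flag existed in the chart at the time of this thread):

```yaml
datadog:
  processAgent:
    enabled: true
  # hypothetical option: skip the /etc/passwd hostPath volume entirely
  disablePasswdVolume: true
```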

clamoriniere commented 3 years ago

About the read-only file system: that's what I thought too.

So I think adding an option in the values.yaml file to avoid mounting the /etc/passwd file is the best approach. The problem I see with the post-render approach is that it will not work for some users, since it depends on the OS where the helm command runs.

Also, I see another issue for deploying the agent on Talos.dev: if you enable log support, the agent will try to mount a hostPath in read/write mode. This volume should be converted to a LocalVolume, if Talos.dev supports it.

mologie commented 3 years ago

The /var/log/pods path is a read-write xfs partition in Talos. The Datadog agent works correctly with it and maintains a read-only mount via its DaemonSet.

clamoriniere commented 3 years ago

Agreed, /var/log/pods and /var/log/containers are mounted in read-only mode.

But the Datadog log agent needs a file (a simple JSON file) on the host to store a pointer for each file it is tailing. This file lets the agent, after a restart (DaemonSet update or agent process restart), resume tailing files where it stopped and so avoid sending the same logs twice.

this file is stored here

mologie commented 3 years ago

This feature works fine on Talos too -- all of /var is just a standard writable partition. I see that the agent's DaemonSet creates /var/lib/datadog-agent for that purpose and can confirm that log tailing info is persisted properly there.

armenr commented 2 years ago

^^ I'm running into the same issue with Datadog + K3s. This thread, and the datadog-passwd-fix.sh post-render hack, may just be the way to work around it.

Going to test and report back... but I wanted to share that this is a problem even outside of Talos. I wonder if it should be raised with Datadog (GitHub issue) or through their support.

myoung34 commented 1 year ago

I got this mostly working with Talos by removing system-probe altogether (because it won't work with Talos for me) and patching out the passwd mount:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: datadog

resources:
  - namespace.yaml
  - secrets.yaml #my sealed secret
patches:
  - path: patch.json
    target:
      group: apps
      version: v1
      kind: DaemonSet
      name: datadog
patchesStrategicMerge:
  - patch.yaml

helmCharts:
- name: datadog
  releaseName: datadog
  version: 3.1.3
  includeCRDs: true
  repo: https://helm.datadoghq.com
  valuesInline:
    datadog:
      apiKeyExistingSecret:  datadog
      logs:
        enabled: true
        containerCollectAll: true
      networkMonitoring:
        enabled: true
      serviceMonitoring:
        enabled: true
    clusterAgent:
      volumes:
        - hostPath:
            path: /run/containerd/containerd.sock
          name: containerdsocket
      volumeMounts:
        - name: containerdsocket
          mountPath: /var/run/containerd/containerd.sock
      securityAgent:
        compliance:
          enabled: true
        runtime:
          enabled: true
          syscallMonitor:
            enabled: false
    kube-state-metrics:
      nodeSelector:
        kubernetes.io/arch: amd64
    agents:
      podAnnotations:
        container.apparmor.security.beta.kubernetes.io/agent: unconfined
        container.apparmor.security.beta.kubernetes.io/process-agent: unconfined

patch.json (these annotations always exist; the chart has no way to turn off the system-probe container):

[
  {"op": "remove",
   "path": "/spec/template/metadata/annotations/container.seccomp.security.alpha.kubernetes.io~1system-probe",
   "value": "localhost/system-probe"},
  {"op": "remove",
   "path": "/spec/template/metadata/annotations/container.apparmor.security.beta.kubernetes.io~1system-probe",
   "value": "unconfined"}
]

patch.yaml (same as above: it can't be turned off, so delete it):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: datadog
spec:
  template:
    spec:
      volumes:
      - $patch: delete
        name: passwd
      containers:
      - name: process-agent
        volumeMounts:
        - $patch: delete
          mountPath: /etc/passwd
      - $patch: delete
        name: system-probe
mologie commented 1 year ago

For reference, here is how I deployed the Datadog agent under Talos 1.3. It needs two more workarounds:


(Set your own API key, and adjust your Datadog agent features to match what you'd like to enable.)

$ cat datadog-hostmount-fix.sh
#!/bin/sh
exec yq e 'del(.. | select(has("name")) | select(.name == "passwd" or .name == "os-release-file"))' -
$ ddog_ignore_tags=(
  image:docker.io/calico/node
  image:gcr.io/datadoghq/agent
  image:gcr.io/datadoghq/cluster-agent
  kube_namespace:metallb-system
)
$ kubectl create ns datadog
$ kubectl -n datadog create secret generic datadog-secret --from-literal=api-key=API_KEY
$ cluster_name=foo
$ cluster_availability_zone=bar
$ cluster_env=baz
$ helm upgrade --install datadog-agent datadog/datadog \
  --post-renderer="$PWD/datadog-hostmount-fix.sh" \
  --namespace datadog \
  --set 'datadog.apiKeyExistingSecret=datadog-secret' \
  --set 'datadog.apm.portEnabled=true' \
  --set "datadog.clusterName=$cluster_name" \
  --set "datadog.containerExclude=${ddog_ignore_tags[*]}" \
  --set 'datadog.criSocketPath=/system/run/containerd/containerd.sock' \
  --set 'datadog.dogstatsd.useHostPort=true' \
  --set 'datadog.env[0].name=DD_HOSTNAME' \
  --set 'datadog.env[0].valueFrom.fieldRef.fieldPath=spec.nodeName' \
  --set 'datadog.env[1].name=DD_INVENTORIES_CONFIGURATION_ENABLED' \
  --set 'datadog.env[1].value=true' \
  --set 'datadog.logs.containerCollectAll=true' \
  --set 'datadog.logs.enabled=true' \
  --set 'datadog.processAgent.processCollection=true' \
  --set 'datadog.site=datadoghq.eu' \
  --set "datadog.tags={\"availability-zone:$cluster_availability_zone\",\"env:$cluster_env\"}" \
  --set 'clusterAgent.replicas=2' \
  --set 'clusterAgent.createPodDisruptionBudget=true'