DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Agent does not start with read-only file system #15127

Open kayman-mk opened 1 year ago

kayman-mk commented 1 year ago

Our security team asked me to make the root file system of all containers read-only. However, the Datadog agent dies at startup and cannot run on a read-only file system.

Log output

2023-01-18T09:21:53.820+01:00 | [s6-init] making user provided files available at /var/run/s6/etc...exited 0.
2023-01-18T09:21:53.906+01:00 | [s6-init] ensuring user provided files have correct perms...exited 0.
2023-01-18T09:21:53.945+01:00 | [fix-attrs.d] applying ownership & permissions fixes...
2023-01-18T09:21:53.959+01:00 | [fix-attrs.d] done.
2023-01-18T09:21:53.959+01:00 | [cont-init.d] executing container initialization scripts...
2023-01-18T09:21:53.959+01:00 | [cont-init.d] 01-check-apikey.sh: executing...
2023-01-18T09:21:53.960+01:00 | [cont-init.d] 01-check-apikey.sh: exited 0.
2023-01-18T09:21:53.962+01:00 | [cont-init.d] 50-ci.sh: executing...
2023-01-18T09:21:53.972+01:00 | ln: failed to create symbolic link '/etc/datadog-agent/datadog.yaml': Read-only file system
2023-01-18T09:21:53.972+01:00 | [cont-init.d] 50-ci.sh: exited 0.
2023-01-18T09:21:53.972+01:00 | [cont-init.d] 50-ecs.sh: executing...
2023-01-18T09:21:53.990+01:00 | rm: cannot remove '/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default': Read-only file system
2023-01-18T09:21:53.990+01:00 | rm: cannot remove '/etc/datadog-agent/conf.d/network.d/conf.yaml.default': Read-only file system
2023-01-18T09:21:53.990+01:00 | rm: cannot remove '/etc/datadog-agent/conf.d/io.d/conf.yaml.default': Read-only file system
2023-01-18T09:21:53.990+01:00 | rm: cannot remove '/etc/datadog-agent/conf.d/ntp.d/conf.yaml.default': Read-only file system
2023-01-18T09:21:53.990+01:00 | rm: cannot remove '/etc/datadog-agent/conf.d/uptime.d/conf.yaml.default': Read-only file system
2023-01-18T09:21:53.990+01:00 | rm: cannot remove '/etc/datadog-agent/conf.d/disk.d/conf.yaml.default': Read-only file system
2023-01-18T09:21:53.990+01:00 | rm: cannot remove '/etc/datadog-agent/conf.d/load.d/conf.yaml.default': Read-only file system
2023-01-18T09:21:53.990+01:00 | rm: cannot remove '/etc/datadog-agent/conf.d/file_handle.d/conf.yaml.default': Read-only file system
2023-01-18T09:21:53.990+01:00 | rm: cannot remove '/etc/datadog-agent/conf.d/memory.d/conf.yaml.default': Read-only file system
2023-01-18T09:21:53.993+01:00 | [cont-init.d] 50-ecs.sh: exited 123.
2023-01-18T09:21:54.020+01:00 | [cont-finish.d] executing container finish scripts...
2023-01-18T09:21:54.022+01:00 | [cont-finish.d] done.
2023-01-18T09:21:54.023+01:00 | [s6-finish] waiting for services.
2023-01-18T09:21:54.227+01:00 | [s6-finish] sending all processes the TERM signal.
2023-01-18T09:21:57.262+01:00 | [s6-finish] sending all processes the KILL signal and exiting.

Agent Environment

I am pulling the agent from public.ecr.aws/datadog/agent:latest. I do not see a version number in the log. I included it as a sidecar in my AWS ECS task definition.

Describe what happened: After setting "readonlyRootFilesystem": true, in the task definition, the Datadog agent isn't able to start.

Describe what you expected: Datadog agent should run as normal.

Steps to reproduce the issue: Run the agent as a sidecar in AWS ECS. Set "readonlyRootFilesystem": true, in your container task definition.

Additional environment details (Operating System, Cloud provider, etc): AWS ECS

tomwire commented 1 year ago

Funny, I'm just now checking this off on my InfoSec checklist... Perfect timing?

vyrtus15 commented 1 year ago

+1 waiting for Datadog agent to work with read-only FS.

clamoriniere commented 1 year ago

Hi @kayman-mk, @tomwire and @vyrtus15

Thanks for reporting this issue.

In order to prioritise this feature request, please contact Datadog support and link this issue.

Thanks for your understanding. 🙇

kayman-mk commented 1 year ago

Support contacted: https://help.datadoghq.com/hc/en-us/requests/1101939

maaz-nafees commented 1 year ago

Hi @kayman-mk, I ran into the same issue. Were you able to resolve this problem?

kayman-mk commented 1 year ago

@clamoriniere Any news here?

Support answered on Feb 20 with:

Thanks for getting back to me. I understand this is an important feature for your organisation. I've gone ahead and created a Feature Request for this with a note of its impact on your business. In the meantime I'm going to mark this ticket as closed as your request has been processed.

tomwire commented 1 year ago

@kayman-mk

Our workaround was to run docker diff against the running container to get a list of all the paths that are written inside it. Then, in the task definition that uses the Datadog image, we added a Docker volume covering the paths that came back from docker diff. This doesn't necessarily need to be a Docker volume; any writable volume would work. We only needed to link /etc/datadog-agent and /opt/datadog-agent to that volume before locking down the root volume. I suspect other people may have different paths that need to be writable, but that's what worked for us.

Our agent is currently running and reporting correctly with the root volume locked.
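The docker diff step in this workaround can be sketched in Python. This is only an illustration, not the script tomwire used: it assumes the usual `docker diff` output format (a change marker `A`, `C`, or `D` followed by a path), and the function name `written_roots`, the `depth` parameter, and the sample output (including the `example.state` file name) are made up for the example.

```python
from pathlib import PurePosixPath


def written_roots(diff_output: str, depth: int = 2) -> list[str]:
    """Collapse `docker diff` output into candidate mount points.

    Each changed path is truncated to its first `depth` segments, so
    /etc/datadog-agent/conf.d/cpu.d/... becomes /etc/datadog-agent.
    """
    roots = set()
    for line in diff_output.splitlines():
        parts = line.split(maxsplit=1)
        # Keep only well-formed "<marker> <path>" lines.
        if len(parts) != 2 or parts[0] not in {"A", "C", "D"}:
            continue
        # PurePosixPath("/etc/x").parts is ("/", "etc", "x"); drop the root.
        segments = PurePosixPath(parts[1]).parts[1 : depth + 1]
        if len(segments) < depth:
            continue  # skip shallow entries such as "C /etc"
        roots.add("/" + "/".join(segments))
    return sorted(roots)


# Hypothetical sample, shaped like the output of `docker diff <container>`.
SAMPLE = """C /etc
C /etc/datadog-agent
A /etc/datadog-agent/datadog.yaml
D /etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
C /opt
C /opt/datadog-agent
A /opt/datadog-agent/run/example.state
"""

if __name__ == "__main__":
    for root in written_roots(SAMPLE):
        print(root)
```

Each printed path is a candidate for a writable volume. The right `depth` depends on the image layout, so the result should be checked by hand before it goes into a task definition.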

kayman-mk commented 1 year ago

Good solution, @tomwire, but I am a little afraid that I will run into problems if I update the agent and the new version writes to a different set of files than the one before.

tomwire commented 1 year ago

@kayman-mk 100% agree, this is definitely the concern we have. I suspect the solution might end up being the configuration I recommended plus a promise from DD that the filesystem layout will not change without proper notice. And some extra caution: our stacks are nothing alike, so results may vary.

FWIW, our pipelines for our agents always grab the latest DD image, then build and deploy on a routine schedule. We haven't had any issues since, and there have been updates.

I suppose a script that monitors syslog messages for permission errors on writes to files outside of the mounted volumes would save some headaches, but I'm going to cross that bridge when DD breaks. I have a feeling the agents are well engineered and won't be throwing many surprises.
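A minimal sketch of such a check, assuming the errors look like the `Read-only file system` lines in the log output at the top of this issue. The function name `read_only_failures` and the `writable_mounts` parameter are hypothetical; nothing here comes from the agent itself.

```python
import re

# Matches messages such as:
#   rm: cannot remove '/etc/.../conf.yaml.default': Read-only file system
READ_ONLY_ERROR = re.compile(r"'([^']+)': Read-only file system")


def read_only_failures(log_lines, writable_mounts=()):
    """Return paths that failed with a read-only error and are not
    located under any of the given writable mount points."""
    failures = []
    for line in log_lines:
        match = READ_ONLY_ERROR.search(line)
        if not match:
            continue
        path = match.group(1)
        covered = any(
            path == mount or path.startswith(mount.rstrip("/") + "/")
            for mount in writable_mounts
        )
        if not covered and path not in failures:
            failures.append(path)
    return failures


# Two lines taken from the log output in the original report.
LOG = [
    "[cont-init.d] 50-ci.sh: executing...",
    "ln: failed to create symbolic link '/etc/datadog-agent/datadog.yaml': Read-only file system",
    "rm: cannot remove '/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default': Read-only file system",
]

if __name__ == "__main__":
    for path in read_only_failures(LOG, writable_mounts=["/opt/datadog-agent"]):
        print("write blocked outside mounted volumes:", path)
```

Running this against the container logs after each image update would flag newly written paths that the mounted volumes do not yet cover.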

thiago-youper commented 10 months ago

+1 waiting for Datadog agent to work with read-only FS.

aayushchhabra1999 commented 10 months ago

+1

jornskjerven commented 9 months ago

+1

cgspohn commented 9 months ago

+1 Other vendors are supporting this already, so waiting for the official solution by DataDog. Formal support case also entered.

Siivers commented 9 months ago

+1

danlaramay commented 9 months ago

+1

naomichi-y commented 8 months ago

+1

SlevinWasAlreadyTaken commented 8 months ago

+1

h-nago commented 8 months ago

+1

jdliauw commented 8 months ago

+1

yokobot commented 8 months ago

+1

rod-murphy commented 7 months ago

+1

marklynch commented 6 months ago

Given this article https://docs.datadoghq.com/security/default_rules/cis-docker-1.2.0-5.12/ it would be good to see progress on this.

eli-gc commented 5 months ago

I just got my agent deployed in AKS with a read-only root filesystem. I am using Helm chart v3.52.0, with readOnlyRootFilesystem enabled for the initContainers, agent, process agent, and cluster agent. Not sure if this is a new feature, but it might be worth trying again for those of you who haven't checked in a while.

henare commented 4 months ago

I also successfully have the agent running with a read-only root filesystem. This is on ECS Fargate.

When the agent boots, it tries to write configuration to /etc/datadog-agent, so you have to mount a read/write filesystem at this location. This can be done in your task definition by creating a volume and mounting it at that location in the agent container definition.
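As a rough illustration of that setup, the Python sketch below builds the relevant part of an ECS task definition and prints it as JSON. The keys `volumes`, `mountPoints`, `sourceVolume`, `containerPath`, `readOnly`, and `readonlyRootFilesystem` are standard ECS task definition fields; the container name, the derived volume names, and the helper `task_definition_fragment` are assumptions made for the example, and a real task definition needs more fields than are shown here.

```python
import json


def task_definition_fragment(writable_paths, image="public.ecr.aws/datadog/agent:latest"):
    """Build the volume-related part of an ECS task definition that keeps
    the root filesystem read-only but mounts writable volumes."""
    volumes = []
    mount_points = []
    for path in writable_paths:
        # Hypothetical naming scheme: /etc/datadog-agent -> etc-datadog-agent
        name = path.strip("/").replace("/", "-")
        volumes.append({"name": name})
        mount_points.append(
            {"sourceVolume": name, "containerPath": path, "readOnly": False}
        )
    return {
        "volumes": volumes,
        "containerDefinitions": [
            {
                "name": "datadog-agent",  # assumed container name
                "image": image,
                "readonlyRootFilesystem": True,
                "mountPoints": mount_points,
            }
        ],
    }


if __name__ == "__main__":
    fragment = task_definition_fragment(["/etc/datadog-agent"])
    print(json.dumps(fragment, indent=2))
```

The printed JSON can be merged into an existing task definition. Other reports in this thread also mount a writable path under /opt/datadog-agent, so the list of paths may need to be extended for your setup.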

jjshinobi commented 2 months ago

+1 Can we please prioritise this? We'd like this to be solved in the Datadog agent rather than applying the workaround mentioned above. Thank you!

nihauc12 commented 1 month ago

+1

jjshinobi commented 3 weeks ago

This docker-compose.yml helps to test the issue locally. Working version:

services:
  datadog:
    image: public.ecr.aws/datadog/agent:7
    environment:
      - DD_API_KEY=<your_api_key>
      - DD_LOGS_ENABLED=true
      - DD_LOG_LEVEL=DEBUG
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /sys/fs/cgroup:/host/sys/fs/cgroup:ro
      - datadog:/etc/datadog-agent
      - datadog:/opt/datadog-agent/run
    read_only: true
volumes:
  datadog:

If /opt/datadog-agent itself is mounted, the container dies. The codebase references the /opt/datadog-agent/run mount point for the case where the agent runs in a Kubernetes cluster.