DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Trace-Agent Fails to Start with Permission Denied Error After Upgrading to Datadog Agent `7.57.0` #29155

Closed LQss11 closed 1 month ago

LQss11 commented 1 month ago

Description

After upgrading to Datadog Agent 7.57.0 from 7.56.2, the trace-agent fails to start due to a permission error with the UDS listener, despite having datadog.apm.socketEnabled set to false.

Configuration

Here is the relevant portion of my values.yaml configuration:

targetSystem: linux
providers:
  aks:
    enabled: true

clusterAgent:
  image:
    doNotCheckTag: true
    tag: 7.57.0
  admissionController:
    configMode: service
    enabled: true
    mutateUnlabelled: true
  env:
  - name: DD_ADMISSION_CONTROLLER_AUTO_INSTRUMENTATION_INIT_SECURITY_CONTEXT
    value: "{\"capabilities\":{\"drop\":[\"ALL\"]},\"runAsNonRoot\":true,\"runAsUser\":10000,\"readOnlyRootFilesystem\":true,\"allowPrivilegeEscalation\":false,\"seccompProfile\":{\"type\":\"RuntimeDefault\"}}"            
datadog:
  apiKeyExistingSecret: datadog-secret
  site: datadoghq.com
  apm:
    portEnabled: true
    instrumentation:
      enabled: false
    # I expected this to disable the socket so that the trace-agent uses the host IP instead
    socketEnabled: false
agents:
  containers:
    traceAgent:
      securityContext:
        runAsUser: 100
        runAsNonRoot: true
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false

  image:
    doNotCheckTag: true
    tag: 7.57.0

Error Message

The trace-agent logs show the following error:

2024-09-09 17:29:59 UTC | TRACE | CRITICAL | (pkg/trace/api/api.go:712 in func2) | Error creating UDS listener: listen unix /var/run/datadog/apm.socket: bind: permission denied

Steps to Reproduce

  1. Upgrade Datadog Agent to version 7.57.0.
  2. Apply the above configuration.
  3. Observe that the trace-agent fails to start with a permission denied error.

Additional Information

Could you please help in diagnosing this issue or provide guidance on how to resolve the permission issue with the UDS listener for the trace-agent?

rodehoed commented 1 month ago

This happens not only on k8s but also on a RHEL 9 system here.

FlorentClarret commented 1 month ago

Hi @LQss11 and @rodehoed,

We prepared a fix and we are going to release it very soon in 7.57.1.

ichinaski commented 1 month ago

To expand on the issue a bit: the crash appears to come from the trace-agent process lacking permissions (likely due to the explicit securityContext restrictions in the configuration), combined with a new feature in 7.57 that defines a default UDS listener on /var/run/datadog/apm.socket. Seeing that the directory exists, the trace-agent attempts to create the listener at startup, which fails, and that failure is what led to the crash.

The fix we have put together in https://github.com/DataDog/datadog-agent/pull/29218 will make sure we don't crash in these circumstances; the agent will just log the error and continue its startup.

LQss11 commented 1 month ago

Thanks @ichinaski / @FlorentClarret for the responses! I resolved the issue by updating the Helm chart with:

agents:      
  env:
    # This works
    - name: DD_APM_RECEIVER_SOCKET
      value: "unix:///var/run/datadog/apm.socket"
    # This does not work
    # - name: DD_APM_RECEIVER_SOCKET
    #   value: "/var/run/datadog/apm.socket"

Even though /var/run/datadog/apm.socket is the default, specifying it without the unix:// prefix caused issues.

I also faced issues related to the new log launcher feature, which uses a JSON file under /opt/.... To fix it, I used different versions:

For future CI stability, I recommend testing with unprivileged user IDs. I’ve raised issue #29286.

FlorentClarret commented 1 month ago

Hello @LQss11 and @rodehoed,

We just released Agent 7.57.1 with a fix for this issue.

ichinaski commented 1 month ago

Closing this issue given the fix is now released.