DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

[BUG] system-probe crashes on startup due to failure to load module #18930

Open Xavientois opened 1 year ago

Xavientois commented 1 year ago

Agent Environment

Describe what happened:

When installing the Datadog Helm chart, the system-probe container crashes with the following output:

2023-08-21 15:16:55 UTC | SYS-PROBE | WARN | (pkg/util/log/log.go:618 in func1) | Unknown key in config file: runtime_security_config.syscall_monitor.enabled
2023-08-21 15:16:55 UTC | SYS-PROBE | WARN | (pkg/util/log/log.go:618 in func1) | Unknown key in config file: runtime_security_config.activity_dump.path_merge.enabled
2023-08-21 15:16:55 UTC | SYS-PROBE | WARN | (pkg/util/log/log.go:618 in func1) | Unknown key in config file: runtime_security_config.network.enabled
2023-08-21 15:16:55 UTC | SYS-PROBE | WARN | (pkg/util/log/log.go:618 in func1) | Unknown key in config file: runtime_security_config.activity_dump.cgroup_wait_list_size
2023-08-21 15:16:55 UTC | SYS-PROBE | WARN | (pkg/util/log/log.go:618 in func1) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec
2023-08-21 15:16:55 UTC | SYS-PROBE | WARN | (pkg/util/log/log.go:613 in func1) | configuration key `runtime_security_config.activity_dump.cgroup_dump_timeout` is deprecated, use `runtime_security_config.activity_dump.dump_duration` instead
2023-08-21 15:16:55 UTC | SYS-PROBE | INFO | (pkg/config/environment_detection.go:123 in detectFeatures) | 1 Features detected from environment: kubernetes
2023-08-21 15:16:55 UTC | SYS-PROBE | INFO | (pkg/runtime/runtime.go:27 in func1) | runtime: final GOMAXPROCS value is: 1
2023-08-21 15:16:55 UTC | SYS-PROBE | INFO | (comp/core/log/logger.go:87 in Infof) | starting system-probe v7.46.0
2023-08-21 15:16:56 UTC | SYS-PROBE | INFO | (pkg/network/tracer/utils_linux.go:27 in IsTracerSupportedByOS) | running on platform: 
2023-08-21 15:16:56 UTC | SYS-PROBE | INFO | (cmd/system-probe/modules/network_tracer.go:55 in func3) | enabling network performance monitoring (NPM)
2023-08-21 15:16:56 UTC | SYS-PROBE | INFO | (pkg/network/tracer/connection/tracer.go:172 in NewTracer) | fentry tracer not supported, falling back to kprobe tracer
2023-08-21 15:17:14 UTC | SYS-PROBE | WARN | (pkg/network/tracer/connection/kprobe/tracer.go:146 in LoadTracer) | error loading CO-RE network tracer, falling back to pre-compiled: failed to init ebpf manager: couldn't load eBPF programs: map connection_protocol: map create: cannot allocate memory
2023-08-21 15:17:16 UTC | SYS-PROBE | INFO | (pkg/network/tracer/offsetguess/offsetguess.go:183 in RunOffsetGuessing) | offset guessing complete (took 1.586208525s)
2023-08-21 15:17:16 UTC | SYS-PROBE | ERROR | (cmd/system-probe/api/module/loader.go:65 in Register) | error creating module network_tracer: failed to init ebpf manager: {UID:net EBPFFuncName:tracepoint__net__net_dev_queue EBPFSection:} failed the sanity check: use CloneProbe to load 2 instances of the same program
2023-08-21 15:17:16 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:55 in Register) | module tcp_queue_length_tracer disabled
2023-08-21 15:17:16 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:55 in Register) | module oom_kill_probe disabled
2023-08-21 15:17:16 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:55 in Register) | module event_monitor disabled
2023-08-21 15:17:16 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:55 in Register) | module process disabled
2023-08-21 15:17:16 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:55 in Register) | module dynamic_instrumentation disabled
2023-08-21 15:17:16 UTC | SYS-PROBE | CRITICAL | (comp/core/log/logger.go:108 in Criticalf) | error while starting api server, exiting: failed to create system probe: no module could be loaded
Error: error while starting api server, exiting: failed to create system probe: no module could be loaded
Stream closed EOF for datadog/datadog-b8gh8 (system-probe)

Describe what you expected:

For the system-probe to start successfully

Steps to reproduce the issue:

Install the Datadog Helm chart into a Kubernetes 1.27 GKE cluster (I do not get this issue on EKS with the same chart, Kubernetes version, and config).

Use the following values.yaml:

datadog:
  prometheusScrape:
    enabled: true
    additionalConfigs:
      -
        configurations:
        - namespace: "<omitted>.${metrics_owner}"
          send_distribution_buckets: true
        autodiscovery:
          kubernetes_container_names:
            - app

  kubeStateMetricsNetworkPolicy:
    # datadog.kubeStateMetricsNetworkPolicy.create -- If true, create a NetworkPolicy for kube state metrics
    create: true

  ## This is required for Dogstatsd origin detection to work in dogstatsd and trace agent
  ## See https://docs.datadoghq.com/developers/dogstatsd/unix_socket/
  useHostPID: true

  ## dogstatsd configuration
  ## ref: https://docs.datadoghq.com/agent/kubernetes/dogstatsd/
  ## To emit custom metrics from your Kubernetes application, use DogStatsD.
  dogstatsd:
    # datadog.dogstatsd.useHostPort -- Sets the hostPort to the same value of the container port
    ## Needs to be used for sending custom metrics.
    ## The ports need to be available on all hosts.
    ##
    ## WARNING: Make sure that hosts using this are properly firewalled otherwise
    ## metrics and traces are accepted from any host able to connect to this host.
    useHostPort: true

  ## Enable logs agent and provide custom configs
  logs:
    # datadog.logs.enabled -- Enables this to activate Datadog Agent log collection
    ## ref: https://docs.datadoghq.com/agent/basic_agent_usage/kubernetes/#log-collection-setup
    enabled: true

    # datadog.logs.containerCollectAll -- Enable this to allow log collection for all containers
    ## ref: https://docs.datadoghq.com/agent/basic_agent_usage/kubernetes/#log-collection-setup
    containerCollectAll: true

    # datadog.logs.autoMultiLineDetection -- Allows the Agent to detect common multi-line patterns automatically.
    ## ref: https://docs.datadoghq.com/agent/logs/advanced_log_collection/?tab=configurationfile#automatic-multi-line-aggregation
    autoMultiLineDetection: true

  ## Enable apm agent and provide custom configs
  # apm:
  #   enabled: false

  ## Enable process agent and provide custom configs
  processAgent:
    # datadog.processAgent.processCollection -- Set this to true to enable process collection in process monitoring agent
    ## Requires processAgent.enabled to be set to true to have any effect
    processCollection: true

  networkMonitoring:
    # datadog.networkMonitoring.enabled -- Enable network performance monitoring
    enabled: true

  containerInclude: "<omitted>"

  containerExclude: "image:.*"

  # datadog.containerExcludeLogs -- Exclude logs from the Agent Autodiscovery,
  # as a space-separated list
  containerExcludeLogs: "<omitted>"

clusterAgent:
  # clusterAgent.replicas -- Specify the number of Cluster Agent replicas; if > 1, it allows the Cluster Agent to work in HA mode.
  replicas: 3
  # Enable the metricsProvider to be able to scale based on metrics in Datadog
  metricsProvider:
    # clusterAgent.metricsProvider.enabled -- Set this to true to enable Metrics Provider
    enabled: true
  resources:
    requests:
      cpu: 500m
      memory: 500Mi
    limits:
      cpu: 500m
      memory: 500Mi

agents:
  containers:
    agent:
      resources:
        requests:
          cpu: 200m
          memory: 500Mi
        limits:
          cpu: 200m
          memory: 500Mi
    processAgent:
      resources:
        requests:
          cpu: 100m
          memory: 200Mi
        limits:
          cpu: 100m
          memory: 200Mi
    traceAgent:
      env: 
      - name: DD_APM_IGNORE_RESOURCES
        value: "OPTIONS *, GET /health, GET /ready"
      resources:
        requests:
          cpu: 100m
          memory: 200Mi
        limits:
          cpu: 100m
          memory: 200Mi
    systemProbe:
      resources:
        requests:
          cpu: 100m
          memory: 200Mi
        limits:
          cpu: 100m
          memory: 200Mi
    securityAgent:
      resources:
        requests:
          cpu: 100m
          memory: 200Mi
        limits:
          cpu: 100m
          memory: 200Mi
    initContainers:
      resources:
        requests:
          cpu: 100m
          memory: 200Mi
        limits:
          cpu: 100m
          memory: 200Mi

clusterChecksRunner:
  # clusterChecksRunner.enabled -- If true, deploys agents dedicated to running the Cluster Checks instead of running them in the DaemonSet's Agents.
  ## ref: https://docs.datadoghq.com/agent/autodiscovery/clusterchecks/
  enabled: true
  resources:
    requests:
      cpu: 500m
      memory: 500Mi
    limits:
      cpu: 500m
      memory: 500Mi

kube-state-metrics:
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 200m
      memory: 256Mi

Additional environment details (Operating System, Cloud provider, etc):

hmike96 commented 1 year ago

Getting the same issue with the same version of the Agent on OpenShift 4.11.

hmahmood commented 1 year ago

For the warning in the logs:

2023-08-21 15:17:14 UTC | SYS-PROBE | WARN | (pkg/network/tracer/connection/kprobe/tracer.go:146 in LoadTracer) | error loading CO-RE network tracer, falling back to pre-compiled: failed to init ebpf manager: couldn't load eBPF programs: map connection_protocol: map create: cannot allocate memory

This can be avoided by bumping up the memory limit (currently 200Mi), which should allow the system-probe to load.
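
A minimal sketch of that change in the chart's values.yaml, assuming the same agents.containers.systemProbe path used in the report above (the 500Mi figure is only an illustrative bump, not a verified minimum):

agents:
  containers:
    systemProbe:
      resources:
        requests:
          cpu: 100m
          memory: 200Mi
        limits:
          cpu: 100m
          memory: 500Mi  # raised from 200Mi so the eBPF maps can be allocated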

The error:

2023-08-21 15:17:16 UTC | SYS-PROBE | ERROR | (cmd/system-probe/api/module/loader.go:65 in Register) | error creating module network_tracer: failed to init ebpf manager: {UID:net EBPFFuncName:tracepoint__net__net_dev_queue EBPFSection:} failed the sanity check: use CloneProbe to load 2 instances of the same program

... is caused by the previous warning, and is fixed in the upcoming 7.48 release.

dlouvier commented 1 year ago

I was able to reproduce the issue as well on GKE with the latest Helm chart.

Bumping the container limits as suggested seems to solve the problem.

In the values file, under agents.containers.systemProbe.resources:

        requests:
          cpu: 100m
          memory: 200Mi
        limits:
          cpu: 300m
          memory: 500Mi

hmahmood commented 1 year ago

@Xavientois is this resolved?

martinpiegay commented 1 year ago

Reporting here what Datadog Support told me:

A fix for this issue was introduced in 7.48 under this PR and has since also been backported to 7.47.1.
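
If upgrading the chart right away is not an option, the Agent image tag can also be pinned to a release that carries the fix. A sketch, assuming the chart's standard agents.image values (the 7.47.1 tag comes from the comment above):

agents:
  image:
    tag: "7.47.1"  # release with the backported fix, per the comment above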