system-probe container crashing

michaelst commented 2 years ago

Describe what happened:

I upgraded the chart from 2.23.1 to 2.27.3, and the system-probe container on the daemonset now ends up in a crash loop backoff.

2021-12-07 19:19:36 UTC | SYS-PROBE | INFO | (pkg/util/log/log.go:610 in func1) | runtime: final GOMAXPROCS value is: 4
2021-12-07 19:19:36 UTC | SYS-PROBE | WARN | (pkg/util/log/log.go:630 in func1) | Unknown key in config file: runtime_security_config.debug
2021-12-07 19:19:36 UTC | SYS-PROBE | INFO | (pkg/util/log/log.go:610 in func1) | Features detected from environment: kubernetes
2021-12-07 19:19:36 UTC | SYS-PROBE | INFO | (pkg/util/log/log.go:605 in func1) | runtime_security_config.enabled or runtime_security_config.fim_enabled detected, enabling system-probe
2021-12-07 19:19:36 UTC | SYS-PROBE | INFO | (cmd/system-probe/app/run.go:143 in StartSystemProbe) | running system-probe with version: Agent 7.32.1 - Commit: 52f034f743 - Serialization version: v4.85.0 - Go version: go1.16.7
2021-12-07 19:19:36 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:42 in Register) | network_tracer module disabled
2021-12-07 19:19:36 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:42 in Register) | tcp_queue_length_tracer module disabled
2021-12-07 19:19:36 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:42 in Register) | oom_kill_probe module disabled
2021-12-07 19:19:37 UTC | SYS-PROBE | INFO | (pkg/tagger/remote/tagger.go:336 in func1) | tagger stream established successfully
2021-12-07 19:19:37 UTC | SYS-PROBE | INFO | (pkg/tagger/remote/tagger.go:109 in Init) | remote tagger initialized successfully
2021-12-07 19:19:39 UTC | SYS-PROBE | ERROR | (cmd/system-probe/api/module/loader.go:56 in Register) | error registering HTTP endpoints for module `security_runtime` error: failed to start probe: probes activation validation failed: 3 errors occurred:
        * AllOf requirement failed, the following probes are not running [{UID:security EBPFSection:kprobe/sel_write_enforce EBPFFuncName:kprobe_sel_write_enforce}: couldn't enable kprobe {UID:security EBPFSection:kprobe/sel_write_enforce EBPFFuncName:kprobe_sel_write_enforce}: cannot write "p:p_sel_write_enforce_security_1155936 sel_write_enforce\n" to kprobe_events: write /sys/kernel/debug/tracing/kprobe_events: no such file or directory]
        * AllOf requirement failed, the following probes are not running [{UID:security EBPFSection:kprobe/sel_write_bool EBPFFuncName:kprobe_sel_write_bool}: couldn't enable kprobe {UID:security EBPFSection:kprobe/sel_write_bool EBPFFuncName:kprobe_sel_write_bool}: cannot write "p:p_sel_write_bool_security_1155936 sel_write_bool\n" to kprobe_events: write /sys/kernel/debug/tracing/kprobe_events: no such file or directory]
        * AllOf requirement failed, the following probes are not running [{UID:security EBPFSection:kprobe/sel_commit_bools_write EBPFFuncName:kprobe_sel_commit_bools_write}: couldn't enable kprobe {UID:security EBPFSection:kprobe/sel_commit_bools_write EBPFFuncName:kprobe_sel_commit_bools_write}: cannot write "p:p_sel_commit_bools_write_security_1155936 sel_commit_bools_write\n" to kprobe_events: write /sys/kernel/debug/tracing/kprobe_events: no such file or directory]
2021-12-07 19:19:39 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:42 in Register) | process module disabled
2021-12-07 19:19:39 UTC | SYS-PROBE | CRITICAL | (cmd/system-probe/app/run.go:188 in StartSystemProbe) | Error while starting api server, exiting: failed to create system probe: no module could be loaded
Error: Error while starting api server, exiting: failed to create system probe: no module could be loaded

Additional environment details (Operating System, Cloud provider, etc):

GKE

clamoriniere commented 2 years ago

Hi @michaelst, could you give us the version of the datadog-agent installed with your chart? You can check the image tag, or run the command kubectl exec -it <agent pod name> -c agent -- agent version

The issue seems to be related to the agent version update rather than the chart version. By updating the chart, you might be using a different agent version.

Based on the changelogs between 2.23.1 and 2.27.1, the default agent version moved from 7.31.0 to 7.32.1.

For now, what I can propose is to force the previous agent version with the parameter agents.image.tag:7.31.1, which overrides the default value.

Or in your values.yaml:

agents:
  image:
    tag: 7.31.1

michaelst commented 2 years ago

We weren't overriding the datadog-agent version, so whatever was part of the chart is what was installed. I will try the upgrade again tomorrow with the agent version overridden to the old one.

clamoriniere commented 2 years ago

@michaelst To better understand your configuration, could you share the values.yaml used to deploy/update your datadog chart deployment? 🙇 Was datadog.securityAgent.runtime.enabled set to true?

michaelst commented 2 years ago


datadog:
  containerExclude: 'image:gke.gcr.io/* image:k8s.gcr.io/* image:gcr.io/stackdriver-agents/* image:gcr.io/datadoghq/* image:docker.io/istio/pilot'
  dogstatsd:
    useHostPort: true
  logs:
    enabled: true
    containerCollectAll: true
  apm:
    portEnabled: true
  orchestratorExplorer:
    container_scrubbing:
      enabled: true
  kubeStateMetricsNetworkPolicy:
    create: true
  networkPolicy:
    create: true
  processAgent:
    processCollection: true
  # https://docs.datadoghq.com/integrations/kubernetes_state_core/?tab=helm
  kubeStateMetricsCore:
    enabled: true
  kubeStateMetricsEnabled: false
  securityAgent:
    compliance:
      enabled: true
    runtime:
      enabled: true
clusterAgent:
  podSecurity:
    podSecurityPolicy:
      create: true
agents:
  podSecurity:
    podSecurityPolicy:
      create: true
  containers:
    agent:
      env:
      - name: DD_LEADER_ELECTION
        value: "true"

lebauce commented 2 years ago

Hello @michaelst, are you running Container-Optimized OS? We just merged a PR that should let CWS start on kernels missing SELinux support (which is the case here); it will be part of the upcoming 7.33 release. That being said, we do not support Container-Optimized OS yet, and some features may not work correctly. Proper support is on the way though.
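
As a rough sketch of an interim option until an agent with that fix is available: the only module system-probe tried to load in the log above was security_runtime, so turning runtime security off in values.yaml should stop that module from being requested. This is an assumption and was not confirmed in this thread. It uses only keys from the values.yaml posted above, and it gives up CWS runtime detection while it is in place. Whether the chart still schedules a system-probe container once no module is enabled may depend on the chart version.

```yaml
# Sketch only, not a confirmed fix: disable runtime security (CWS) so the
# failing security_runtime module is no longer requested on this kernel.
datadog:
  securityAgent:
    compliance:
      enabled: true    # unchanged from the values posted above
    runtime:
      enabled: false   # assumption: set back to true once on an agent release that includes the fix
```

Pinning agents.image.tag to the previous agent version, as suggested earlier, is the alternative that keeps runtime security enabled.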

michaelst commented 2 years ago

Yes, we are running Container-Optimized OS with Containerd (cos_containerd). We hadn't had any issues with it until this, though.

lodotek commented 1 year ago

Is this still an issue? Why has this been open for years now?

lodotek commented 1 year ago

I seem to be having a similar issue. Only one of the DD agent pods is having the issue with the system-probe container crashing.

2023-05-11 20:25:26 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:50 in Register) | module tcp_queue_length_tracer disabled
2023-05-11 20:25:26 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:50 in Register) | module oom_kill_probe disabled
2023-05-11 20:25:26 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:50 in Register) | module security_runtime disabled
2023-05-11 20:25:26 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:50 in Register) | module process disabled
2023-05-11 20:25:26 UTC | SYS-PROBE | CRITICAL | (comp/core/log/logger.go:92 in Criticalf) | error while starting api server, exiting: failed to create system probe: no module could be loaded
Error: error while starting api server, exiting: failed to create system probe: no module could be loaded
Stream closed EOF for datadog/datadog-agent-gkd5m (system-probe)

any ideas?

This is with datadog-3.29.0

clamoriniere commented 1 year ago

Hi @lodotek

Thanks for commenting on this issue, but I think the problem you are facing is not related to the one initially reported.

Unfortunately, the investigation will require more information, such as the helm values.yaml file, the agent version, and the agent configuration.

Could you please contact Datadog support and provide an Agent flare and a Cluster-Agent flare?

clamoriniere commented 1 year ago

For users who encounter system-probe container start issues on COS (Container-Optimized OS): we recently added a new option to the chart that configures the agent deployment specifically for COS. If you are using COS, please add this option to the values.yaml used for your chart deployment:

providers:
  gke:
    cos: true
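
For illustration, here is how that option might sit next to the runtime security settings from the values.yaml posted earlier in this thread. This is a minimal sketch that uses only keys already shown here. The thread does not confirm which chart version introduced providers.gke.cos, or whether any other setting needs to change alongside it.

```yaml
# Sketch: top-level COS provider option combined with the reporter's
# runtime security settings. All keys are taken from this thread.
providers:
  gke:
    cos: true          # configure the agent deployment for Container-Optimized OS
datadog:
  securityAgent:
    compliance:
      enabled: true
    runtime:
      enabled: true    # the feature that loads the security_runtime module in system-probe
```

Note that providers is a top-level key, a sibling of datadog, agents, and clusterAgent, not a child of any of them.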

DataDog / helm-charts

system-probe container crashing #458