DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

System Probe failing to start on RHEL #4745

Open devinmatte opened 4 years ago

devinmatte commented 4 years ago

Output of the info page (if this is a bug)

===============
Agent (v7.16.1)
===============

  Status date: 2020-01-17 16:30:42.048745 EST
  Agent start: 2020-01-17 15:02:08.436442 EST
  Pid: 62125
  Go Version: go1.12.9
  Python Version: 3.7.4
  Build arch: amd64
  Check Runners: 4

  Log File: /var/log/datadog/agent.log
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: -10.122ms
    System UTC time: 2020-01-17 16:30:42.048745 EST

  Host Info
  =========
    bootTime: 2020-01-17 13:24:26.000000 EST
    kernelVersion: 3.10.0-1062.7.1.el7.x86_64
    os: linux
    platform: redhat
    platformFamily: rhel
    platformVersion: 7.7
    procs: 232
    uptime: 1h37m46s
    virtualizationRole: guest
    virtualizationSystem: kvm

  Hostnames
  =========
    hostname: ~~REDACTED~~
    socket-fqdn: ~~REDACTED~~
    socket-hostname: ~~REDACTED~~
    hostname provider: configuration

  Metadata
  ========
    hostname_source: configuration

Describe what happened: datadog-agent-sysprobe enters a failed state with this in the logs:

 | INFO | (cmd/system-probe/probe.go:50 in CreateSystemProbe) | Creating tracer for: system-probe
 | INFO | (pkg/process/config/tracer_config.go:37 in SysProbeConfigFromConfig) | system probe DNS inspection disabled by configuration
 | INFO | (pkg/ebpf/tracer.go:109 in NewTracer) | detected platform 3.10.0, switch to use kprobes from kernel version < 4.1.0
 | CRITICAL | (cmd/system-probe/main.go:122 in main) | failed to create system probe: could not load bpf module: error while loading map "maps/conn_stats": permission denied

Describe what you expected: Expected sysprobe to start running and collecting data

Steps to reproduce the issue: Attempt to start datadog-agent-sysprobe via systemctl start datadog-agent-sysprobe

Additional environment details (Operating System, Cloud provider, etc): Issue occurring on RHEL 7.x (Both 7.7 and 7.6)

kylegoch commented 4 years ago

We had something similar happen. Do you have SELinux on? If so, the DD System Probe doesn't work out of the box with SELinux. We had to write a policy that lets the System Probe service past SELinux to make it work.
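A minimal sketch for checking both points on an affected host: whether SELinux is enforcing, and whether the audit log shows denials on the bpf class. The helper names `selinux_mode` and `bpf_denials` are hypothetical and introduced only for this sketch; `getenforce` and `ausearch` are the standard SELinux and audit tools, and the sketch assumes auditd is running.

```shell
# Print the current SELinux mode, or "unknown" if the SELinux tools are absent.
selinux_mode() {
    if command -v getenforce >/dev/null 2>&1; then
        getenforce
    else
        echo "unknown"
    fi
}

# Filter audit records read from stdin down to AVC denials on the bpf class.
bpf_denials() {
    grep -E 'avc: +denied' | grep 'tclass=bpf'
}

# Usage (as root), assuming auditd is running:
#   selinux_mode
#   ausearch -m avc -ts recent --raw | bpf_denials
```

If `bpf_denials` prints nothing while SELinux is enforcing, the "permission denied" error may have a different cause.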

devinmatte commented 4 years ago

Yes, we have SELinux on. What policy was that? I also find it curious that RHEL 8 doesn't have this issue despite also having SELinux on.

adamf commented 4 years ago

I'm seeing this as well, using the stock datadog-agent image.

kylegoch commented 4 years ago

Using the audit2allow tool and some manual editing, we got this policy working for us.

module spc_bpf_allow 1.0;

require {
    type spc_t;
    class bpf {map_create map_read map_write prog_load prog_run};
}

#============= spc_t ==============
allow spc_t self:bpf { map_create map_read map_write prog_load prog_run };

Also, the system-probe container needs certain syscalls to be allowed, so make sure you are using the latest Helm chart.

Edit: Added note about syscalls and a more restrictive selinux policy.
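For readers unfamiliar with audit2allow, the general workflow can be sketched as follows. This is not necessarily the exact procedure used above: the module name `local_spc_bpf` and the function `generate_bpf_policy` are hypothetical, and the sketch assumes the bpf denials have already been recorded in the audit log.

```shell
# Hypothetical module name for this sketch; override via the environment if desired.
MODULE_NAME="${MODULE_NAME:-local_spc_bpf}"

# Build a local policy module from recorded bpf denials (run as root).
generate_bpf_policy() {
    # -m avc: only AVC records; --raw: unformatted records that audit2allow can parse.
    # audit2allow -M writes $MODULE_NAME.te and $MODULE_NAME.pp to the current directory.
    ausearch -m avc --raw | grep 'tclass=bpf' | audit2allow -M "$MODULE_NAME"
}

# Usage (as root):
#   generate_bpf_policy
#   cat "$MODULE_NAME.te"          # review, and trim by hand if it is too broad
#   semodule -i "$MODULE_NAME.pp"
```

Reviewing the generated `.te` file before installing it matters, since audit2allow allows whatever was denied, which can be broader than intended.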

brycekahle commented 3 years ago

@devinmatte @adamf Are you still seeing this issue? We do have SELinux policies available if you aren't using our Helm chart. https://docs.datadoghq.com/network_performance_monitoring/installation/?tab=agent#selinux-enabled-systems

bbensky commented 3 years ago

We've been seeing a similar issue. How would you suggest addressing this if we are using the Helm chart?

brycekahle commented 3 years ago

@bbensky I'd make sure you are using the latest version of the chart. If that doesn't fix it, please post more details so we can dig in.

bbensky commented 3 years ago

Thanks @brycekahle. I am using the newest version of the chart but still get the same error at container startup. However, one of our networking folks is working on setting up an SELinux policy to make this work.

L3n41c commented 3 years ago

I just tried to reproduce this issue on OpenShift 3.11 and I confirm that, on RHEL 7.7, the system-probe running with the spc_t SELinux type isn’t allowed to perform eBPF operations.

I managed to get it fixed with the same policy as @kylegoch.

$ cat >allow_spc_bpf.te <<EOF
module allow_spc_bpf 1.0;

require {
    type spc_t;
    class bpf { map_create map_read map_write prog_load prog_run };
}

#============= spc_t ==============
allow spc_t self:bpf { map_create map_read map_write prog_load prog_run };
EOF
$ checkmodule -M -m -o allow_spc_bpf.mod allow_spc_bpf.te
$ semodule_package -o allow_spc_bpf.pp -m allow_spc_bpf.mod
$ semodule -i allow_spc_bpf.pp

After this, the system-probe container could start properly on RHEL 7.7.
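To confirm that the module is actually installed before restarting the system-probe, a check along these lines may help. The helper `module_in_list` is a hypothetical name for this sketch; it takes the first column of each line because some `semodule` versions print a version after the module name and others print the name alone.

```shell
# Succeeds if the module named in $1 appears in a `semodule -l` listing on stdin.
# Only the first whitespace-separated field of each line is compared.
module_in_list() {
    awk '{ print $1 }' | grep -qxF -- "$1"
}

# Usage (as root):
#   semodule -l | module_in_list allow_spc_bpf && echo "allow_spc_bpf is loaded"
```

The module can later be removed with `semodule -r allow_spc_bpf` if it is no longer needed.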

tuan-nguyen-ts commented 1 year ago

Disabling SELinux works for me.