DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

[BUG] Agent can't detect Podman rootless container cgroup v1 path #20225

Open JohnCgp opened 1 year ago

JohnCgp commented 1 year ago

Agent Environment

Agent is run as a rootless podman container on RHEL 8 behind a corporate proxy:

$ DD_HOSTNAME=$HOSTNAME \
  DD_API_KEY=$DD_API_KEY \
  HTTP_PROXY=$PROXY \
  HTTPS_PROXY=$PROXY \
  NO_PROXY=$NO_PROXY \
  podman run --rm --name dd-agent \
    --cgroupns host --pid host \
    -v $USER_BOLT_STATE_PATH:/var/lib/containers/storage/libpod/bolt_state.db:ro \
    -v ${HOME}/datadog-agent/ntp.d:/etc/datadog-agent/conf.d/ntp.d:rw \
    -v /proc/:/host/proc/:ro \
    -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
    -v ${HOME}/datadog-agent/run:/opt/datadog-agent/run:rw \
    -e DD_API_KEY \
    -e DD_LOGS_ENABLED=true \
    -e DD_HOSTNAME \
    -e DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true \
    -e DD_LOGS_CONFIG_USE_HTTP=true \
    -e DD_LOG_LEVEL=debug \
    gcr.io/datadoghq/agent:latest
$ podman exec -it dd-agent agent version
Agent 7.48.0 - Commit: 8e04ee8 - Serialization version: v5.0.93 - Go version: go1.20.8

Describe what happened:

Container logs are not being sent to Datadog; the container_collect_all integration's status is stuck at "Pending" (see the end of this section).

The likely cause is that the agent cannot collect container stats because it does not correctly locate the container cgroups.

When running the agent with debug logging, an entry similar to this one is emitted for each container:

2023-10-18 09:29:57 UTC | CORE | DEBUG | (pkg/collector/corechecks/containers/generic/processor.go:103 in Run) | Container stats for: &{ { container 274f829a147618e01276b4dad9da3b0257f4a54aac8193e409394533e327469a } { dd-agent  map[io.podman.annotations.autoremove:TRUE] map[baseimage.name:ubuntu: baseimage.os:ubuntu  maintainer:Datadog <package@datadoghq.com> org.opencontainers.image.ref.name:ubuntu org.opencontainers.image.source:https://github.com/DataDog/datadog-agent org.opencontainers.image.version:23.04] } map[DOCKER_DD_AGENT:true] 274f829a1476 { 5273d32e45bea458bfc2923849f66bd8dd3582c2639619bd66a2e67b6e06f22b gcr.io/datadoghq/agent:latest gcr.io/datadoghq/agent gcr.io agent latest } map[] 3902885 [] podman { true running  2023-10-18 10:29:51.461165613 +0100 +0100 2023-10-18 10:29:51.461165613 +0100 +0100 0001-01-01 00:00:00 +0000 UTC <nil> } [] <nil> <nil> } not available through collector "system", err: containerID not found

Trawling through the code shows that collector.GetContainerStats fails:

https://github.com/DataDog/datadog-agent/blob/8e04ee8a7de780174deeb963b61f4ca6fc129462/pkg/process/util/containers/containers.go#L153-L158

Because the call to c.getCgroup on line 105 fails:

https://github.com/DataDog/datadog-agent/blob/8e04ee8a7de780174deeb963b61f4ca6fc129462/pkg/util/containers/metrics/system/collector_linux.go#L104-L108

because getCgroup returns nil:

https://github.com/DataDog/datadog-agent/blob/8e04ee8a7de780174deeb963b61f4ca6fc129462/pkg/util/containers/metrics/system/collector_linux.go#L164-L167

Because the cgroup for the given container ID isn't in the collection:

https://github.com/DataDog/datadog-agent/blob/8e04ee8a7de780174deeb963b61f4ca6fc129462/pkg/util/cgroups/reader.go#L180

It seems that the parseCgroups function fails to extract the container's cgroup for some reason.

Running find for the container's ID does locate its cgroups. Since the agent itself runs in a container, searching /sys/fs/cgroup returns no results, as expected; the cgroups are under the host mount at /host/sys/fs/cgroup.

$ podman exec -it dd-agent bash
root@274f829a1476:/# find /sys/fs/cgroup | grep 274f829a1476
root@274f829a1476:/# find /host/sys/fs/cgroup | grep 274f829a1476
/host/sys/fs/cgroup/systemd/user.slice/user-7780.slice/user@7780.service/user.slice/podman-3902834.scope/274f829a147618e01276b4dad9da3b0257f4a54aac8193e409394533e327469a
/host/sys/fs/cgroup/systemd/user.slice/user-7780.slice/user@7780.service/user.slice/podman-3902834.scope/274f829a147618e01276b4dad9da3b0257f4a54aac8193e409394533e327469a/cgroup.procs
/host/sys/fs/cgroup/systemd/user.slice/user-7780.slice/user@7780.service/user.slice/podman-3902834.scope/274f829a147618e01276b4dad9da3b0257f4a54aac8193e409394533e327469a/tasks
/host/sys/fs/cgroup/systemd/user.slice/user-7780.slice/user@7780.service/user.slice/podman-3902834.scope/274f829a147618e01276b4dad9da3b0257f4a54aac8193e409394533e327469a/notify_on_release
/host/sys/fs/cgroup/systemd/user.slice/user-7780.slice/user@7780.service/user.slice/podman-3902834.scope/274f829a147618e01276b4dad9da3b0257f4a54aac8193e409394533e327469a/cgroup.clone_children
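Judging from the find output above, the container ID appears only under the named systemd hierarchy, never under a resource controller such as memory or cpu. If the agent's cgroup v1 reader anchors its scan on a resource-controller hierarchy (an assumption on my part, not confirmed from the code), it would never encounter this container. A minimal sketch of that mismatch, using the path observed above:

```shell
#!/bin/sh
# The only v1 hierarchy containing the container's cgroup (per the find
# output above) is the named "systemd" one. A reader that looks for the
# container under resource-controller hierarchies finds nothing.
cid=274f829a147618e01276b4dad9da3b0257f4a54aac8193e409394533e327469a
observed="/host/sys/fs/cgroup/systemd/user.slice/user-7780.slice/user@7780.service/user.slice/podman-3902834.scope/$cid"

for controller in memory cpu,cpuacct blkio pids; do
    case "$observed" in
        */"$controller"/*) echo "found under $controller" ;;
    esac
done
echo "only hierarchy containing the container: systemd"
```

On this host the loop prints nothing, because no controller hierarchy contains the container's cgroup.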

The system is using cgroup v1. Note that /sys/fs/cgroup is mounted as tmpfs and /sys/fs/cgroup/systemd as a cgroup filesystem:

root@274f829a1476:/# cat /proc/mounts
overlay / overlay rw,nodev,relatime,lowerdir=/nfa/ps223062/podman/storage/graphroot/overlay/l/O4FXG4PCTK5XXMVWCQ3FIWI4JT,upperdir=/nfa/ps223062/podman/storage/graphroot/overlay/faef8b6be96f1a02f08533facaa53b95ee96cd35519d0a053c268ff69353bd33/diff,workdir=/nfa/ps223062/podman/storage/graphroot/overlay/faef8b6be96f1a02f08533facaa53b95ee96cd35519d0a053c268ff69353bd33/work 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev tmpfs rw,nosuid,noexec,size=65536k,mode=755,uid=7780,gid=100 0 0
sysfs /sys sysfs ro,nosuid,nodev,noexec,relatime 0 0
/dev/mapper/appvg-nfa_lv /run/.containerenv xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
proc /host/proc proc ro,nosuid,nodev,noexec,relatime 0 0
systemd-1 /host/proc/sys/fs/binfmt_misc autofs rw,relatime,fd=40,pgrp=0,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=23933 0 0
binfmt_misc /host/proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=64000k,uid=7780,gid=100 0 0
/dev/mapper/appvg-nfa_lv /etc/hosts xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/mapper/appvg-nfa_lv /etc/resolv.conf xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=624292,mode=620,ptmxmode=666 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
/dev/mapper/appvg-nfa_lv /etc/hostname xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/mapper/appvg-nfa_lv /var/run/s6 xfs rw,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/mapper/appvg-nfa_lv /var/log/datadog xfs rw,nosuid,nodev,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
tmpfs /sys/fs/cgroup tmpfs rw,nosuid,nodev,noexec,relatime,mode=755,uid=7780,gid=100 0 0
cgroup /sys/fs/cgroup/systemd cgroup ro,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/perf_event cgroup ro,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup ro,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/blkio cgroup ro,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup ro,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/memory cgroup ro,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup ro,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/devices cgroup ro,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup ro,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/cpuset cgroup ro,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/rdma cgroup ro,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/pids cgroup ro,nosuid,nodev,noexec,relatime,pids 0 0
/dev/mapper/rootvg-home_lv /opt/datadog-agent/run xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/mapper/rootvg-home_lv /etc/datadog-agent/conf.d/ntp.d xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
tmpfs /host/sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /host/sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /host/sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /host/sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /host/sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /host/sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /host/sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /host/sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /host/sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /host/sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /host/sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /host/sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /host/sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
/dev/mapper/appvg-nfa_lv /var/lib/containers/storage/libpod/bolt_state.db xfs ro,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
devtmpfs /dev/null devtmpfs rw,nosuid,size=131783080k,nr_inodes=32945770,mode=755 0 0
devtmpfs /dev/random devtmpfs rw,nosuid,size=131783080k,nr_inodes=32945770,mode=755 0 0
devtmpfs /dev/full devtmpfs rw,nosuid,size=131783080k,nr_inodes=32945770,mode=755 0 0
devtmpfs /dev/tty devtmpfs rw,nosuid,size=131783080k,nr_inodes=32945770,mode=755 0 0
devtmpfs /dev/zero devtmpfs rw,nosuid,size=131783080k,nr_inodes=32945770,mode=755 0 0
devtmpfs /dev/urandom devtmpfs rw,nosuid,size=131783080k,nr_inodes=32945770,mode=755 0 0
proc /proc/bus proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/fs proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/irq proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/sys proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/sysrq-trigger proc ro,nosuid,nodev,noexec,relatime 0 0
tmpfs /proc/acpi tmpfs ro,relatime,uid=7780,gid=100 0 0
devtmpfs /proc/kcore devtmpfs rw,nosuid,size=131783080k,nr_inodes=32945770,mode=755 0 0
devtmpfs /proc/keys devtmpfs rw,nosuid,size=131783080k,nr_inodes=32945770,mode=755 0 0
devtmpfs /proc/timer_list devtmpfs rw,nosuid,size=131783080k,nr_inodes=32945770,mode=755 0 0
devtmpfs /proc/sched_debug devtmpfs rw,nosuid,size=131783080k,nr_inodes=32945770,mode=755 0 0
tmpfs /proc/scsi tmpfs ro,relatime,uid=7780,gid=100 0 0
tmpfs /sys/firmware tmpfs ro,relatime,uid=7780,gid=100 0 0
tmpfs /sys/dev/block tmpfs ro,relatime,uid=7780,gid=100 0 0
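For reference, the cgroup version can be classified from the filesystem type in the third field of /proc/mounts: "cgroup" entries indicate v1 per-controller hierarchies, while a single "cgroup2" entry would indicate the unified v2 hierarchy. A small self-contained sketch, using mount lines copied from the output above:

```shell
#!/bin/sh
# Sample lines taken verbatim from the /proc/mounts output above.
mounts='tmpfs /host/sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /host/sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /host/sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0'

# Field 3 is the filesystem type: "cgroup2" means the unified v2 hierarchy,
# "cgroup" means v1 per-controller hierarchies.
if echo "$mounts" | awk '$3 == "cgroup2" { found=1 } END { exit !found }'; then
    version="v2 (unified)"
elif echo "$mounts" | awk '$3 == "cgroup" { found=1 } END { exit !found }'; then
    version="v1 (per-controller hierarchies)"
fi
echo "cgroup $version"
```

Against a live system, the same check works with `mounts=$(cat /proc/mounts)`.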

container_collect_all integration status is pending:

$ podman exec -it dd-agent agent status
[...]

Metadata
  ========
    agent_version: 7.48.0
    config_apm_dd_url:
    config_dd_url:
    config_logs_dd_url:
    config_logs_socks5_proxy_address:
    config_no_proxy: [localhost ***********]
    config_process_dd_url:
    config_proxy_http: http://user:********@proxy:port
    config_proxy_https: http://user:********@proxy:port
    config_site:
    feature_apm_enabled: true
    feature_cspm_enabled: false
    feature_cws_enabled: false
    feature_cws_network_enabled: true
    feature_cws_remote_config_enabled: false
    feature_cws_security_profiles_enabled: false
    feature_dynamic_instrumentation_enabled: false
    feature_fips_enabled: false
    feature_imdsv2_enabled: false
    feature_logs_enabled: true
    feature_networks_enabled: false
    feature_networks_http_enabled: false
    feature_networks_https_enabled: false
    feature_oom_kill_enabled: false
    feature_otlp_enabled: false
    feature_process_enabled: false
    feature_process_language_detection_enabled: false
    feature_processes_container_enabled: true
    feature_remote_configuration_enabled: true
    feature_tcp_queue_length_enabled: false
    feature_usm_enabled: false
    feature_usm_go_tls_enabled: false
    feature_usm_http2_enabled: false
    feature_usm_http_by_status_code_enabled: false
    feature_usm_istio_enabled: false
    feature_usm_java_tls_enabled: false
    feature_usm_kafka_enabled: false
    flavor: agent
    hostname_source: configuration
    install_method_installer_version: docker
    install_method_tool: docker
    install_method_tool_version: docker
    logs_transport: HTTP
    system_probe_core_enabled: true
    system_probe_gateway_lookup_enabled: true
    system_probe_kernel_headers_download_enabled: false
    system_probe_max_connections_per_message: 600
    system_probe_prebuilt_fallback_enabled: true
    system_probe_protocol_classification_enabled: true
    system_probe_root_namespace_enabled: true
    system_probe_runtime_compilation_enabled: false
    system_probe_telemetry_enabled: true
    system_probe_track_tcp_4_connections: true
    system_probe_track_tcp_6_connections: true
    system_probe_track_udp_4_connections: true
    system_probe_track_udp_6_connections: true

[...]

=========
Collector
=========

  Running Checks
  ==============

    container
    ---------
      Instance ID: container [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/container.d/conf.yaml.default
      Total Runs: 140
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 6ms
      Last Execution Date : 2023-10-18 10:04:42 UTC (1697623482000)
      Last Successful Execution Date : 2023-10-18 10:04:42 UTC (1697623482000)

==========
Logs Agent
==========
    Reliable: Sending compressed logs in HTTPS to agent-http-intake.logs.datadoghq.com on port 443
    BytesSent: 0
    EncodedBytesSent: 2
    LogsProcessed: 0
    LogsSent: 0

  ============
  Integrations
  ============

  container_collect_all
  ---------------------
    - Type: podman
      Status: Pending
      Bytes Read: 0
      Pipeline Latency:
        Average Latency (ms): 0
        24h Average Latency (ms): 0
        Peak Latency (ms): 0
        24h Peak Latency (ms): 0
    - Type: podman
      Status: Pending
      Bytes Read: 0
      Pipeline Latency:
        Average Latency (ms): 0
        24h Average Latency (ms): 0
        Peak Latency (ms): 0
        24h Peak Latency (ms): 0

Describe what you expected:

Logs from all the containers to be sent to Datadog (container_collect_all status to be "OK" instead of "Pending").

Steps to reproduce the issue:

Run the agent as a container as shown in the first section. I tried both with and without --cgroupns host --pid host.
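To confirm the reproduction, a quick check like the following (a hedged sketch; the cgroup root and container ID defaults are examples from this report and should be adjusted per host) reports which v1 hierarchies contain a given container ID:

```shell
#!/bin/sh
# List the names of the cgroup v1 hierarchies under $2 that contain a
# directory matching container ID $1.
cgroup_hierarchies_for() {
    cid="$1"; root="$2"
    for h in "$root"/*/; do
        [ -d "$h" ] || continue
        if find "$h" -type d -name "*${cid}*" 2>/dev/null | grep -q .; then
            basename "$h"
        fi
    done
}

# Example on an affected host:
#   cgroup_hierarchies_for 274f829a1476 /host/sys/fs/cgroup
# On the system described above this prints only: systemd
```

A healthy rootful setup would be expected to list the resource-controller hierarchies (memory, cpu,cpuacct, ...) as well.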

Additional environment details (Operating System, Cloud provider, etc): RHEL 8

JohnCgp commented 1 year ago

For anyone else who stumbles upon this: I spoke with support and it's not a bug. Container log collection is not supported for rootless Podman containers as of Agent v7.48.0.

SArnab commented 8 months ago

Seeing the same issue for general container metrics collection on RHEL8 with Podman - any update or solution here?

Agent Version: Agent 7.52.0 - Commit: 4a6318a - Serialization version: v5.0.104 - Go version: go1.21.8

Error Log: 2024-03-27 22:31:16 UTC | PROCESS | DEBUG | (pkg/process/util/containers/containers.go:159 in GetContainers) | << redacted >> Runtime:podman RuntimeFlavor: State:{Running:true Status:running Health: CreatedAt:2024-03-27 17:46:57.046830482 -0400 -0400 StartedAt:2024-03-27 17:46:57.046830482 -0400 -0400 FinishedAt:0001-01-01 00:00:00 +0000 UTC ExitCode:<nil>} CollectorTags:[] Owner:<nil> SecurityContext:<nil> Resources:{CPURequest:<nil> MemoryRequest:<nil>}} not available, err: containerID not found