DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

[BUG] Journalctl error pulling from collector "podman" on default install #13316

Open parkerderek opened 2 years ago

parkerderek commented 2 years ago

Agent Environment

Output from datadog-agent status: Agent (v7.38.2) Go Version: go1.17.11 Python Version: 3.8.13 Build arch: amd64 Agent flavor: agent Check Runners: 4 Log Level: info

Describe what happened: When running systemctl status datadog-agent, CORE | WARN messages appear, such as: (pkg/workloadmeta/store.go:359 in func1) | error pulling from collector "podman": error opening database /var/lib/containers/storage/libpod/bolt_state.db

Describe what you expected: Error messages should not appear, as this is a default install of the Datadog Agent with no podman configured or enabled, and it is not running in a container.

Steps to reproduce the issue: Install the Agent with the default install script: DD_AGENT_MAJOR_VERSION=7 DD_API_KEY="KEY" DD_SITE="datadoghq.com" bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"

Additional environment details (Operating System, Cloud provider, etc): kernelArch: x86_64 kernelVersion: 4.18.0-372.19.1.el8_6.x86_64 os: linux platform: redhat platformFamily: rhel platformVersion: 8.6 virtualizationRole: host virtualizationSystem: kvm

gudvardur commented 2 years ago

I have the exact same problem, but I installed using Ansible with the following vars:

---
datadog_additional_groups: 
  - "systemd-journal"
datadog_config:
  dogstatsd_non_local_traffic: true
  logs_enabled: true
  logs_config:
    container_collect_all: false
    use_http: true
    auto_multi_line_detection: true
  process_config:
    enabled: 'true'
datadog_disable_untracked_checks: true
datadog_disable_default_checks: true
datadog_additional_checks:
  - cpu
  - disk
  - file_handle
  - io
  - load
  - memory
  - network
  - ntp
  - uptime
datadog_checks:
  disk:
    init_config:
    instances:
      - use_mount: false
  journald:
    logs:
      - type: journald
        container_mode: true

Agent Environment Agent (v7.38.2) Go Version: go1.17.11 Python Version: 3.8.13 Build arch: amd64 Agent flavor: agent Check Runners: 4 Log Level: info

agent[1140]: 2022-09-05 09:56:39 GMT | CORE | WARN | (pkg/workloadmeta/store.go:359 in func1) | error pulling from collector "podman": error opening database /var/lib/containers/storage/libpod/bolt_state.db
process-agent[1141]: 2022-09-05 09:56:40 GMT | PROCESS | WARN | (pkg/workloadmeta/store.go:359 in func1) | error pulling from collector "podman": error opening database /var/lib/containers/storage/libpod/bolt_state.db

Additional environment details (Operating System, Cloud provider, etc): kernelArch: x86_64 kernelVersion: 4.18.0-372.19.1.el8_6.x86_64 os: linux platform: redhat platformFamily: rhel platformVersion: 8.6 virtualizationRole: host virtualizationSystem: vmware

yeleyj commented 2 years ago

Same here. I had a working install but wanted to update. One of my servers was fine, the other had this issue. The one without issues has the same agent version, Linux version, etc.

DD_AGENT_MAJOR_VERSION=7 DD_API_KEY= DD_SITE="datadoghq.com" bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"

==> /var/log/datadog/process-agent.log <==
2022-09-07 23:40:19 UTC | PROCESS | WARN | (pkg/workloadmeta/store.go:359 in func1) | error pulling from collector "podman": error opening database /var/lib/containers/storage/libpod/bolt_state.db

Agent 7.38.2 - Commit: ba442fd - Serialization version: v5.0.23 - Go version: go1.17.11

CentOS Linux release 8.3.2011 x86_64, running on GCP and not in Docker or any other kind of container.


In the config, following the documented steps to exclude container detection does not seem to work: container_exclude: name:.* and container_exclude: "name:.*" both make no difference.
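
For reference, this is roughly the documented exclusion syntax I was attempting in /etc/datadog-agent/datadog.yaml (a sketch; it made no difference here):

# /etc/datadog-agent/datadog.yaml -- attempted container exclusion
container_exclude: "name:.*"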

Neither upgrading nor removing podman changes anything. The file bolt_state.db does not even exist at this point.

Per https://docs.datadoghq.com/integrations/container/, adding a blank config and restarting changed nothing; the agent is still trying to open the file.

Removing the container config file shown in the status command (/etc/datadog-agent/conf.d/container.d/conf.yaml.default) and restarting also changes nothing.

gudvardur commented 2 years ago

After adding

autoconfig_exclude_features:
  - podman

to datadog.yaml (under the Ansible datadog_config var), so the config looks like this...

---
datadog_additional_groups: 
  - "systemd-journal"
datadog_config:
  autoconfig_exclude_features:
    - podman
  dogstatsd_non_local_traffic: true
  logs_enabled: true
  logs_config:
    container_collect_all: false
    use_http: true
    auto_multi_line_detection: true
  process_config:
    enabled: 'true'
datadog_disable_untracked_checks: true
datadog_disable_default_checks: true
datadog_additional_checks:
  - cpu
  - disk
  - file_handle
  - io
  - load
  - memory
  - network
  - ntp
  - uptime
datadog_checks:
  disk:
    init_config:
    instances:
      - use_mount: false
  journald:
    logs:
      - type: journald
        container_mode: true

Now the error has stopped...
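
For anyone not using Ansible, the equivalent change goes straight into /etc/datadog-agent/datadog.yaml; a minimal sketch (the exclude key is the only addition; restart the agent afterwards with systemctl restart datadog-agent):

# /etc/datadog-agent/datadog.yaml
autoconfig_exclude_features:
  - podman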

yeleyj commented 2 years ago

autoconfig_exclude_features:
  - podman

This prevents the error for me as well. Thanks so much!

rsumner commented 1 year ago

Disabling the podman feature isn't an option for folks who need the "container" integration to function properly on a host that runs containers. The root cause of this bug is how the agent opens the podman BoltDB state file. Looking at the dd-agent source code in pkg/util/podman/db_client.go, you can see the client opens the file in read/write mode:

func (client *DBClient) getDBCon() (*bolt.DB, error) {
        db, err := bolt.Open(client.DBPath, 0600, nil)
        if err != nil {
                return nil, fmt.Errorf("error opening database %s", client.DBPath)
        }

        return db, nil
}

I was forced to set the permissions on /var/lib/containers/storage/libpod/bolt_state.db to 666 to give the dd-agent read/write access to the file. This not only stops the error from being reported in the logs, it also allows the agent to collect podman-managed containers running on the host.
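
In concrete terms, the workaround was something like:

chmod 666 /var/lib/containers/storage/libpod/bolt_state.db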

I consider this to be a fairly substantial security risk and a bug, and it should be fixed. Alternatively, at least the "container" integration documentation should be updated to mention the read/write requirements to support podman.
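
For what it's worth, opening the database read-only would avoid needing write access at all; a rough sketch against the bbolt API (untested, and it assumes the podman collector never writes to this DB):

// Sketch only: open the podman BoltDB state read-only so the agent needs
// just read permission on bolt_state.db. Assumes go.etcd.io/bbolt (imported
// as bolt) plus "fmt" and "time".
func (client *DBClient) getDBCon() (*bolt.DB, error) {
        opts := &bolt.Options{
                ReadOnly: true,            // shared lock; no write access to the file required
                Timeout:  1 * time.Second, // don't block forever if podman holds the lock
        }
        db, err := bolt.Open(client.DBPath, 0o600, opts)
        if err != nil {
                return nil, fmt.Errorf("error opening database %s: %w", client.DBPath, err)
        }
        return db, nil
}

The agent user would still need read permission on the file (an ACL would do), just not write.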

VitaliyKulikov commented 1 year ago

Setting the file to 666 is not helping me:

Dec 20 10:32:10 xxx agent[3907220]: 2022-12-20 10:32:10 UTC | CORE | WARN | (pkg/workloadmeta/store.go:362 in func1) | error pulling from collector "podman": error opening database /var/lib/containers/storage/libpod/bolt_state.db

$ sudo ls -l /var/lib/containers/storage/libpod/bolt_state.db
-rw-rw-rw- 1 root root 131072 Dec 19 23:55 /var/lib/containers/storage/libpod/bolt_state.db

rsumner commented 1 year ago

@VitaliyKulikov If you have SELinux enabled, that could be causing you problems. Check the syslog/audit logs for hints that SELinux is blocking the read/write system calls.
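
For example, something like this should surface recent AVC denials touching the podman DB (sketch; requires auditd):

sudo ausearch -m avc -ts recent | grep -i bolt_state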

VitaliyKulikov commented 1 year ago

@rsumner Thanks for the tip. It is an Ubuntu 22.04.1 LTS box, so AppArmor is there, and I can't see any rules that would cause such a denial. Also, I am using Datadog Agent v7.41.0.
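
For completeness, any AppArmor denials would show up in the kernel log, roughly like this (sketch):

sudo journalctl -k | grep -i "apparmor.*denied"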

jcorriher22 commented 1 year ago

Thank you so much! I have been fighting this on Red Hat 8 servers; it seemed to start with 8.6 and up.

priyarajeshh commented 1 year ago

We recently got this resolved through Request #1108661. Please refer to the ticket for more info.

The following two commands, along with container.d/conf.yaml, helped us fix the error and let Datadog collect container metrics:

setfacl -R -m u:dd-agent:rx /var/lib/containers/ 
setfacl -R -m u:dd-agent:rwx /var/lib/containers/storage/libpod/bolt_state.db
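
You can verify the resulting ACLs afterwards with getfacl, e.g.:

getfacl /var/lib/containers/storage/libpod/bolt_state.db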

Hope that helps.

rsumner commented 1 year ago

@priyarajeshh Your support tickets are not publicly visible, so no one except you and Datadog can refer to the ticket for details. Can you provide details on what changes you made to container.d/conf.yaml? The setfacl option is definitely better than doing a basic chmod -- thanks for relaying that info.