DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

[BUG] Datadog agent causing RPM database to get corrupted #24171

Open rodehoed opened 5 months ago

rodehoed commented 5 months ago

Description: OK, I'm not 100% confident that this is a Datadog issue, but it's the only clue I have right now. Since March 22nd we have seen 10 servers getting their RPM DB corrupted. The facts:

Fixing the DB corruption will not prevent it from happening again. We have servers which have had this corruption multiple times now.

Agent Environment The agent is running 7.52.0-1 on RHEL 8.9

Describe what happened: The RPM database gets corrupted, and running the rpm or dnf command shows:

error: rpmdb: BDB0113 Thread/process 2421732/140117948610432 failed: BDB1507 Thread died in Berkeley DB library
error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db5 -  (-30973)
error: cannot open Packages database in /var/lib/rpm
Error: Error: rpmdb open failed
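To spot this state across several hosts, the error output above can be matched mechanically. The following is only a minimal sketch: the function names are hypothetical, and treating `DB_RUNRECOVERY` or `rpmdb open failed` as the corruption signature is an assumption based on the messages quoted in this report, not on any Datadog or rpm tooling.

```python
import re

# Markers copied from the error output in this report. Treating either one
# as "needs recovery" is an assumption of this sketch.
CORRUPTION_MARKERS = ("DB_RUNRECOVERY", "rpmdb open failed")


def rpmdb_looks_corrupted(stderr_text: str) -> bool:
    """Return True if rpm/dnf error output contains a known corruption marker."""
    return any(marker in stderr_text for marker in CORRUPTION_MARKERS)


def failed_bdb_thread(stderr_text: str):
    """Extract the (process, thread) number pair from a BDB0113 line, or None."""
    match = re.search(r"BDB0113 Thread/process (\d+)/(\d+) failed", stderr_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = (
    "error: rpmdb: BDB0113 Thread/process 2421732/140117948610432 failed: "
    "BDB1507 Thread died in Berkeley DB library\n"
    "error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: "
    "Fatal error, run database recovery\n"
)
corrupted = rpmdb_looks_corrupted(sample)
failed = failed_bdb_thread(sample)
```

The first number in the BDB0113 line should be the ID of the process that died while it had the database open, so logging it could help when correlating the failure with whatever was reading the RPM DB at the time.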

Describe what you expected: Database not getting corrupted

Steps to reproduce the issue: Upgrading is enough, but I don't know what exactly triggers it.

Additional environment details (Operating System, Cloud provider, etc):

paulcacheux commented 5 months ago

Hello! Thanks for reporting this issue. Would you mind sharing:

Thanks a lot in advance

rodehoed commented 5 months ago

Hi @paulcacheux,

Sure np.

The config comes from datadog-agent configcheck:


=== container_image check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/container_image.d/conf.yaml.default
Config for instance ID: container_image:2ac6bde1700038e4
{}
~
Auto-discovery IDs:
* _container_image
===

=== container_lifecycle check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/container_lifecycle.d/conf.yaml.default
Config for instance ID: container_lifecycle:b628cf9ded5c9324
{}
~
Auto-discovery IDs:
* _container_lifecycle
===

=== cpu check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
Config for instance ID: cpu:e331d61ed1323219
{}
~
===

=== disk check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/disk.d/conf.yaml.default
Config for instance ID: disk:67cc0574430a16ba
use_mount: false
~
===

=== file_handle check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/file_handle.d/conf.yaml.default
Config for instance ID: file_handle:381b8b6ca58d37b0
{}
~
===

=== io check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/io.d/conf.yaml.default
Config for instance ID: io:541b60d158de04a7
{}
~
===

=== load check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/load.d/conf.yaml.default
Config for instance ID: load:bf7cea93fb3aa780
{}
~
===

=== memory check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/memory.d/conf.yaml.default
Config for instance ID: memory:3f1f6288b95b9979
{}
~
===

=== mysql check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/mysql.d/conf.yaml
Config for instance ID: mysql:75cd0f7a0853706d
options:
  disable_innodb_metrics: false
  extra_innodb_metrics: true
  extra_performance_metrics: true
  extra_status_metrics: true
  galera_cluster: true
  replication: 0
  schema_size_metrics: false
pass: "********"
port: 3306
server: 127.0.0.1
user: datadog
~
===

=== network check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/network.d/conf.yaml.default
Config for instance ID: network:4b0649b7e11f0772
{}
~
===

=== nginx check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/nginx.d/conf.yaml
Config for instance ID: nginx:3833f3b9ceb3e496
nginx_status_url: http://not-my-host/nginx-status
~
Log Config:
logs:
- path: bogus/access.log
  service: staging.bogus.com
  source: nginx
  sourcecategory: http_web_access
  type: file
===

=== ntp check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/ntp.d/conf.yaml.default
Config for instance ID: ntp:3c427a42a70bbf8
{}
~
===

=== php_fpm check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/php_fpm.d/conf.yaml
Config for instance ID: php_fpm:5726203bab636eaa
http_host: bogus-host
ping_reply: pong
ping_url: http://127.0.0.1/ping
status_url: http://127.0.0.1/fpmstatus
use_fastcgi: false
~
===

=== telemetry check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/telemetry.d/conf.yaml.default
Config for instance ID: telemetry:4d459fc318a47aa4
{}
~
===

=== uptime check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/uptime.d/conf.yaml.default
Config for instance ID: uptime:c72f390abdefdf1a
{}
~
===

paulcacheux commented 5 months ago

Could you share the following files if present:

/etc/datadog-agent/datadog.yaml
/etc/datadog-agent/system-probe.yaml
/etc/datadog-agent/security-agent.yaml

Thanks a lot!

rodehoed commented 5 months ago

sure:

### MANAGED BY PUPPET
---
api_key: xxxxxxxxxxxxxx
dd_url: ''
site: datadoghq.eu
cmd_port: 5001
hostname_fqdn: false
collect_ec2_tags: false
collect_gce_tags: false
confd_path: "/etc/datadog-agent/conf.d"
enable_metadata_collection: true
dogstatsd_port: 8125
dogstatsd_socket: ''
dogstatsd_non_local_traffic: false
log_file: "/var/log/datadog/agent.log"
log_level: info
tags: []
apm_config:
  enabled: true
  env: none
  apm_non_local_traffic: false
process_config:
  enabled: 'true'
  scrub_args: true
  custom_sensitive_words: []
logs_enabled: true
logs_config:
  container_collect_all: false

The system-probe and security-agent configs are not active.

chouetz commented 5 months ago

Hello, the latest Agent version comes with new telemetry that reads data from rpm. To see whether it is the culprit, could you please disable it by setting

enable_signing_metadata_collection: false

in your datadog.yaml configuration and restart the Agent? Then fix the DB corruption and check whether it stops happening.
Thanks in advance
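For anyone applying this to more than a handful of hosts, here is a rough sketch of setting the option in the text of a datadog.yaml. The helper name is hypothetical and the line-based approach is an assumption of this example, chosen so that comments and key order survive. It only handles the option as an unindented top-level key, which is where the snippet above places it.

```python
import re

OPTION = "enable_signing_metadata_collection"


def disable_signing_metadata(yaml_text: str) -> str:
    """Return yaml_text with the top-level option set to false.

    Works line by line instead of parsing YAML, so header comments and key
    order are left untouched. Only an unindented key is recognised.
    """
    pattern = re.compile(rf"^{OPTION}\s*:.*$", flags=re.MULTILINE)
    replacement = f"{OPTION}: false"
    if pattern.search(yaml_text):
        # Key already present: overwrite its value in place.
        return pattern.sub(replacement, yaml_text)
    # Key absent: append it as a new top-level entry.
    separator = "" if yaml_text.endswith("\n") or not yaml_text else "\n"
    return f"{yaml_text}{separator}{replacement}\n"


config = "### MANAGED BY PUPPET\n---\nsite: datadoghq.eu\nlogs_enabled: true\n"
updated = disable_signing_metadata(config)
```

The function is idempotent, so running it twice leaves the text unchanged. If the file is generated by configuration management, as the header in the config shared above suggests, the option probably has to be set there instead, since a manual edit may be overwritten on the next run. The Agent still needs a restart afterwards.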

rodehoed commented 5 months ago

Hi All,

As of today, this config is set. I will keep you posted.

Pythyu commented 5 months ago

Hi 👋 Just a quick follow-up: do you have any updates since setting the config option? Does the DB corruption still happen? Thanks in advance

rodehoed commented 5 months ago

Hi @Pythyu

Well, no updates actually :-) I mean, we haven't seen this message anymore over the last few weeks. So one might think the problem is "fixed".

Pythyu commented 5 months ago

Thank you for all the answers 😃 Could you contact our support so we can get more information about your environment outside of GitHub? It would help us a lot to reproduce the issue and potentially test the bug fix. You can share the support ticket ID here, and we'll follow up on it.

Pythyu commented 4 months ago

Hi @rodehoed 👋 Please let us know if you got in touch with our support 😃 Thanks

rodehoed commented 4 months ago

Hi All,

Sorry for being late! I just opened a ticket with DD, ticket ID #689248.