DataDog / dd-agent

Datadog Agent Version 5
https://docs.datadoghq.com/

Agent 5.9.1 on Alpine: CPU usage at 100% constantly #2945

[Open] dennari opened this issue 7 years ago

dennari commented 7 years ago

I'm running the containerized Alpine version and for some reason after a while the CPU usage jumps to 100% and stays there.

[screenshots: CPU usage graphs, taken 2016-10-21 at 2:17 PM and 2:12 PM]
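
For anyone trying to narrow this down, a quick diagnostic sketch (the container ID matches the socket-hostname in the status output below; adjust for your own container):

$ docker stats --no-stream a3e77c5ccce1   # confirm container-level CPU usage
$ docker top a3e77c5ccce1                 # see which agent process (collector, forwarder, dogstatsd, supervisord) is hot
$ docker exec -it a3e77c5ccce1 top        # live per-process view from inside the container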
===================
Collector (v 5.9.1)
===================

  Status date: 2016-10-21 11:14:39 (8s ago)
  Pid: 26103
  Platform: Linux-4.4.19-29.55.amzn1.x86_64-x86_64-with
  Python Version: 2.7.12, 64bit
  Logs: <stderr>, /opt/datadog-agent/logs/collector.log

  Clocks
  ======

    NTP offset: -0.0058 s
    System UTC time: 2016-10-21 11:14:47.875449

  Paths
  =====

    conf.d: /opt/datadog-agent/agent/conf.d
    checks.d: /opt/datadog-agent/agent/checks.d

  Hostnames
  =========

    ec2-hostname: ip-10-0-0-95.eu-west-1.compute.internal
    local-ipv4: 10.0.0.95
    local-hostname: ip-10-0-0-95.eu-west-1.compute.internal
    socket-hostname: a3e77c5ccce1
    public-hostname: ec2-52-211-182-23.eu-west-1.compute.amazonaws.com
    hostname: ip-10-0-0-95.eu-west-1.compute.internal
    instance-id: i-44e4f952
    public-ipv4: 52.211.182.23
    socket-fqdn: a3e77c5ccce1

  Checks
  ======

    nginx
    -----
      - instance #0 [OK]
      - Collected 7 metrics, 0 events & 2 service checks

    ntp
    ---
      - Collected 0 metrics, 0 events & 1 service check

    disk
    ----
      - instance #0 [OK]
      - Collected 32 metrics, 0 events & 1 service check

    docker_daemon
    -------------
      - instance #0 [OK]
      - Collected 108 metrics, 0 events & 2 service checks

    http_check
    ----------
      - instance #0 [OK]
      - instance #1 [OK]
      - instance #2 [OK]
      - instance #3 [OK]
      - Collected 4 metrics, 0 events & 9 service checks

  Emitters
  ========

    - http_emitter [OK]

===================
Dogstatsd (v 5.9.1)
===================

  Status date: 2016-10-21 11:14:38 (9s ago)
  Pid: 16
  Platform: Linux-4.4.19-29.55.amzn1.x86_64-x86_64-with
  Python Version: 2.7.12, 64bit
  Logs: <stderr>, /opt/datadog-agent/logs/dogstatsd.log

  Flush count: 12871
  Packet Count: 0
  Packets per second: 0.0
  Metric count: 1
  Event count: 0
  Service check count: 0

===================
Forwarder (v 5.9.1)
===================

  Status date: 2016-10-21 11:14:43 (4s ago)
  Pid: 17
  Platform: Linux-4.4.19-29.55.amzn1.x86_64-x86_64-with
  Python Version: 2.7.12, 64bit
  Logs: <stderr>, /opt/datadog-agent/logs/forwarder.log

  Queue Size: 0 bytes
  Queue Length: 0
  Flush Count: 43388
  Transactions received: 19253
  Transactions flushed: 19253
  Transactions rejected: 0

[ec2-user@ip-10-0-0-95 villev]$ docker exec a3e77c5ccce1 /opt/datadog-agent/bin/agent info
===================
Collector (v 5.9.1)
===================

  Status date: 2016-10-21 11:14:58 (11s ago)
  Pid: 26103
  Platform: Linux-4.4.19-29.55.amzn1.x86_64-x86_64-with
  Python Version: 2.7.12, 64bit
  Logs: <stderr>, /opt/datadog-agent/logs/collector.log

  Clocks
  ======

    NTP offset: 0.0012 s
    System UTC time: 2016-10-21 11:15:10.362467

  Paths
  =====

    conf.d: /opt/datadog-agent/agent/conf.d
    checks.d: /opt/datadog-agent/agent/checks.d

  Checks
  ======

    nginx
    -----
      - instance #0 [OK]
      - Collected 7 metrics, 0 events & 2 service checks

    ntp
    ---
      - Collected 0 metrics, 0 events & 1 service check

    disk
    ----
      - instance #0 [OK]
      - Collected 32 metrics, 0 events & 1 service check

    docker_daemon
    -------------
      - instance #0 [OK]
      - Collected 108 metrics, 1 event & 2 service checks

    http_check
    ----------
      - instance #0 [OK]
      - instance #1 [OK]
      - instance #2 [OK]
      - instance #3 [OK]
      - Collected 4 metrics, 0 events & 9 service checks

  Emitters
  ========

    - http_emitter [OK]

===================
Dogstatsd (v 5.9.1)
===================

  Status date: 2016-10-21 11:15:08 (1s ago)
  Pid: 16
  Platform: Linux-4.4.19-29.55.amzn1.x86_64-x86_64-with
  Python Version: 2.7.12, 64bit
  Logs: <stderr>, /opt/datadog-agent/logs/dogstatsd.log

  Flush count: 12874
  Packet Count: 0
  Packets per second: 0.0
  Metric count: 1
  Event count: 0
  Service check count: 0

===================
Forwarder (v 5.9.1)
===================

  Status date: 2016-10-21 11:15:08 (2s ago)
  Pid: 17
  Platform: Linux-4.4.19-29.55.amzn1.x86_64-x86_64-with
  Python Version: 2.7.12, 64bit
  Logs: <stderr>, /opt/datadog-agent/logs/forwarder.log

  Queue Size: 447 bytes
  Queue Length: 1
  Flush Count: 43396
  Transactions received: 19257
  Transactions flushed: 19256
  Transactions rejected: 0

hkaj commented 7 years ago

Hi @dennari, thanks for notifying us of this. To help us investigate, could you please send us a flare from this agent while it's pegging the CPU? Instructions can be found here.

dennari commented 7 years ago

@hkaj, unfortunately the flare command doesn't run cleanly. It prints /tmp/datadog-agent-2016-10-26-12-39-10.tar.bz2 is going to be uploaded to Datadog. and then hangs; nothing happens after that.

$ docker exec 4d61f6506e73 /opt/datadog-agent/bin/agent flare
2016-10-26 12:39:10,574 | INFO | dd.collector | utils.flare(flare.py:132) | Collecting logs and configuration files:
2016-10-26 12:39:10,576 | INFO | dd.collector | utils.flare(flare.py:372) |   * /opt/datadog-agent/logs/collector.log
2016-10-26 12:39:10,576 | INFO | dd.collector | utils.flare(flare.py:372) |   * /opt/datadog-agent/logs/forwarder.log
2016-10-26 12:39:10,577 | INFO | dd.collector | utils.flare(flare.py:372) |   * /opt/datadog-agent/logs/dogstatsd.log
2016-10-26 12:39:10,577 | INFO | dd.collector | utils.flare(flare.py:372) |   * /opt/datadog-agent/logs/jmxfetch.log
2016-10-26 12:39:10,578 | INFO | dd.collector | utils.flare(flare.py:372) |   * /opt/datadog-agent/logs/supervisord.log
2016-10-26 12:39:10,579 | INFO | dd.collector | utils.flare(flare.py:383) |   * /opt/datadog-agent/agent/datadog.conf
2016-10-26 12:39:10,579 | INFO | dd.collector | utils.flare(flare.py:383) |   * /opt/datadog-agent/agent/supervisor.conf
2016-10-26 12:39:10,580 | INFO | dd.collector | utils.flare(flare.py:383) |   * /opt/datadog-agent/agent/conf.d/nginx.yaml
2016-10-26 12:39:10,581 | INFO | dd.collector | utils.flare(flare.py:383) |   * /opt/datadog-agent/agent/conf.d/docker_daemon.yaml
2016-10-26 12:39:10,581 | INFO | dd.collector | utils.flare(flare.py:383) |   * /opt/datadog-agent/agent/conf.d/http_check.yaml
2016-10-26 12:39:10,582 | INFO | dd.collector | utils.flare(flare.py:383) |   * /opt/datadog-agent/agent/conf.d/agent_metrics.yaml.default
2016-10-26 12:39:10,583 | INFO | dd.collector | utils.flare(flare.py:383) |   * /opt/datadog-agent/agent/conf.d/disk.yaml.default
2016-10-26 12:39:10,583 | INFO | dd.collector | utils.flare(flare.py:383) |   * /opt/datadog-agent/agent/conf.d/ntp.yaml.default
2016-10-26 12:39:10,583 | INFO | dd.collector | utils.flare(flare.py:141) |   * datadog-agent configcheck output
2016-10-26 12:39:10,594 | INFO | dd.collector | utils.flare(flare.py:143) |   * service discovery configcheck output
2016-10-26 12:39:10,595 | INFO | dd.collector | utils.flare(flare.py:145) |   * datadog-agent status output
2016-10-26 12:39:10,878 | INFO | dd.collector | utils.flare(flare.py:147) |   * datadog-agent info output
2016-10-26 12:39:10,897 | INFO | dd.collector | utils.flare(flare.py:150) |   * pip freeze
2016-10-26 12:39:11,152 | INFO | dd.collector | utils.flare(flare.py:154) |   * log permissions on collected files
2016-10-26 12:39:11,153 | INFO | dd.collector | utils.flare(flare.py:135) | Saving all files to /tmp/datadog-agent-2016-10-26-12-39-10.tar.bz2
/tmp/datadog-agent-2016-10-26-12-39-10.tar.bz2 is going to be uploaded to Datadog.

hkaj commented 7 years ago

@dennari the command is interactive. Try it with docker exec -it 4d61f6506e73 /opt/datadog-agent/bin/agent flare?
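
Without -it, docker exec neither keeps stdin open nor allocates a pseudo-TTY, so the flare's confirmation prompt blocks forever waiting for input, which looks like a hang. A sketch of the interactive form; if I recall the agent 5 CLI correctly, flare also accepts an existing support case number as an optional argument:

$ docker exec -it 4d61f6506e73 /opt/datadog-agent/bin/agent flare
$ docker exec -it 4d61f6506e73 /opt/datadog-agent/bin/agent flare 69876   # attach the upload to an existing case (assumed syntax)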

dennari commented 7 years ago

Ah, OK, thanks. I got it submitted now. The case is #69876.

hkaj commented 7 years ago

Thanks @dennari, I'll have a look ASAP.

gmalouf commented 7 years ago

Hi @dennari, were you able to get around this by using the non-Alpine version? I'm constantly seeing the same issue with datadog/docker-dd-agent:11.0.5141-alpine.
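
For reference, the workaround amounts to switching to the Debian-based image tag. The exact tag below is an assumption based on Datadog's tagging scheme, where dropping the -alpine suffix selects the default build:

$ docker pull datadog/docker-dd-agent:11.0.5141   # non-Alpine build of the same release (tag name assumed)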

yarivat commented 6 years ago

Hi, I'm having the same issue on a plain AWS Linux EC2 instance running some Node services managed by pm2; the supervisord process has constant 100% CPU usage.
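
A sketch for pinning down what supervisord is busy with on the host (assumes root access; strace may need installing first, e.g. yum install strace):

$ pgrep -a supervisord        # find the PID (placeholder <pid> below)
$ top -H -p <pid>             # per-thread CPU usage for that process
$ strace -c -p <pid>          # attach, let it run a few seconds, then Ctrl-C for a syscall summary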

andrebrov commented 6 years ago

Hi guys,

We have the same issue. Any news on a fix?