DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

DD_AGENT_VERSION="7.21.1" Has slow memory leak #6270

Closed by jgrobbel 4 years ago

jgrobbel commented 4 years ago

Output of the info page (if this is a bug)

See below

Describe what happened:

The agents slowly use up memory until they get killed by Kubernetes for exceeding their resource limits:

    State:          Running
      Started:      Thu, 20 Aug 2020 18:13:50 +0100
    Last State:     Terminated
      Reason:       OOMKilled <<<<<<<
      Exit Code:    0

[Screenshot attached: 2020-08-24 at 13:24:51]
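
For reference, this is roughly how I am watching it from kubectl (pod names are from our cluster and will differ elsewhere):

    # Last termination reason for one of the agent pods (shows OOMKilled)
    kubectl describe pod datadog-agent-tnz8z | grep -A 5 'Last State'

    # Current per-container memory for the same pod (needs metrics-server)
    kubectl top pod datadog-agent-tnz8z --containers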

Describe what you expected:

Memory usage should return to normal after bursty events.

Steps to reproduce the issue:

Not 100% sure; it seems to happen just from running the agent normally.

Additional environment details (Operating System, Cloud provider, etc):

Running as a Kubernetes daemonset on GKE (GCP). We are also using the JMX-enabled image.
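
The image and limits the daemonset is running with can be pulled like this (the daemonset name datadog-agent is an assumption from our manifests; the image is the 7.21.1 JMX variant, e.g. datadog/agent:7.21.1-jmx):

    # Show the image and memory requests/limits of the agent container
    kubectl get daemonset datadog-agent \
      -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}{.spec.template.spec.containers[0].resources}{"\n"}'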

root@datadog-agent-tnz8z:/# agent status
Getting the status from the agent.

===============
Agent (v7.21.1)
===============

  Status date: 2020-08-24 12:33:10.118895 UTC
  Agent start: 2020-08-12 17:28:27.899667 UTC
  Pid: 378
  Go Version: go1.13.11
  Python Version: 3.8.1
  Build arch: amd64
  Agent flavor: agent
  Check Runners: 6
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: -717µs
    System UTC time: 2020-08-24 12:33:10.118895 UTC

  Host Info
  =========
    bootTime: 2020-08-12 17:27:22.000000 UTC
    kernelArch: x86_64
    kernelVersion: 4.19.112+
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: bullseye/sid
    procs: 216
    uptime: 1m9s
    virtualizationRole: guest

  Hostnames
  =========
    host_aliases: [gke-prestaging-orch-000-orc-pre-m-cuv-f1706f3a-k82s.ne-prestaging-w80j gke-prestaging-orch-000-orc-pre-m-cuv-f1706f3a-k82s-prestaging-orch-0001]
    hostname: gke-prestaging-orch-000-orc-pre-m-cuv-f1706f3a-k82s.c.ne-prestaging-w80j.internal
    socket-fqdn: datadog-agent-tnz8z
    socket-hostname: datadog-agent-tnz8z
    host tags:
      environment:prestaging
      cluster-name:prestaging-orch-0001
      orchestra:prestaging-orch-0001
      kube_cluster_name:prestaging-orch-0001
      cluster_name:prestaging-orch-0001
      zone:europe-west1-b
      internal-hostname:gke-prestaging-orch-000-orc-pre-m-cuv-f1706f3a-k82s.c.ne-prestaging-w80j.internal
      instance-id:5432819570720741505
      project:ne-prestaging-w80j
      numeric_project_id:358219109911
      cluster-location:europe-west1
      cluster-name:prestaging-orch-0001
      cluster-uid:b2e753a5ae64fa17b6cd4f70ed9ac8ecdde08527bc7f5792142c408edcbf93d3
    hostname provider: gce
    unused hostname providers:
      configuration/environment: hostname is empty

  Metadata
  ========
    cloud_provider: GCP
    hostname_source: gce

=========
Collector
=========

  Running Checks
  ==============

    cpu
    ---
      Instance ID: cpu [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
      Total Runs: 67,939
      Metric Samples: Last Run: 6, Total: 407,628
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2020-08-24 12:33:07.000000 UTC
      Last Successful Execution Date : 2020-08-24 12:33:07.000000 UTC

    disk (2.10.1)
    -------------
      Instance ID: disk:e5dffb8bef24336f [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/disk.d/conf.yaml.default
      Total Runs: 67,938
      Metric Samples: Last Run: 218, Total: 14,818,116
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 33ms
      Last Execution Date : 2020-08-24 12:32:59.000000 UTC
      Last Successful Execution Date : 2020-08-24 12:32:59.000000 UTC

    docker
    ------
      Instance ID: docker [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/docker.d/conf.yaml.default
      Total Runs: 67,938
      Metric Samples: Last Run: 940, Total: 61,552,946
      Events: Last Run: 0, Total: 5,650
      Service Checks: Last Run: 1, Total: 67,938
      Average Execution Time : 144ms
      Last Execution Date : 2020-08-24 12:33:06.000000 UTC
      Last Successful Execution Date : 2020-08-24 12:33:06.000000 UTC

    file_handle
    -----------
      Instance ID: file_handle [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/file_handle.d/conf.yaml.default
      Total Runs: 67,938
      Metric Samples: Last Run: 5, Total: 339,690
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2020-08-24 12:32:58.000000 UTC
      Last Successful Execution Date : 2020-08-24 12:32:58.000000 UTC

    haproxy (2.10.0)
    ----------------
      Instance ID: haproxy:dcec29f86281aa1c [OK]
      Configuration Source: kubelet:docker://927ed1a5c0419ad12dc0b354854b17cddc1ceced21e49832b91e59acdb2a86fb
      Total Runs: 22,430
      Metric Samples: Last Run: 362, Total: 8,110,754
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 94ms
      Last Execution Date : 2020-08-24 12:32:57.000000 UTC
      Last Successful Execution Date : 2020-08-24 12:32:57.000000 UTC

    io
    --
      Instance ID: io [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/io.d/conf.yaml.default
      Total Runs: 67,938
      Metric Samples: Last Run: 208, Total: 14,176,696
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2020-08-24 12:33:05.000000 UTC
      Last Successful Execution Date : 2020-08-24 12:33:05.000000 UTC

    kube_dns (2.4.1)
    ----------------
      Instance ID: kube_dns:9e2acb32d30599df [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kube_dns.d/auto_conf.yaml
      Total Runs: 57,618
      Metric Samples: Last Run: 84, Total: 4,816,192
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 13ms
      Last Execution Date : 2020-08-24 12:33:06.000000 UTC
      Last Successful Execution Date : 2020-08-24 12:33:06.000000 UTC

    kubelet (4.1.1)
    ---------------
      Instance ID: kubelet:d884b5186b651429 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubelet.d/conf.yaml.default
      Total Runs: 67,938
      Metric Samples: Last Run: 1,117, Total: 73,145,128
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 4, Total: 271,752
      Average Execution Time : 545ms
      Last Execution Date : 2020-08-24 12:32:58.000000 UTC
      Last Successful Execution Date : 2020-08-24 12:32:58.000000 UTC

    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_apiserver.d/conf.yaml.default
      Total Runs: 67,938
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2020-08-24 12:33:04.000000 UTC
      Last Successful Execution Date : 2020-08-24 12:33:04.000000 UTC

    load
    ----
      Instance ID: load [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/load.d/conf.yaml.default
      Total Runs: 67,938
      Metric Samples: Last Run: 6, Total: 407,628
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2020-08-24 12:32:56.000000 UTC
      Last Successful Execution Date : 2020-08-24 12:32:56.000000 UTC

    memory
    ------
      Instance ID: memory [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/memory.d/conf.yaml.default
      Total Runs: 67,938
      Metric Samples: Last Run: 17, Total: 1,154,946
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2020-08-24 12:33:03.000000 UTC
      Last Successful Execution Date : 2020-08-24 12:33:03.000000 UTC

    neo4j (0.0.1)
    -------------
      Instance ID: neo4j:68addb1f3df3b3d5 [OK]
      Configuration Source: kubelet:docker://59afd97febdb96ff8e94afdedbb23a408a5b845dbd6bb2bb6dc6cf4dbdf32aa9
      Total Runs: 666
      Metric Samples: Last Run: 387, Total: 257,742
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 85ms
      Last Execution Date : 2020-08-24 12:33:03.000000 UTC
      Last Successful Execution Date : 2020-08-24 12:33:03.000000 UTC

      Instance ID: neo4j:9ee0fcd48bf6796a [OK]
      Configuration Source: kubelet:docker://45ce9e2b10266d031f1a5a612547531b69b1a18782a10da883ca0bc60da6f613
      Total Runs: 14
      Metric Samples: Last Run: 387, Total: 5,418
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 96ms
      Last Execution Date : 2020-08-24 12:33:08.000000 UTC
      Last Successful Execution Date : 2020-08-24 12:33:08.000000 UTC

    network (1.17.0)
    ----------------
      Instance ID: network:5c571333f400457d [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/network.d/conf.yaml.default
      Total Runs: 67,938
      Metric Samples: Last Run: 31, Total: 2,106,078
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 2ms
      Last Execution Date : 2020-08-24 12:32:55.000000 UTC
      Last Successful Execution Date : 2020-08-24 12:32:55.000000 UTC

    ntp
    ---
      Instance ID: ntp:d884b5186b651429 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/ntp.d/conf.yaml.default
      Total Runs: 1,133
      Metric Samples: Last Run: 1, Total: 1,133
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 1,133
      Average Execution Time : 491ms
      Last Execution Date : 2020-08-24 12:28:37.000000 UTC
      Last Successful Execution Date : 2020-08-24 12:28:37.000000 UTC

    prometheus (3.3.0)
    ------------------
      Instance ID: prometheus:neo4joperator:82eb03b232f67a9 [OK]
      Configuration Source: kubelet:docker://0baab6bf231c9e6e9aa557a018bde4a3b5e64ee2a2c515fd4e6efd62896dfd7e
      Total Runs: 253
      Metric Samples: Last Run: 126, Total: 31,878
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 253
      Average Execution Time : 19ms
      Last Execution Date : 2020-08-24 12:33:09.000000 UTC
      Last Successful Execution Date : 2020-08-24 12:33:09.000000 UTC

    uptime
    ------
      Instance ID: uptime [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/uptime.d/conf.yaml.default
      Total Runs: 67,938
      Metric Samples: Last Run: 1, Total: 67,938
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2020-08-24 12:33:02.000000 UTC
      Last Successful Execution Date : 2020-08-24 12:33:02.000000 UTC

  Loading Errors
  ==============
    neo4j_enterprise
    ----------------
      Core Check Loader:
        Check neo4j_enterprise not found in Catalog

      JMX Check Loader:
        check is not a jmx check, or unable to determine if it's so

      Python Check Loader:
        unable to import module 'neo4j_enterprise': No module named 'neo4j_enterprise'

========
JMXFetch
========

  Initialized checks
  ==================
    jmx
      instance_name : jmx-10.8.2.133-3637
      message : <no value>
      metric_count : 27
      service_check_count : 0
      status : OK
      instance_name : jmx-10.8.2.183-3637
      message : <no value>
      metric_count : 27
      service_check_count : 0
      status : OK
  Failed checks
  =============
    no checks

=========
Forwarder
=========

  Transactions
  ============
    CheckRunsV1: 67,938
    Connections: 0
    Containers: 0
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 8,591
    Metadata: 0
    Pods: 0
    Processes: 0
    RTContainers: 0
    RTProcesses: 0
    Requeued: 3
    Retried: 3
    RetryQueueSize: 0
    Series: 0
    ServiceChecks: 0
    SketchSeries: 0
    Success: 144,467
    TimeseriesV1: 67,938

  Transaction Errors
  ==================
    Total number: 3
    Errors By Type:

  HTTP Errors
  ==================
    Total number: 3
    HTTP Errors By Code:
      500: 3

  API Keys status
  ===============
    API key ending with 6fd1d: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - 6fd1d

==========
Logs Agent
==========

  Logs Agent is not running

=========
APM Agent
=========
  Status: Running
  Pid: 382
  Uptime: 1.019082e+06 seconds
  Mem alloc: 16,785,536 bytes
  Hostname: gke-prestaging-orch-000-orc-pre-m-cuv-f1706f3a-k82s.c.ne-prestaging-w80j.internal
  Receiver: 0.0.0.0:8126
  Endpoints:
    https://trace.agent.datadoghq.com

  Receiver (previous minute)
  ==========================
    No traces received in the previous minute.
    Default priority sampling rate: 100.0%

  Writer (previous minute)
  ========================
    Traces: 0 payloads, 0 traces, 0 events, 0 bytes
    Stats: 0 payloads, 0 stats buckets, 0 bytes

=========
Aggregator
=========
  Checks Metric Sample: 241,872,305
  Dogstatsd Metric Sample: 14,466,832
  Event: 5,651
  Events Flushed: 5,651
  Number Of Flushes: 67,938
  Series Flushed: 231,732,595
  Service Check: 1,564,093
  Service Checks Flushed: 1,632,015

=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 14,466,831
  Metric Parse Errors: 0
  Service Check Packets: 132,952
  Service Check Parse Errors: 0
  Udp Bytes: 5,120,405,215
  Udp Packet Reading Errors: 0
  Udp Packets: 4,600,796
  Uds Bytes: 0
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0

root@datadog-agent-tnz8z:/#
truthbk commented 4 years ago

Hi @jgrobbel, sometimes the agent's RSS can take over 24h to stabilize; this is due to GC behavior.

Can you by any chance increase the memory limits on the agent pod and confirm whether the RSS continues to increase past 24 hours? I'm not saying we're not leaking, but we're running 7.21.1 internally and haven't found any leaks; that said, the problem could also be in one of the integrations (perhaps something we don't use ourselves).
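
Something along these lines is what I mean by bumping the limit (the daemonset/container names and the 512Mi figure are only an example; pick whatever headroom your nodes allow), then keep an eye on kubectl top for a couple of days:

    # Raise the memory limit on the agent container of the daemonset
    # (object/container names assumed; adjust to your manifests)
    kubectl set resources daemonset datadog-agent --containers=agent --limits=memory=512Mi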

jgrobbel commented 4 years ago

@truthbk Sorry for the delay. I suspect you are right: in other places we are running agents without the leak, so it does seem related to one of the integrations, as you suggest. I will close this for now while I work on isolating where the leak is. Thanks.
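
Rough plan for narrowing it down, for anyone following along (the app=datadog-agent label is an assumption from our manifests; the idea is to correlate agent memory with the autodiscovered checks each node happens to run):

    # Per-container memory for every agent pod in the daemonset
    kubectl top pod -l app=datadog-agent --containers

    # Which node each agent pod is on, and which checks are scheduled on the
    # high-memory pods (autodiscovered haproxy/neo4j/prometheus vs. only the
    # default file-based checks)
    kubectl get pods -l app=datadog-agent -o wide
    kubectl exec datadog-agent-tnz8z -- agent status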