DataDog / datadog-agent


DataDog Cluster Agent v7.46.0 has slow memory leak #21726

Open DLakin01 opened 10 months ago

DLakin01 commented 10 months ago

Agent Environment

We are running the DataDog cluster agent as a deployment on AWS EKS, K8s version 1.26. Here is the output of agent status on one of the pods experiencing the leak:

2023-12-21 18:51:51 UTC | CLUSTER | WARN | (pkg/util/log/log.go:618 in func1) | Unknown environment variable: DD_POD_NAME
2023-12-21 18:51:51 UTC | CLUSTER | WARN | (pkg/util/log/log.go:618 in func1) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec
Getting the status from the agent.
2023-12-21 18:51:51 UTC | CLUSTER | INFO | (pkg/util/log/log.go:590 in func1) | 2 Features detected from environment: kubernetes,orchestratorexplorer

===============================
Datadog Cluster Agent (v7.46.0)
===============================

  Status date: 2023-12-21 18:51:51.612 UTC (1703184711612)
  Agent start: 2023-10-12 17:34:18.794 UTC (1697132058794)
  Pid: 1
  Go Version: go1.19.10
  Build arch: amd64
  Agent flavor: cluster_agent
  Check Runners: 4
  Log Level: INFO

  Paths
  =====
    Config File: /etc/datadog-agent/datadog-cluster.yaml
    conf.d: /etc/datadog-agent/conf.d

  Clocks
  ======
    System time: 2023-12-21 18:51:51.612 UTC (1703184711612)

  Hostnames
  =========
    host_aliases: [i-0286ce77ff1a8d387]
    hostname: ip-10-196-85-58.us-west-2.compute.internal-xxxxxxxx-xxx-xxx
    socket-fqdn: datadog-cluster-agent-768d84fffd-2phwf
    socket-hostname: datadog-cluster-agent-768d84fffd-2phwf
    hostname provider: container
    unused hostname providers:
      'hostname' configuration/environment: hostname is empty
      'hostname_file' configuration/environment: 'hostname_file' configuration is not enabled
      aws: Unable to determine hostname from EC2: status code 401 trying to GET http://169.254.169.254/latest/meta-data/instance-id
      azure: azure_hostname_style is set to 'os'
      fargate: agent is not runnning on Fargate
      fqdn: FQDN hostname is not usable
      gce: unable to retrieve hostname from GCE: GCE metadata API error: status code 401 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname
      os: OS hostname is not usable

  Metadata
  ========

Leader Election
===============
  Leader Election Status:  Running
  Leader Name is: datadog-cluster-agent-768d84fffd-2phwf
  Last Acquisition of the lease: Thu, 12 Oct 2023 17:35:01 UTC
  Renewed leadership: Thu, 21 Dec 2023 18:51:51 UTC
  Number of leader transitions: 41 transitions

Custom Metrics Server
=====================

  Data sources
  ------------
  URL: https://api.datadoghq.com

  External metrics provider uses DatadogMetric - Check status directly from Kubernetes with: `kubectl get datadogmetric`

Cluster Checks Dispatching
==========================
  Status: Leader, serving requests
  Active agents: 2
  Check Configurations: 3
    - Dispatched: 3
    - Unassigned: 0

Admission Controller
====================

    Webhooks info
    -------------
      MutatingWebhookConfigurations name: datadog-webhook
      Created at: 2022-08-10T15:53:20Z
      ---------
        Name: datadog.webhook.auto.instrumentation
        CA bundle digest: e69965098279567
        Object selector: &LabelSelector{MatchLabels:map[string]string{admission.datadoghq.com/enabled: true,},MatchExpressions:[]LabelSelectorRequirement{},}
        Rule 1: Operations: [CREATE] - APIGroups: [] - APIVersions: [v1] - Resources: [pods]
        Service: monitoring/datadog-cluster-agent-admission-controller - Port: 443 - Path: /injectlib
      ---------
        Name: datadog.webhook.config
        CA bundle digest: e69965098279567
        Object selector: &LabelSelector{MatchLabels:map[string]string{admission.datadoghq.com/enabled: true,},MatchExpressions:[]LabelSelectorRequirement{},}
        Rule 1: Operations: [CREATE] - APIGroups: [] - APIVersions: [v1] - Resources: [pods]
        Service: monitoring/datadog-cluster-agent-admission-controller - Port: 443 - Path: /injectconfig
      ---------
        Name: datadog.webhook.tags
        CA bundle digest: e69965098279567
        Object selector: &LabelSelector{MatchLabels:map[string]string{admission.datadoghq.com/enabled: true,},MatchExpressions:[]LabelSelectorRequirement{},}
        Rule 1: Operations: [CREATE] - APIGroups: [] - APIVersions: [v1] - Resources: [pods]
        Service: monitoring/datadog-cluster-agent-admission-controller - Port: 443 - Path: /injecttags

    Secret info
    -----------
    Secret name: webhook-certificate
    Secret namespace: monitoring
    Created at: 2022-08-10T15:53:19Z
    CA bundle digest: e69965098279567
    Duration before certificate expiration: 4845h6m20.370172864s

=========
Collector
=========

  Running Checks
  ==============

    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_apiserver.d/conf.yaml.default
      Total Runs: 403,510
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 1, Total: 353,678
      Service Checks: Last Run: 5, Total: 2,017,355
      Average Execution Time : 1.909s
      Last Execution Date : 2023-12-21 18:51:37 UTC (1703184697000)
      Last Successful Execution Date : 2023-12-21 18:51:37 UTC (1703184697000)

    orchestrator
    ------------
      Instance ID: orchestrator:c640d4e943da6c1d [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/orchestrator.d/conf.yaml.default
      Total Runs: 605,266
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 50ms
      Last Execution Date : 2023-12-21 18:51:50 UTC (1703184710000)
      Last Successful Execution Date : 2023-12-21 18:51:50 UTC (1703184710000)

=========
Forwarder
=========

  Transactions
  ============
    Cluster: 605,260
    ClusterRole: 81,883
    ClusterRoleBinding: 73,395
    CronJob: 196,156
    CustomResource: 0
    CustomResourceDefinition: 0
    DaemonSet: 54,854
    Deployment: 72,217
    Dropped: 0
    HighPriorityQueueFull: 0
    Ingress: 32,507
    Job: 298,182
    Namespace: 605,260
    Node: 108,119
    OrchestratorManifest: 702,578
    PersistentVolume: 0
    PersistentVolumeClaim: 0
    Pod: 7
    ReplicaSet: 125,489
    Requeued: 23
    Retried: 23
    RetryQueueSize: 0
    Role: 37,464
    RoleBinding: 38,413
    Service: 52,952
    ServiceAccount: 85,446
    StatefulSet: 0
    VerticalPodAutoscaler: 0

  Transaction Successes
  =====================
    Total number: 4.088426e+06
    Successes By Endpoint:
      check_run_v1: 403,510
      intake: 111,224
      orchestrator: 2,467,604
      orchmanifest: 702,578
      series_v2: 403,510

  Transaction Errors
  ==================
    Total number: 3
    Errors By Type:

  On-disk storage
  ===============
    On-disk storage is disabled. Configure `forwarder_storage_max_size_in_bytes` to enable it.

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - 990b0

=====================
Orchestrator Explorer
=====================
  Collection Status: The collection is at least partially running since the cache has been populated.
  Cluster Name: xxxxxxxxxx-xxx-xxx
  Cluster ID: 8d31b2bb-1676-4d67-ac9b-4c13fc608d2e
  Container scrubbing: enabled
  Manifest collection: enabled

  ======================
  Orchestrator Endpoints
  ======================
    https://orchestrator.datadoghq.com - API Key ending with: 990b0

  ===========
  Cache Stats
  ===========
    Elements in the cache: 517

    ClusterRoleBinding
      Last Run: (Hits: 105 Miss: 1) | Total: (Hits: 6.0328655e+07 Miss: 3.421972e+06)

    ClusterRole
      Last Run: (Hits: 120 Miss: 1) | Total: (Hits: 6.8920838e+07 Miss: 3.908689e+06)

    Cluster
      Last Run: (Hits: 0 Miss: 1) | Total: (Hits: 0 Miss: 605260)

    CronJob
      Last Run: (Hits: 4 Miss: 0) | Total: (Hits: 2.191971e+06 Miss: 229069)

    DaemonSet
      Last Run: (Hits: 7 Miss: 0) | Total: (Hits: 3.623867e+06 Miss: 206014)

    Deployment
      Last Run: (Hits: 18 Miss: 0) | Total: (Hits: 1.0309211e+07 Miss: 585481)

    Ingress
      Last Run: (Hits: 2 Miss: 0) | Total: (Hits: 760427 Miss: 43156)

    Job
      Last Run: (Hits: 15 Miss: 1) | Total: (Hits: 8.684529e+06 Miss: 636450)

    Namespace
      Last Run: (Hits: 11 Miss: 1) | Total: (Hits: 6.29963e+06 Miss: 963490)

    Node
      Last Run: (Hits: 3 Miss: 0) | Total: (Hits: 1.69772e+06 Miss: 118727)

    PersistentVolumeClaim
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

    PersistentVolume
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

    Pod
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 10)

    ReplicaSet
      Last Run: (Hits: 55 Miss: 1) | Total: (Hits: 4.349065e+07 Miss: 2.4678e+06)

    RoleBinding
      Last Run: (Hits: 30 Miss: 0) | Total: (Hits: 1.7183497e+07 Miss: 974318)

    Role
      Last Run: (Hits: 27 Miss: 0) | Total: (Hits: 1.5465208e+07 Miss: 876829)

    ServiceAccount
      Last Run: (Hits: 75 Miss: 0) | Total: (Hits: 4.2567131e+07 Miss: 2.420444e+06)

    Service
      Last Run: (Hits: 18 Miss: 0) | Total: (Hits: 9.923741e+06 Miss: 563996)

    StatefulSet
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

    VerticalPodAutoscaler
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

  =====================
  Manifest Buffer Stats
  =====================
  Buffer Flushed : 702580 times
  Last Time Flushed Manifests : 64
  ==============================
  Manifests Flushed Per Resource
  ==============================
    ClusterRole : 3.908689e+06
    ClusterRoleBinding : 3.421972e+06
    CronJob : 229069
    DaemonSet : 206014
    Deployment : 585481
    Ingress : 43156
    Job : 636450
    Namespace : 963490
    Node : 118727
    Pod : 10
    ReplicaSet : 2.4678e+06
    Role : 876829
    RoleBinding : 974318
    Service : 563996
    ServiceAccount : 2.420444e+06

Describe what happened:

The DataDog cluster agent is experiencing a slow memory leak, as can clearly be seen in the DataDog graphs below:

[Image: % of CPU requests over three months (each line is a single deployment)]

[Image: % of memory requests over three months (each line is a single deployment)]

Describe what you expected: Memory and CPU usage to hold roughly steady over time rather than increasing indefinitely.

Steps to reproduce the issue: Unknown, but this is occurring across all our EKS clusters

Additional environment details (Operating System, Cloud provider, etc): AWS EKS, nodes are running an AMI built on Amazon Linux 2
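For anyone trying to narrow this down: the cluster agent is a Go process, so the usual way to confirm whether the growth is actually in the Go heap (as opposed to RSS the runtime simply hasn't returned to the OS) is to compare heap stats or heap profiles over time. I haven't verified how, or whether, the cluster agent exposes pprof in this setup, so the snippet below is a standalone, generic sketch of that approach rather than anything specific to the agent; the port and logging interval are arbitrary.

```go
// Generic sketch, not taken from the cluster agent: log Go heap statistics
// periodically and expose net/http/pprof so heap profiles can be captured
// and diffed over time.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux
	"runtime"
	"time"
)

func main() {
	// Serve pprof endpoints so heap snapshots can be taken at two points in
	// time, e.g. with `go tool pprof http://localhost:6060/debug/pprof/heap`.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// Steady growth of HeapAlloc/HeapObjects (or goroutine count) across days
	// points at a leak in the Go code itself rather than allocator behavior.
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("heap_alloc=%d heap_objects=%d num_gc=%d goroutines=%d",
			m.HeapAlloc, m.HeapObjects, m.NumGC, runtime.NumGoroutine())
	}
}
```

With two heap snapshots taken a few days apart, something like `go tool pprof -base old.pb.gz new.pb.gz` should show which allocation sites account for the growth.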

apptio-speravali commented 9 months ago

I also have a similar issue.

lteixeira-dock commented 9 months ago

I have the same issue.

AWS EKS 1.26 running "gcr.io/datadoghq/cluster-agent:7.46.0" in Fargate pods.

DanielCastronovo commented 3 months ago

Same issue

changhyuni commented 1 month ago

I have the same issue.

jessgoldq4 commented 1 month ago

Same issue on the latest version (v7.57.2). Does anyone know a version of the datadog-agent that doesn't have this issue?