DataDog / helm-charts

Helm charts for Datadog products
Apache License 2.0

[bug] Error collecting metrics from Pods with Prometheus endpoint #310

Open carlosjgp opened 3 years ago

carlosjgp commented 3 years ago

(Maybe it's an agent bug?)

Output of the info page (if this is a bug): the instructions for K8s don't work.

$ kexn datadog datadog-xdjv2 bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulting container name to agent.
Use 'kubectl describe pod/datadog-xdjv2 -n datadog' to see all of the containers in this pod.
root@datadog-xdjv2:/# s6-svstat /var/run/s6/services/agent/
s6-svstat: fatal: unable to read status for /var/run/s6/services/agent/: s6-supervise not running

agent status

root@datadog-xdjv2:/# agent status
Getting the status from the agent.

===============
Agent (v7.29.0)
===============

  Status date: 2021-07-08 09:34:15.283 UTC (1625736855283)
  Agent start: 2021-07-08 09:24:15.859 UTC (1625736255859)
  Pid: 1
  Go Version: go1.15.11
  Python Version: 3.8.10
  Build arch: amd64
  Agent flavor: agent
  Check Runners: 4
  Log Level: INFO

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: -32.787ms
    System time: 2021-07-08 09:34:15.283 UTC (1625736855283)

  Host Info
  =========
    bootTime: 2021-07-07 12:07:18 UTC (1625659638000)
    kernelArch: x86_64
    kernelVersion: 5.4.105-48.177.amzn2.x86_64
    os: linux
    platform: ubuntu
    platformFamily: debian
    platformVersion: 21.04
    procs: 795
    uptime: 21h17m9s

  Hostnames
  =========
    ec2-hostname: <redacted>
    host_aliases: [<redacted>]
    hostname: <redacted>
    instance-id: <redacted>
    socket-fqdn: datadog-xdjv2
    socket-hostname: datadog-xdjv2
    host tags:
      cluster_name:<redacted>
      env:<redacted>
      kube_cluster_name:<redacted>
      stack_name:<redacted>
      stack_type:<redacted>
    hostname provider: aws
    unused hostname providers:
      azure: azure_hostname_style is set to 'os'
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

  Metadata
  ========
    cloud_provider: AWS
    hostname_source: aws

=========
Collector
=========

  Running Checks
  ==============

    cpu
    ---
      Instance ID: cpu [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
      Total Runs: 40
      Metric Samples: Last Run: 9, Total: 353
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2021-07-08 09:34:09 UTC (1625736849000)
      Last Successful Execution Date : 2021-07-08 09:34:09 UTC (1625736849000)

    disk (4.3.0)
    ------------
      Instance ID: disk:e5dffb8bef24336f [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/disk.d/conf.yaml.default
      Total Runs: 39
      Metric Samples: Last Run: 220, Total: 8,580
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 72ms
      Last Execution Date : 2021-07-08 09:34:01 UTC (1625736841000)
      Last Successful Execution Date : 2021-07-08 09:34:01 UTC (1625736841000)

    docker
    ------
      Instance ID: docker [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/docker.d/conf.yaml.default
      Total Runs: 39
      Metric Samples: Last Run: 2,996, Total: 116,768
      Events: Last Run: 1, Total: 1
      Service Checks: Last Run: 1, Total: 39
      Average Execution Time : 326ms
      Last Execution Date : 2021-07-08 09:34:08 UTC (1625736848000)
      Last Successful Execution Date : 2021-07-08 09:34:08 UTC (1625736848000)

    elastic (3.0.0)
    ---------------
      Instance ID: elastic:1def8ccb02d4de4 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/elastic.d/auto_conf.yaml
      Total Runs: 39
      Metric Samples: Last Run: 247, Total: 9,633
      Events: Last Run: 1, Total: 1
      Service Checks: Last Run: 2, Total: 78
      Average Execution Time : 27ms
      Last Execution Date : 2021-07-08 09:34:01 UTC (1625736841000)
      Last Successful Execution Date : 2021-07-08 09:34:01 UTC (1625736841000)
      metadata:
        version.major: 7
        version.minor: 9
        version.patch: 3
        version.raw: 7.9.3
        version.scheme: semver

      Instance ID: elastic:84c75561b853236 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/elastic.d/auto_conf.yaml
      Total Runs: 40
      Metric Samples: Last Run: 247, Total: 9,880
      Events: Last Run: 1, Total: 1
      Service Checks: Last Run: 2, Total: 80
      Average Execution Time : 47ms
      Last Execution Date : 2021-07-08 09:34:10 UTC (1625736850000)
      Last Successful Execution Date : 2021-07-08 09:34:10 UTC (1625736850000)
      metadata:
        version.major: 7
        version.minor: 9
        version.patch: 3
        version.raw: 7.9.3
        version.scheme: semver

      Instance ID: elastic:8cb5820684ff7816 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/elastic.d/auto_conf.yaml
      Total Runs: 39
      Metric Samples: Last Run: 247, Total: 9,633
      Events: Last Run: 1, Total: 1
      Service Checks: Last Run: 2, Total: 78
      Average Execution Time : 24ms
      Last Execution Date : 2021-07-08 09:34:09 UTC (1625736849000)
      Last Successful Execution Date : 2021-07-08 09:34:09 UTC (1625736849000)
      metadata:
        version.major: 7
        version.minor: 9
        version.patch: 3
        version.raw: 7.9.3
        version.scheme: semver

      Instance ID: elastic:aa44bcca817f9046 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/elastic.d/auto_conf.yaml
      Total Runs: 39
      Metric Samples: Last Run: 247, Total: 9,633
      Events: Last Run: 1, Total: 1
      Service Checks: Last Run: 2, Total: 78
      Average Execution Time : 24ms
      Last Execution Date : 2021-07-08 09:34:02 UTC (1625736842000)
      Last Successful Execution Date : 2021-07-08 09:34:02 UTC (1625736842000)
      metadata:
        version.major: 7
        version.minor: 9
        version.patch: 3
        version.raw: 7.9.3
        version.scheme: semver

    file_handle
    -----------
      Instance ID: file_handle [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/file_handle.d/conf.yaml.default
      Total Runs: 40
      Metric Samples: Last Run: 5, Total: 200
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2021-07-08 09:34:15 UTC (1625736855000)
      Last Successful Execution Date : 2021-07-08 09:34:15 UTC (1625736855000)

    io
    --
      Instance ID: io [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/io.d/conf.yaml.default
      Total Runs: 39
      Metric Samples: Last Run: 91, Total: 3,486
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2021-07-08 09:34:07 UTC (1625736847000)
      Last Successful Execution Date : 2021-07-08 09:34:07 UTC (1625736847000)

    kubelet (7.0.0)
    ---------------
      Instance ID: kubelet:5bbc63f3938c02f4 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubelet.d/conf.yaml.default
      Total Runs: 30
      Metric Samples: Last Run: 3,070, Total: 92,026
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 4, Total: 120
      Average Execution Time : 1.362s
      Last Execution Date : 2021-07-08 09:34:05 UTC (1625736845000)
      Last Successful Execution Date : 2021-07-08 09:34:05 UTC (1625736845000)

    load
    ----
      Instance ID: load [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/load.d/conf.yaml.default
      Total Runs: 40
      Metric Samples: Last Run: 6, Total: 240
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2021-07-08 09:34:14 UTC (1625736854000)
      Last Successful Execution Date : 2021-07-08 09:34:14 UTC (1625736854000)

    memory
    ------
      Instance ID: memory [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/memory.d/conf.yaml.default
      Total Runs: 39
      Metric Samples: Last Run: 18, Total: 702
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2021-07-08 09:34:06 UTC (1625736846000)
      Last Successful Execution Date : 2021-07-08 09:34:06 UTC (1625736846000)

    network (2.1.2)
    ---------------
      Instance ID: network:e0204ad63d43c949 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/network.d/conf.yaml.default
      Total Runs: 40
      Metric Samples: Last Run: 385, Total: 15,400
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 6ms
      Last Execution Date : 2021-07-08 09:34:13 UTC (1625736853000)
      Last Successful Execution Date : 2021-07-08 09:34:13 UTC (1625736853000)

    ntp
    ---
      Instance ID: ntp:d884b5186b651429 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/ntp.d/conf.yaml.default
      Total Runs: 1
      Metric Samples: Last Run: 1, Total: 1
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 1
      Average Execution Time : 66ms
      Last Execution Date : 2021-07-08 09:24:24 UTC (1625736264000)
      Last Successful Execution Date : 2021-07-08 09:24:24 UTC (1625736264000)

    prometheus (3.3.1)
    ------------------
      Instance ID: prometheus:<redacted>:510d683f42d394ac [ERROR]
      Configuration Source: kubelet:docker://9e4c3c2224f58bb014afbbc1a64ade2c4fdb4c938ab413a30044ffd09504e14e
      Total Runs: 40
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 40
      Average Execution Time : 4ms
      Last Execution Date : 2021-07-08 09:34:12 UTC (1625736852000)
      Last Successful Execution Date : Never
      Error: 406 Client Error: Not Acceptable for url: http://10.129.31.28:11000/metrics
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 999, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/prometheus/base_check.py", line 106, in check
          scraper.process(
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/prometheus/mixins.py", line 408, in process
          for metric in self.scrape_metrics(endpoint, instance=instance):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/prometheus/mixins.py", line 370, in scrape_metrics
          response = self.poll(endpoint, instance=instance)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/prometheus/mixins.py", line 592, in poll
          response.raise_for_status()
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/models.py", line 940, in raise_for_status
          raise HTTPError(http_error_msg, response=self)
      requests.exceptions.HTTPError: 406 Client Error: Not Acceptable for url: http://10.129.31.28:11000/metrics
      Instance ID: prometheus:<redacted>:59a8726baec0a559 [ERROR]
      Configuration Source: kubelet:docker://f44fb2221f0fd7328ce7a304bdfaecc4abd6a68066a7b5d4fa129cd927d9cabd
      Total Runs: 39
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 39
      Average Execution Time : 3ms
      Last Execution Date : 2021-07-08 09:34:03 UTC (1625736843000)
      Last Successful Execution Date : Never
      Error: 406 Client Error: Not Acceptable for url: http://10.129.19.53:11000/metrics
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 999, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/prometheus/base_check.py", line 106, in check
          scraper.process(
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/prometheus/mixins.py", line 408, in process
          for metric in self.scrape_metrics(endpoint, instance=instance):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/prometheus/mixins.py", line 370, in scrape_metrics
          response = self.poll(endpoint, instance=instance)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/prometheus/mixins.py", line 592, in poll
          response.raise_for_status()
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/models.py", line 940, in raise_for_status
          raise HTTPError(http_error_msg, response=self)
      requests.exceptions.HTTPError: 406 Client Error: Not Acceptable for url: http://10.129.19.53:11000/metrics
      Instance ID: prometheus:<redacted>:a42fee3bceaf84fa [ERROR]
      Configuration Source: kubelet:docker://aaebfc282afa5f23a71ce50aac061e6e79f0a20d82354810c6ffd5414d6e1724
      Total Runs: 39
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 39
      Average Execution Time : 3ms
      Last Execution Date : 2021-07-08 09:34:04 UTC (1625736844000)
      Last Successful Execution Date : Never
      Error: 406 Client Error: Not Acceptable for url: http://10.129.14.23:11000/metrics
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 999, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/prometheus/base_check.py", line 106, in check
          scraper.process(
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/prometheus/mixins.py", line 408, in process
          for metric in self.scrape_metrics(endpoint, instance=instance):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/prometheus/mixins.py", line 370, in scrape_metrics
          response = self.poll(endpoint, instance=instance)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/prometheus/mixins.py", line 592, in poll
          response.raise_for_status()
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/models.py", line 940, in raise_for_status
          raise HTTPError(http_error_msg, response=self)
      requests.exceptions.HTTPError: 406 Client Error: Not Acceptable for url: http://10.129.14.23:11000/metrics
      Instance ID: prometheus:node_problem_detector:26da75a46c11211e [OK]
      Configuration Source: kubelet:docker://5ec7d480ff3c596cfc989f2a0dc577cb3c48db0d40ed81d2ef22c8379a163f45
      Total Runs: 40
      Metric Samples: Last Run: 18, Total: 720
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 40
      Average Execution Time : 4ms
      Last Execution Date : 2021-07-08 09:34:11 UTC (1625736851000)
      Last Successful Execution Date : 2021-07-08 09:34:11 UTC (1625736851000)

    uptime
    ------
      Instance ID: uptime [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/uptime.d/conf.yaml.default
      Total Runs: 39
      Metric Samples: Last Run: 1, Total: 39
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2021-07-08 09:34:05 UTC (1625736845000)
      Last Successful Execution Date : 2021-07-08 09:34:05 UTC (1625736845000)

========
JMXFetch
========

  Information
  ==================
  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  Transactions
  ============
    Clusters: 0
    CronJobs: 0
    DaemonSets: 0
    Deployments: 0
    Dropped: 0
    DroppedOnInput: 0
    Jobs: 0
    Nodes: 0
    Pods: 0
    ReplicaSets: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Services: 0

  Transaction Successes
  =====================
    Total number: 110
    Successes By Endpoint:
      check_run_v1: 39
      intake: 5
      series_v1: 66

  API Keys status
  ===============
    API key ending with 330c5: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - 330c5

==========
Logs Agent
==========

  Logs Agent is not running

=========
APM Agent
=========
  Status: Running
  Pid: 1
  Uptime: 596 seconds
  Mem alloc: 14,369,320 bytes
  Hostname: i-0c35ba86c8f057721
  Receiver: 0.0.0.0:8126
  Endpoints:
    https://trace.agent.datadoghq.com

  Receiver (previous minute)
  ==========================
    From python 3.8.11 (CPython), client 0.37.3
      Traces received: 98 (37,225 bytes)
      Spans received: 98

    From python 3.8.11 (CPython), client 0.46.0
      Traces received: 150 (110,097 bytes)
      Spans received: 198

    From python 3.8.10 (CPython), client 0.46.0
      Traces received: 112 (220,020 bytes)
      Spans received: 952

    From python 3.8.10 (CPython), client 0.37.3
      Traces received: 56 (17,360 bytes)
      Spans received: 56

    From python 3.8.10 (CPython), client 0.40.2
      Traces received: 28 (9,772 bytes)
      Spans received: 28

    From python 3.8.11 (CPython), client 0.45.0
      Traces received: 2 (1,539 bytes)
      Spans received: 4

    From python 3.8.2 (CPython), client 0.48.0
      Traces received: 42 (46,172 bytes)
      Spans received: 126

    Default priority sampling rate: 100.0%
    Priority sampling rate for 'service:<redacted>,env:': 100.0%
    Priority sampling rate for 'service:<redacted>,env:<redacted>': 100.0%
    Priority sampling rate for 'service:<redacted>-manager,env:': 100.0%
    Priority sampling rate for 'service:<redacted>-manager,env:<redacted>': 100.0%
    Priority sampling rate for 'service:<redacted>,env:': 100.0%
    Priority sampling rate for 'service:<redacted>,env:<redacted>': 100.0%
    Priority sampling rate for 'service:<redacted>,env:': 100.0%
    Priority sampling rate for 'service:<redacted>,env:<redacted>': 100.0%
    Priority sampling rate for 'service:<redacted>,env:': 100.0%
    Priority sampling rate for 'service:<redacted>,env:<redacted>': 100.0%
    Priority sampling rate for 'service:<redacted>,env:': 100.0%
    Priority sampling rate for 'service:<redacted>,env:<redacted>': 100.0%
    Priority sampling rate for 'service:<redacted>,env:': 100.0%
    Priority sampling rate for 'service:<redacted>,env:<redacted>': 100.0%
    Priority sampling rate for 'service:<redacted>,env:': 100.0%
    Priority sampling rate for 'service:<redacted>,env:<redacted>': 100.0%
    Priority sampling rate for 'service:<redacted>,env:': 100.0%
    Priority sampling rate for 'service:<redacted>,env:<redacted>': 100.0%
    Priority sampling rate for 'service:<redacted>,env:': 100.0%
    Priority sampling rate for 'service:<redacted>,env:<redacted>': 100.0%
    Priority sampling rate for 'service:<redacted>,env:': 100.0%
    Priority sampling rate for 'service:<redacted>,env:<redacted>': 100.0%
    Priority sampling rate for 'service:<redacted>,env:': 100.0%
    Priority sampling rate for 'service:<redacted>,env:<redacted>': 100.0%
    Priority sampling rate for 'service:<redacted>,env:': 100.0%
    Priority sampling rate for 'service:<redacted>,env:<redacted>': 100.0%
    Priority sampling rate for 'service:<redacted>,env:': 100.0%
    Priority sampling rate for 'service:<redacted>,env:<redacted>': 100.0%

  Writer (previous minute)
  ========================
    Traces: 0 payloads, 0 traces, 0 events, 0 bytes
    Stats: 0 payloads, 0 stats buckets, 0 bytes

=========
Aggregator
=========
  Checks Metric Sample: 278,696
  Dogstatsd Metric Sample: 19,585
  Event: 6
  Events Flushed: 6
  Number Of Flushes: 39
  Series Flushed: 272,422
  Service Check: 1,333
  Service Checks Flushed: 1,346
=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 19,584
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Bytes: 2,840,508
  Udp Packet Reading Errors: 0
  Udp Packets: 5,892
  Uds Bytes: 0
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 1
  Unterminated Metric Errors: 0

=====================
Datadog Cluster Agent
=====================

  - Datadog Cluster Agent endpoint detected: https://172.20.36.167:5005
  Successfully connected to the Datadog Cluster Agent.
  - Running: 1.13.1+commit.b6652eb

Describe what happened: The error below appears in the agent logs, and the metrics are missing.

2021-07-08 09:25:54 UTC | CORE | INFO | (pkg/collector/runner/runner.go:336 in work) | check:elastic | Done running check, next runs will be logged every 500 runs
2021-07-08 09:25:57 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:301 in work) | Error running check prometheus: [{"message": "406 Client Error: Not Acceptable for url: http://10.129.31.28:11000/metrics", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 999, in run\n    self.check(instance)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/prometheus/base_check.py\", line 106, in check\n    scraper.process(\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/prometheus/mixins.py\", line 408, in process\n    for metric in self.scrape_metrics(endpoint, instance=instance):\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/prometheus/mixins.py\", line 370, in scrape_metrics\n    response = self.poll(endpoint, instance=instance)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/prometheus/mixins.py\", line 592, in poll\n    response.raise_for_status()\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/models.py\", line 940, in raise_for_status\n    raise HTTPError(http_error_msg, response=self)\nrequests.exceptions.HTTPError: 406 Client Error: Not Acceptable for url: http://10.129.31.28:11000/metrics\n"}]

Annotations on the Pods:

 apiVersion: v1
 kind: Pod
 metadata:
   annotations:
     ad.datadoghq.com/<redacted>.check_names: |
       ["prometheus"]
     ad.datadoghq.com/<redacted>.init_configs: |
       [{}]
     ad.datadoghq.com/<redacted>.instances: |
       [
         {
           "prometheus_url": "http://%%host%%:11000/metrics",
           "namespace": "<redacted>",
           "metrics": [
             "*"
           ]
         }
       ]

This endpoint works when I port-forward it locally:

$ http :11000/metrics
HTTP/1.1 200 OK
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: origin, content-type, accept, authorization
Access-Control-Allow-Methods: GET, POST, PUT, DELETE, OPTIONS, HEAD
Access-Control-Allow-Origin: *
Access-Control-Max-Age: 1209600
Connection: keep-alive
Content-Length: 85548
Content-Type: text/plain
Date: Thu, 08 Jul 2021 11:12:17 GMT

# HELP base_cpu_processCpuLoad Displays the "recent cpu usage" for the Java Virtual Machine process.
# TYPE base_cpu_processCpuLoad gauge
base_cpu_processCpuLoad 7.754197217649737E-4
# HELP base_memory_committedNonHeap_bytes Displays the amount of memory that is committed for the Java virtual machine to use.
# TYPE base_memory_committedNonHeap_bytes gauge
...
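
HTTP 406 is the status a server returns when it cannot produce a response that matches the request's Accept headers, so one way to narrow this down is to compare how the endpoint answers different Accept values. The sketch below is only a diagnostic aid: the candidate header values are illustrative guesses (not taken from the agent's source), and the thread does not establish that the Accept header is the cause here.

```python
import urllib.error
import urllib.request

# Illustrative Accept values to compare; these are assumptions,
# not the headers the Datadog agent is known to send.
ACCEPT_CANDIDATES = ["*/*", "text/plain", "application/json"]


def build_request(url, accept):
    """Build a GET request for `url` carrying the given Accept header."""
    return urllib.request.Request(url, headers={"Accept": accept})


def probe(url, accept, timeout=5.0):
    """Return the HTTP status code the endpoint gives for this Accept header."""
    try:
        with urllib.request.urlopen(build_request(url, accept), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (including 406) are raised as HTTPError.
        return err.code


def probe_all(url, candidates=ACCEPT_CANDIDATES):
    """Map each candidate Accept value to the status code it produces."""
    return {accept: probe(url, accept) for accept in candidates}
```

With the port-forward from above still running, `probe_all("http://localhost:11000/metrics")` would show which Accept values, if any, make the endpoint answer 406.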

Helm chart deployed using these values:

clusterAgent:
  enabled: true
  token: <TOKEN>
datadog:
  apiKey: <KEY>
  apm:
    enabled: true
  clusterChecks:
    enabled: true
  dogstatsd:
    nonLocalTraffic: true
    useHostPort: true
  kubeStateMetricsEnabled: false
  kubeStateMetricsCore:
    enabled: true
    ignoreLegacyKSMCheck: true
  logs:
    enabled: false
  processAgent:
    enabled: true
    processCollection: true
  site: datadoghq.com
  tags:
  - cluster_name:<CLUSTER_NAME>
  - env:<STACK_NAME>-<STACK_TYPE>
  - stack_name:<STACK_NAME>
  - stack_type:<STACK_TYPE>
registry: public.ecr.aws/datadog

Describe what you expected: Metrics are collected correctly

Steps to reproduce the issue: Upgrade the datadog chart from 2.10.3 to 2.18.0, 2.18.1, 2.18.2, and finally 2.18.3, following the guides.

Additional environment details (Operating System, Cloud provider, etc): EKS AWS 1.20

carlosjgp commented 3 years ago

I've created a case with support and added a "flare" file to it (522689)

carlosjgp commented 3 years ago

In case you come across this ticket...

It turns out that metric collection from Prometheus endpoints does not support a bare wildcard on its own, which is what I had configured above:

           "metrics": [
             "*"
           ]

This seems to be a safeguard so that Datadog users don't find a heavily inflated invoice at the end of the month, since Datadog charges extra for custom metrics:

If a metric is not submitted from one of the more than 450 Datadog integrations it’s considered a custom metric(1).

That leaves me having to configure something like:

           "metrics": [
             "a*",
             "b*",
             "c*",
             "d*",
...
             "z*"
           ]

to be able to ingest all the metrics from this deployment.
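
Rather than typing the 26 patterns by hand, the list can be generated. This is a minimal sketch, assuming the same annotation layout as in the issue above; the function names and the `my_app` namespace are hypothetical placeholders. It only covers lowercase first letters, as in the workaround, so metric names starting with other characters would need extra prefixes passed in.

```python
import json
import string


def prefix_patterns(prefixes=string.ascii_lowercase):
    """Return one 'x*' pattern per starting character (a* ... z* by default)."""
    return [f"{char}*" for char in prefixes]


def instances_annotation(namespace, prometheus_url="http://%%host%%:11000/metrics"):
    """Build the JSON value for the ad.datadoghq.com/<container>.instances annotation."""
    instance = {
        "prometheus_url": prometheus_url,
        "namespace": namespace,
        "metrics": prefix_patterns(),
    }
    return json.dumps([instance], indent=2)


# "my_app" is a placeholder; the real namespace is redacted in the issue.
print(instances_annotation("my_app"))
```

The printed JSON can be pasted as the value of the `instances` annotation in place of the single `"*"` entry.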

andrewloux commented 2 years ago

God bless you @carlosjgp 😂. I was using a wildcard ".*" in a similar fashion and going around in circles trying to understand why metric samples stayed at 0. This behaviour should be documented.