DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Agent cannot get Kubernetes metrics #5165

Closed by brpaz 4 years ago

brpaz commented 4 years ago

Output of the info page (if this is a bug)

===============
Agent (v7.17.1)
===============

  Status date: 2020-03-21 17:06:26.041825 UTC
  Agent start: 2020-03-21 16:54:02.487181 UTC
  Pid: 1
  Go Version: go1.12.9
  Python Version: 3.7.6
  Build arch: amd64
  Check Runners: 4
  Log Level: INFO

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: -307µs
    System UTC time: 2020-03-21 17:06:26.041825 UTC

  Host Info
  =========
    bootTime: 2020-03-15 11:02:13.000000 UTC
    kernelVersion: 4.19.0-0.bpo.6-amd64
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: bullseye/sid
    procs: 57
    uptime: 149h51m54s
    virtualizationRole: host
    virtualizationSystem: kvm

  Hostnames
  =========
    host_aliases: [default-1isj]
    hostname: default-1isj
    socket-fqdn: datadog-r9js9
    socket-hostname: datadog-r9js9
    hostname provider: container
    unused hostname providers:
      aws: not retrieving hostname from AWS: the host is not an ECS instance, and other providers already retrieve non-default hostnames
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

  Metadata
  ========
    hostname_source: container

=========
Collector
=========

  Running Checks
  ==============

    coredns (1.3.0)
    ---------------
      Instance ID: coredns:1c6ab5b610895c78 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/coredns.d/auto_conf.yaml
      Total Runs: 50
      Metric Samples: Last Run: 146, Total: 7,300
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 50
      Average Execution Time : 40ms

      Instance ID: coredns:94124f5626265a24 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/coredns.d/auto_conf.yaml
      Total Runs: 49
      Metric Samples: Last Run: 146, Total: 7,154
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 49
      Average Execution Time : 70ms

    cpu
    ---
      Instance ID: cpu [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
      Total Runs: 49
      Metric Samples: Last Run: 6, Total: 288
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

    disk (2.6.0)
    ------------
      Instance ID: disk:e5dffb8bef24336f [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/disk.d/conf.yaml.default
      Total Runs: 49
      Metric Samples: Last Run: 176, Total: 8,624
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 35ms

    docker
    ------
      Instance ID: docker [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/docker.d/conf.yaml.default
      Total Runs: 49
      Metric Samples: Last Run: 658, Total: 32,602
      Events: Last Run: 0, Total: 3
      Service Checks: Last Run: 1, Total: 49
      Average Execution Time : 119ms

    file_handle
    -----------
      Instance ID: file_handle [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/file_handle.d/conf.yaml.default
      Total Runs: 49
      Metric Samples: Last Run: 5, Total: 245
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

    io
    --
      Instance ID: io [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/io.d/conf.yaml.default
      Total Runs: 49
      Metric Samples: Last Run: 52, Total: 2,512
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

    kubelet (3.5.2)
    ---------------
      Instance ID: kubelet:d884b5186b651429 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubelet.d/conf.yaml.default
      Total Runs: 49
      Metric Samples: Last Run: 693, Total: 34,477
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 4, Total: 196
      Average Execution Time : 655ms

    kubernetes_state (5.1.0)
    ------------------------
      Instance ID: kubernetes_state:a93d366434448fb6 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_state.d/auto_conf.yaml
      Total Runs: 49
      Metric Samples: Last Run: 549, Total: 26,592
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 4, Total: 192
      Average Execution Time : 123ms

    load
    ----
      Instance ID: load [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/load.d/conf.yaml.default
      Total Runs: 49
      Metric Samples: Last Run: 6, Total: 294
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

    memory
    ------
      Instance ID: memory [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/memory.d/conf.yaml.default
      Total Runs: 50
      Metric Samples: Last Run: 17, Total: 850
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

    network (1.14.0)
    ----------------
      Instance ID: network:e0204ad63d43c949 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/network.d/conf.yaml.default
      Total Runs: 49
      Metric Samples: Last Run: 31, Total: 1,519
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 4ms

    nginx (3.6.0)
    -------------
      Instance ID: nginx:68ddc46d1d362075 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/nginx.yaml
      Total Runs: 49
      Metric Samples: Last Run: 7, Total: 343
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 49
      Average Execution Time : 8ms
      metadata:
        version.major: 1
        version.minor: 17
        version.patch: 8
        version.raw: 1.17.8
        version.scheme: semver

    ntp
    ---
      Instance ID: ntp:d884b5186b651429 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/ntp.d/conf.yaml.default
      Total Runs: 1
      Metric Samples: Last Run: 1, Total: 1
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 1
      Average Execution Time : 74ms

    postgres (3.5.0)
    ----------------
      Instance ID: postgres:f6f66762f46f9424 [ERROR]
      Configuration Source: file:/etc/datadog-agent/conf.d/postgres.yaml
      Total Runs: 49
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 49
      Average Execution Time : 11ms
      Error: FATAL:  password authentication failed for user "datadog"

      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/base/checks/base.py", line 673, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/postgres/postgres.py", line 728, in check
          self._connect(host, port, user, password, dbname, ssl, tags)
        File "/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/postgres/postgres.py", line 538, in _connect
          application_name="datadog-agent",
        File "/opt/datadog-agent/embedded/lib/python3.7/site-packages/psycopg2/__init__.py", line 126, in connect
          conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
      psycopg2.OperationalError: FATAL:  password authentication failed for user "datadog"

    uptime
    ------
      Instance ID: uptime [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/uptime.d/conf.yaml.default
      Total Runs: 50
      Metric Samples: Last Run: 1, Total: 50
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  Transactions
  ============
    CheckRunsV1: 49
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 8
    Metadata: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Series: 0
    ServiceChecks: 0
    SketchSeries: 0
    Success: 106
    TimeseriesV1: 49

  API Keys status
  ===============
    API key ending with 365e1: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - 365e1

==========
Logs Agent
==========

  Logs Agent is not running

=========
Aggregator
=========
  Checks Metric Sample: 126,607
  Dogstatsd Metric Sample: 1
  Event: 4
  Events Flushed: 4
  Number Of Flushes: 49
  Series Flushed: 111,872
  Service Check: 1,397
  Service Checks Flushed: 1,428

=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 0
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Bytes: 0
  Udp Packet Reading Errors: 0
  Udp Packets: 1
  Uds Bytes: 0
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0

Describe what happened:

The Datadog Agent is not monitoring Kubernetes correctly. I cannot see any container info on the "containers" page in Datadog; it says "No agent is reporting metric".

Describe what you expected:

I should see the metrics from my containers.

Steps to reproduce the issue:

Additional environment details (Operating System, Cloud provider, etc):

Error log

 ERROR | (pkg/collector/runner/runner.go:292 in work) | Error running check postgres: [{"message": "FATAL:  password authentication failed for user \"datadog\"\n", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/base/checks/base.py\", line 673, in run\n    self.check(instance)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/postgres/postgres.py\", line 728, in check\n    self._connect(host, port, user, password, dbname, ssl, tags)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/postgres/postgres.py\", line 538, in _connect\n    application_name=\"datadog-agent\",\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/psycopg2/__init__.py\", line 126, in connect\n    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)\npsycopg2.OperationalError: FATAL:  password authentication failed for user \"datadog\"\n\n"}]
2020-03-21 16:55:45 UTC | CORE | INFO | (pkg/collector/runner/runner.go:261 in work) | Running check kubernetes_state
2020-03-21 16:55:46 UTC | CORE | INFO | (pkg/collector/scheduler/scheduler.go:115 in Cancel) | Unscheduling check kubernetes_state:cdac589ca911be77
2020-03-21 16:55:55 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:292 in work) | Error running check kubernetes_state: [{"message": "HTTPConnectionPool(host='10.244.0.39', port=8080): Max retries exceeded with url: /metrics (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f331da22490>, 'Connection to 10.244.0.39 timed out. (connect timeout=10.0)'))", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/connection.py\", line 157, in _new_conn\n    (self._dns_host, self.port), self.timeout, **extra_kw\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/util/connection.py\", line 84, in create_connection\n    raise err\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/util/connection.py\", line 74, in create_connection\n    sock.connect(sa)\nsocket.timeout: timed out\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/connectionpool.py\", line 672, in urlopen\n    chunked=chunked,\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/connectionpool.py\", line 387, in _make_request\n    conn.request(method, url, **httplib_request_kw)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/http/client.py\", line 1252, in request\n    self._send_request(method, url, body, headers, encode_chunked)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/http/client.py\", line 1298, in _send_request\n    self.endheaders(body, encode_chunked=encode_chunked)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/http/client.py\", line 1247, in endheaders\n    self._send_output(message_body, encode_chunked=encode_chunked)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/http/client.py\", line 1026, in _send_output\n    self.send(msg)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/http/client.py\", line 966, in send\n    self.connect()\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/connection.py\", line 184, in connect\n    conn = self._new_conn()\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/connection.py\", line 164, in _new_conn\n    % (self.host, self.timeout),\nurllib3.exceptions.ConnectTimeoutError: (, 'Connection to 10.244.0.39 timed out. (connect timeout=10.0)')\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/requests/adapters.py\", line 449, in send\n    timeout=timeout\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/connectionpool.py\", line 720, in urlopen\n    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/util/retry.py\", line 436, in increment\n    raise MaxRetryError(_pool, url, error or ResponseError(cause))\nurllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.244.0.39', port=8080): Max retries exceeded with url: /metrics (Caused by ConnectTimeoutError(, 'Connection to 10.244.0.39 timed out. (connect timeout=10.0)'))\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/base/checks/base.py\", line 673, in run\n    self.check(instance)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/kubernetes_state/kubernetes_state.py\", line 133, in check\n    self.process(scraper_config, metric_transformers=self.METRIC_TRANSFORMERS)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 419, in process\n    for metric in self.scrape_metrics(scraper_config):\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 377, in scrape_metrics\n    response = self.poll(scraper_config)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 577, in poll\n    response = self.send_request(endpoint, scraper_config, headers)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 603, in send_request\n    return http_handler.get(endpoint, stream=True, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/base/utils/http.py\", line 274, in get\n    return self._request('get', url, options)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/base/utils/http.py\", line 309, in _request\n    return getattr(requests, method)(url, **self.populate_options(options))\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/requests/api.py\", line 75, in get\n    return request('get', url, params=params, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/requests/api.py\", line 60, in request\n    return session.request(method=method, url=url, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/requests/sessions.py\", line 533, in request\n    resp = self.send(prep, **send_kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/requests/sessions.py\", line 646, in send\n    r = adapter.send(request, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/requests/adapters.py\", line 504, in send\n    raise ConnectTimeout(e, request=request)\nrequests.exceptions.ConnectTimeout: HTTPConnectionPool(host='10.244.0.39', port=8080): Max retries exceeded with url: /metrics (Caused by ConnectTimeoutError(, 'Connection to 10.244.0.39 timed out. (connect timeout=10.0)'))\n"}]
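
The kubernetes_state error above is a plain TCP connect timeout to 10.244.0.39:8080, so one way to narrow it down is to repeat the connection attempt by hand from inside the agent pod. The following is a minimal sketch, not part of the agent: the helper name `can_connect` is hypothetical, and it assumes a Python interpreter is available in the same network namespace as the agent. The host, port, and 10-second timeout are taken from the log.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration; it only tests reachability and
    does not request /metrics or validate the response.
    """
    try:
        # create_connection raises socket.timeout (an OSError subclass) on
        # timeout and ConnectionRefusedError (also an OSError) on refusal.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call, using the address from the kubernetes_state error:
# can_connect("10.244.0.39", 8080)
```

If this returns False from the agent pod but True from another pod, the timeout is more likely a network or policy problem than a problem in the check itself.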
DylanLovesCoffee commented 4 years ago

Hey @brpaz, regarding the "containers page", are you referring to the Live Containers view? The data on that page is collected by the process-agent. You can enable the process-agent with the environment variable DD_PROCESS_AGENT_ENABLED=true or the chart value processAgent.enabled=true. In the latest chart version (currently looking at v2.1.1) this defaults to true.

brpaz commented 4 years ago

@DylanLovesCoffee Yes, that's what I was referring to. Ah, that explains why updating the chart to the latest version fixed my issue.

Thanks