DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Issues/regressions migrating from agent 6 to agent 7 #5418

Open KarthikRangaraju opened 4 years ago

KarthikRangaraju commented 4 years ago

Output of the info page (if this is a bug)

agent status
Getting the status from the agent.

===============
Agent (v7.18.1)
===============

  Status date: 2020-04-27 18:55:55.977281 UTC
  Agent start: 2020-04-27 18:32:05.470769 UTC
  Pid: 370
  Go Version: go1.12.9
  Python Version: 3.8.1
  Build arch: amd64
  Check Runners: 4
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    System UTC time: 2020-04-27 18:55:55.977281 UTC

  Host Info
  =========
    bootTime: 2020-02-05 22:21:02.000000 UTC
    kernelVersion: 5.5.0-1.el7.elrepo.x86_64
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: bullseye/sid
    procs: 73
    uptime: 1964h11m9s
    virtualizationRole: guest
    virtualizationSystem: docker

  Hostnames
  =========
    host_aliases: [den-iac-opstest-kube-node01.ops-test.clh-int.com]
    hostname: den-iac-opstest-kube-node01.ops-test.clh-int.com
    socket-fqdn: 7b78fcc56df2
    socket-hostname: 7b78fcc56df2
    host tags:
      environment:ops-test
      owner:iac
      agent_type:node
      docker_swarm_node_role:manager
    hostname provider: container
    unused hostname providers:
      aws: not retrieving hostname from AWS: the host is not an ECS instance, and other providers already retrieve non-default hostnames
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: Get http://169.254.169.254/computeMetadata/v1/instance/hostname: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

  Metadata
  ========
    hostname_source: container

=========
Collector
=========

  Running Checks
  ==============

    cpu
    ---
      Instance ID: cpu [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
      Total Runs: 95
      Metric Samples: Last Run: 6, Total: 564
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2020-04-27 18:55:41.000000 UTC
      Last Successful Execution Date : 2020-04-27 18:55:41.000000 UTC

    disk (2.7.0)
    ------------
      Instance ID: disk:e5dffb8bef24336f [ERROR]
      Configuration Source: file:/etc/datadog-agent/conf.d/disk.d/conf.yaml.default
      Total Runs: 95
      Metric Samples: Last Run: 640, Total: 60,800
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 249ms
      Last Execution Date : 2020-04-27 18:55:48.000000 UTC
      Last Successful Execution Date : Never
      Error: not sure how to interpret line '   8       0 sda 45066 9000 10004532 512126 138429438 193746244 3432370168 440317261 0 127455390 368574478 0 0 0 0 0 0\n'
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 713, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/disk/disk.py", line 121, in check
          self.collect_latency_metrics()
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/disk/disk.py", line 244, in collect_latency_metrics
          for disk_name, disk in iteritems(psutil.disk_io_counters(True)):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/psutil/__init__.py", line 2168, in disk_io_counters
          rawdict = _psplatform.disk_io_counters(**kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/psutil/_pslinux.py", line 1125, in disk_io_counters
          for entry in gen:
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/psutil/_pslinux.py", line 1098, in read_procfs
          raise ValueError("not sure how to interpret line %r" % line)
      ValueError: not sure how to interpret line '   8       0 sda 45066 9000 10004532 512126 138429438 193746244 3432370168 440317261 0 127455390 368574478 0 0 0 0 0 0\n'

    docker
    ------
      Instance ID: docker [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/docker.d/conf.yaml.default
      Total Runs: 95
      Metric Samples: Last Run: 355, Total: 33,685
      Events: Last Run: 0, Total: 3
      Service Checks: Last Run: 1, Total: 95
      Average Execution Time : 166ms
      Last Execution Date : 2020-04-27 18:55:55.000000 UTC
      Last Successful Execution Date : 2020-04-27 18:55:55.000000 UTC

    file_handle
    -----------
      Instance ID: file_handle [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/file_handle.d/conf.yaml.default
      Total Runs: 95
      Metric Samples: Last Run: 5, Total: 475
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2020-04-27 18:55:47.000000 UTC
      Last Successful Execution Date : 2020-04-27 18:55:47.000000 UTC

    io
    --
      Instance ID: io [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/io.d/conf.yaml.default
      Total Runs: 95
      Metric Samples: Last Run: 52, Total: 4,904
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2020-04-27 18:55:54.000000 UTC
      Last Successful Execution Date : 2020-04-27 18:55:54.000000 UTC

    kubelet (3.6.0)
    ---------------
      Instance ID: kubelet:d884b5186b651429 [ERROR]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubelet.d/conf.yaml.default
      Total Runs: 95
      Metric Samples: Last Run: 19, Total: 1,805
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 4, Total: 380
      Average Execution Time : 7.356s
      Last Execution Date : 2020-04-27 18:55:53.000000 UTC
      Last Successful Execution Date : Never
      Error: HTTPSConnectionPool(host='169.254.1.1', port=10250): Max retries exceeded with url: /metrics/cadvisor (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 503 Service Unavailable')))
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 662, in urlopen
          self._prepare_proxy(conn)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 948, in _prepare_proxy
          conn.connect()
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 308, in connect
          self._tunnel()
        File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 898, in _tunnel
          raise OSError("Tunnel connection failed: %d %s" % (code,
      OSError: Tunnel connection failed: 503 Service Unavailable

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
          resp = conn.urlopen(
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 719, in urlopen
          retries = retries.increment(
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/retry.py", line 436, in increment
          raise MaxRetryError(_pool, url, error or ResponseError(cause))
      urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='169.254.1.1', port=10250): Max retries exceeded with url: /metrics/cadvisor (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 503 Service Unavailable')))

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 713, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kubelet/kubelet.py", line 349, in check
          self.process(self.cadvisor_scraper_config, metric_transformers=self.transformers)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 443, in process
          for metric in self.scrape_metrics(scraper_config):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 401, in scrape_metrics
          response = self.poll(scraper_config)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 605, in poll
          response = self.send_request(endpoint, scraper_config, headers)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 631, in send_request
          return http_handler.get(endpoint, stream=True, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 277, in get
          return self._request('get', url, options)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 319, in _request
          return getattr(requests, method)(url, **new_options)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 75, in get
          return request('get', url, params=params, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 60, in request
          return session.request(method=method, url=url, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 533, in request
          resp = self.send(prep, **send_kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 646, in send
          r = adapter.send(request, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 510, in send
          raise ProxyError(e, request=request)
      requests.exceptions.ProxyError: HTTPSConnectionPool(host='169.254.1.1', port=10250): Max retries exceeded with url: /metrics/cadvisor (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 503 Service Unavailable')))

    load
    ----
      Instance ID: load [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/load.d/conf.yaml.default
      Total Runs: 95
      Metric Samples: Last Run: 6, Total: 570
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2020-04-27 18:55:53.000000 UTC
      Last Successful Execution Date : 2020-04-27 18:55:53.000000 UTC

    memory
    ------
      Instance ID: memory [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/memory.d/conf.yaml.default
      Total Runs: 95
      Metric Samples: Last Run: 17, Total: 1,615
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2020-04-27 18:55:45.000000 UTC
      Last Successful Execution Date : 2020-04-27 18:55:45.000000 UTC

    network (1.14.0)
    ----------------
      Instance ID: network:e0204ad63d43c949 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/network.d/conf.yaml.default
      Total Runs: 95
      Metric Samples: Last Run: 37, Total: 3,329
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 7ms
      Last Execution Date : 2020-04-27 18:55:52.000000 UTC
      Last Successful Execution Date : 2020-04-27 18:55:52.000000 UTC

    ntp
    ---
      Instance ID: ntp:d884b5186b651429 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/ntp.d/conf.yaml.default
      Total Runs: 2
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 2
      Average Execution Time : 20.026s
      Last Execution Date : 2020-04-27 18:47:31.000000 UTC
      Last Successful Execution Date : 2020-04-27 18:47:31.000000 UTC

    uptime
    ------
      Instance ID: uptime [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/uptime.d/conf.yaml.default
      Total Runs: 95
      Metric Samples: Last Run: 1, Total: 95
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2020-04-27 18:55:44.000000 UTC
      Last Successful Execution Date : 2020-04-27 18:55:44.000000 UTC

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  Transactions
  ============
    CheckRunsV1: 95
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 12
    Metadata: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Series: 0
    ServiceChecks: 0
    SketchSeries: 0
    Success: 202
    TimeseriesV1: 95

  API Keys status
  ===============
    API key ending with e724a: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - e724a

==========
Logs Agent
==========

  Logs Agent is not running

=========
Aggregator
=========
  Checks Metric Sample: 109,745
  Dogstatsd Metric Sample: 6,786
  Event: 4
  Events Flushed: 4
  Number Of Flushes: 95
  Series Flushed: 54,308
  Service Check: 1,429
  Service Checks Flushed: 1,519

=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 6,785
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Bytes: 435,340
  Udp Packet Reading Errors: 0
  Udp Packets: 6,786
  Uds Bytes: 0
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0

=====================
Datadog Cluster Agent
=====================

  - Datadog Cluster Agent endpoint detected: https://datadog-cluster-agent.ops-test.kube.ch.int
  Successfully connected to the Datadog Cluster Agent.
  - Running: 1.5.2+commit.60ee741

Describe what happened: After I upgraded the Datadog Agent from 6.13.0 to 7.18.1, the kubelet and kubernetes_state checks started failing with Python "503 Service Unavailable" errors. When I revert to 6.13.0, the checks work fine. The disk check is also failing, but that one is broken in 6.13.0 too.

Describe what you expected: The kubelet, disk, and kubernetes_state checks to be OK.

Steps to reproduce the issue: Configure kubelet and kubernetes_state checks on agent 7.18.1.

Additional environment details (Operating System, Cloud provider, etc):

Using the Datadog Agent Docker image, version 7.18.1.

Linux 7b78fcc56df2 5.5.0-1.el7.elrepo.x86_64 #1 SMP Sun Jan 26 20:12:30 EST 2020 x86_64 GNU/Linux

KarthikRangaraju commented 4 years ago

The issue was on our end:

We had to set proxy.no_proxy to exclude the hosts that are not routable through the proxy. After that, it worked.
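
As an illustration of that fix (a sketch, not the reporter's actual configuration), the proxy section of datadog.yaml might look roughly like the following. The proxy URL is a made-up placeholder. The 169.254.1.1 entry is the kubelet address from the error output above. Any other host that cannot be reached through the proxy, such as the kubernetes_state target, would need its own entry.

```yaml
# /etc/datadog-agent/datadog.yaml (sketch; values below are assumptions)
proxy:
  # Placeholder proxy URL, not taken from the report.
  http: http://proxy.example.internal:3128
  https: http://proxy.example.internal:3128
  # Hosts the Agent should contact directly instead of through the proxy.
  no_proxy:
    - 169.254.1.1  # kubelet address seen in the kubelet check error
```

When the Agent runs from the Docker image, the same list can usually be supplied through the DD_PROXY_NO_PROXY environment variable as a space-separated list. Restart the Agent after changing either setting, then re-run agent status to confirm the checks report OK.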