DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Error on kube DNS check #1514

Open epinzur opened 6 years ago

epinzur commented 6 years ago

Output of the info page (if this is a bug)

Getting the status from the agent.

===================
Agent (v6.1.0-rc.2)
===================

  Status date: 2018-03-23 14:40:21.520992 UTC
  Pid: 358
  Python Version: 2.7.14
  Logs:
  Check Runners: 1
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: -0.002220975 s
    System UTC time: 2018-03-23 14:40:21.520992 UTC

  Host Info
  =========
    bootTime: 2018-02-01 19:59:49.000000 UTC
    kernelVersion: 4.14.11-coreos
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: 9.4
    procs: 74
    uptime: 4.224223e+06
    virtualizationRole: guest
    virtualizationSystem: xen

  Hostnames
  =========
    ec2-hostname: ip-10-50-47-76.ec2.internal
    hostname: ip-10-50-47-76.ec2.internal
    instance-id: i-01b9dc4163d5bb571
    socket-fqdn: datadog-agent-pngwq
    socket-hostname: datadog-agent-pngwq

=========
Collector
=========

  Running Checks
  ==============
    cpu
    ---
      Total Runs: 5107
      Metrics: 6, Total Metrics: 30636
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    disk
    ----
      Total Runs: 5107
      Metrics: 170, Total Metrics: over 100K
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    docker
    ------
      Total Runs: 5107
      Metrics: 333, Total Metrics: over 1M
      Events: 0, Total Events: 16
      Service Checks: 1, Total Service Checks: 5107

    file_handle
    -----------
      Total Runs: 5107
      Metrics: 1, Total Metrics: 5107
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    io
    --
      Total Runs: 5107
      Metrics: 130, Total Metrics: over 100K
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    kube_dns
    --------
      Total Runs: 5107
      Metrics: 0, Total Metrics: 0
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
      Error: HTTPConnectionPool(host='10.101.56.17', port=10055): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f94d5d9dd50>: Failed to establish a new connection: [Errno 111] Connection refused',))
      Traceback (most recent call last):
        File "/opt/datadog-agent/bin/agent/dist/checks/__init__.py", line 332, in run
          self.check(copy.deepcopy(self.instances[0]))
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kube_dns/kube_dns.py", line 47, in check
          self.process(endpoint, send_histograms_buckets=send_buckets, instance=instance)
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py", line 350, in process
          for metric in self.scrape_metrics(endpoint):
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py", line 314, in scrape_metrics
          response = self.poll(endpoint)
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py", line 467, in poll
          response = requests.get(endpoint, headers=headers, stream=True, timeout=1, cert=cert, verify=verify)
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py", line 72, in get
          return request('get', url, params=params, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py", line 58, in request
          return session.request(method=method, url=url, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py", line 508, in request
          resp = self.send(prep, **send_kwargs)
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py", line 618, in send
          r = adapter.send(request, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/adapters.py", line 508, in send
          raise ConnectionError(e, request=request)
      ConnectionError: HTTPConnectionPool(host='10.101.56.17', port=10055): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f94d5d9dd50>: Failed to establish a new connection: [Errno 111] Connection refused',))

    kube_dns
    --------
      Total Runs: 5107
      Metrics: 0, Total Metrics: 0
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
      Error: HTTPConnectionPool(host='10.101.56.13', port=10055): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f94d5d9dc50>: Failed to establish a new connection: [Errno 111] Connection refused',))
      Traceback (most recent call last):
        File "/opt/datadog-agent/bin/agent/dist/checks/__init__.py", line 332, in run
          self.check(copy.deepcopy(self.instances[0]))
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kube_dns/kube_dns.py", line 47, in check
          self.process(endpoint, send_histograms_buckets=send_buckets, instance=instance)
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py", line 350, in process
          for metric in self.scrape_metrics(endpoint):
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py", line 314, in scrape_metrics
          response = self.poll(endpoint)
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py", line 467, in poll
          response = requests.get(endpoint, headers=headers, stream=True, timeout=1, cert=cert, verify=verify)
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py", line 72, in get
          return request('get', url, params=params, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py", line 58, in request
          return session.request(method=method, url=url, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py", line 508, in request
          resp = self.send(prep, **send_kwargs)
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py", line 618, in send
          r = adapter.send(request, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/adapters.py", line 508, in send
          raise ConnectionError(e, request=request)
      ConnectionError: HTTPConnectionPool(host='10.101.56.13', port=10055): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f94d5d9dc50>: Failed to establish a new connection: [Errno 111] Connection refused',))

    kubelet
    -------
      Total Runs: 5107
      Metrics: 60, Total Metrics: over 100K
      Events: 0, Total Events: 0
      Service Checks: 3, Total Service Checks: 15321
      Error: 404 Client Error: Not Found for url: http://ip-10-50-47-76.ec2.internal:10255/metrics/cadvisor
      Traceback (most recent call last):
        File "/opt/datadog-agent/bin/agent/dist/checks/__init__.py", line 332, in run
          self.check(copy.deepcopy(self.instances[0]))
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubelet/kubelet.py", line 128, in check
          self.process(self.metrics_url, send_histograms_buckets=send_buckets, instance=instance)
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py", line 350, in process
          for metric in self.scrape_metrics(endpoint):
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py", line 314, in scrape_metrics
          response = self.poll(endpoint)
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py", line 480, in poll
          response.raise_for_status()
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/models.py", line 935, in raise_for_status
          raise HTTPError(http_error_msg, response=self)
      HTTPError: 404 Client Error: Not Found for url: http://ip-10-50-47-76.ec2.internal:10255/metrics/cadvisor

    kubernetes_apiserver
    --------------------
      Total Runs: 5107
      Metrics: 0, Total Metrics: 0
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    load
    ----
      Total Runs: 5107
      Metrics: 6, Total Metrics: 30642
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    memory
    ------
      Total Runs: 5107
      Metrics: 14, Total Metrics: 71498
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    network
    -------
      Total Runs: 5107
      Metrics: 128, Total Metrics: over 100K
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    ntp
    ---
      Total Runs: 5107
      Metrics: 1, Total Metrics: 4981
      Events: 0, Total Events: 0
      Service Checks: 1, Total Service Checks: 5107

    uptime
    ------
      Total Runs: 5107
      Metrics: 1, Total Metrics: 5107
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  CheckRunsV1: 5107
  IntakeV1: 404
  RetryQueueSize: 0
  Success: 10618
  TimeseriesV1: 5107

  API Keys status
  ===============
    https://6-1-0-app.agent.datadoghq.com,*************************709d5: API Key valid

==========
Logs Agent
==========

  Logs Agent is not running

=========
DogStatsD
=========

  Checks Metric Sample: 4.472034e+06
  Event: 17
  Events Flushed: 17
  Number Of Flushes: 5107
  Series Flushed: 3.873095e+06
  Service Check: 97033
  Service Checks Flushed: 102121
  Dogstatsd Metric Sample: 49800

Describe what happened:

An error is logged when running the kube_dns check. We have about 80 nodes in our cluster, and each pod in the daemonset logs this about 10 times a minute. (The error seems to appear twice on each attempt.)

[ AGENT ] 2018-03-23 14:21:19 UTC | ERROR | (runner.go:276 in work) | Error running check kube_dns: [{"message": "HTTPConnectionPool(host='10.101.56.13', port=10055): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f94d651b410>: Failed to establish a new connection: [Errno 111] Connection refused',))", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/bin/agent/dist/checks/__init__.py\", line 332, in run\n    self.check(copy.deepcopy(self.instances[0]))\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kube_dns/kube_dns.py\", line 47, in check\n    self.process(endpoint, send_histograms_buckets=send_buckets, instance=instance)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 350, in process\n    for metric in self.scrape_metrics(endpoint):\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 314, in scrape_metrics\n    response = self.poll(endpoint)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 467, in poll\n    response = requests.get(endpoint, headers=headers, stream=True, timeout=1, cert=cert, verify=verify)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 72, in get\n    return request('get', url, params=params, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 58, in request\n    return session.request(method=method, url=url, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 508, in request\n    resp = self.send(prep, **send_kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 618, in send\n    r = adapter.send(request, **kwargs)\n  File 
\"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/adapters.py\", line 508, in send\n    raise ConnectionError(e, request=request)\nConnectionError: HTTPConnectionPool(host='10.101.56.13', port=10055): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f94d651b410>: Failed to establish a new connection: [Errno 111] Connection refused',))\n"}]

Describe what you expected:

Check to work without issue.

Steps to reproduce the issue:

Additional environment details (Operating System, Cloud provider, etc): Running Kubernetes 1.6.8 on AWS EC2 CoreOS 1576.5.0 instances, Docker version 17.09.0-ce, build afdb6d4, using Agent v6.1.0-rc.2

The v6.1.0 release candidates are the first Agent 6 builds that run without constantly crashing on our cluster. I did need to disable log collection, though, and I never tried that on any of the 6.0 or 6.beta releases.

CharlyF commented 6 years ago

Hey @rltvty, thanks for opening this issue.

The first thing that comes to mind is that the kube-dns deployment is exposing several ports and that /metrics is not available on 10055.

Now, in our discovery, we pick the highest port value: https://github.com/DataDog/datadog-agent/blob/master/pkg/collector/autodiscovery/configresolver.go#L320-L341 But this can be configured in the template (with %%port_0%%, or whichever port you want the agent to connect to), as documented here: https://docs.datadoghq.com/agent/autodiscovery/#template-variable-indexes The template we use to autodiscover is in the auto_conf folder, but you can supersede it by using annotations on the kube-dns deployment so the agent knows which port to connect to.

Looking at the available manifests, it seems that the highest port is 10055 and that it is the one exposing the metrics. So the above lead might not be correct.
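
For reference, the annotation override mentioned above could look roughly like the sketch below. It only builds the annotation keys and values, so the result can be pasted into the pod template of the kube-dns deployment. The helper name `kube_dns_annotations` is hypothetical, and the `prometheus_endpoint` instance key and the `kubedns` container name are assumptions to check against the kube_dns check's example config and your own manifest.

```python
import json


def kube_dns_annotations(container_name, port):
    """Build Autodiscovery pod annotations for the kube_dns check.

    container_name: name of the container in the kube-dns pod (assumption).
    port: a literal port such as 10055, or a template variable such as
          "%%port_0%%".
    The instance key `prometheus_endpoint` is assumed, not confirmed here.
    """
    prefix = "ad.datadoghq.com/{}".format(container_name)
    instance = {
        "prometheus_endpoint": "http://%%host%%:{}/metrics".format(port),
    }
    return {
        prefix + ".check_names": json.dumps(["kube_dns"]),
        prefix + ".init_configs": json.dumps([{}]),
        prefix + ".instances": json.dumps([instance]),
    }


if __name__ == "__main__":
    # Print the annotations in a form that can be copied into a manifest.
    for key, value in sorted(kube_dns_annotations("kubedns", 10055).items()):
        print("{}: '{}'".format(key, value))
```

Passing `"%%port_0%%"` instead of `10055` keeps the port resolved by the agent rather than hard-coded.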

If you can confirm that there is only one port and that you can curl 10.101.56.17:10055/metrics or 10.101.56.13:10055/metrics from inside the agent's pod, then please send a flare to our support team so we can investigate further.
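
If curl is not available in the agent's pod, the same reachability test can be sketched in Python 3 with the standard library only. The function name `probe_metrics` is hypothetical, and the 1-second default simply mirrors the `timeout=1` visible in the traceback. It separates a refused or unreachable connection (the kube_dns case) from an HTTP error such as the 404 on /metrics/cadvisor.

```python
import socket
import urllib.error
import urllib.request


def probe_metrics(url, timeout=1.0):
    """Fetch url once and return a short description of the outcome."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "ok: HTTP {}".format(resp.status)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (for example 404).
        return "http error: {}".format(exc.code)
    except urllib.error.URLError as exc:
        # No HTTP answer at all: connection refused, no route, DNS failure.
        return "unreachable: {}".format(exc.reason)
    except socket.timeout:
        # The connection opened but the response did not arrive in time.
        return "unreachable: timed out"
```

Calling `probe_metrics("http://10.101.56.17:10055/metrics")` from inside the agent's pod should then report either an HTTP status or the reason the connection failed.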

Secondly, I can see that you are running Kubernetes 1.6.8. Unfortunately, Agent 6 only works with Kubernetes 1.7.6+, as we rely on the /metrics endpoint of the kubelet to collect the Kubernetes metrics. This endpoint was first introduced in 1.7.6.

As a result, the kubelet integration will not work, even if the agent is provided with the kubelet IP via the downward API as an env var.
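
As a small illustration of that version requirement, here is a Python sketch that compares a cluster version against the 1.7.6 minimum stated above. The helper names are hypothetical. It compares numeric tuples rather than strings, so that 1.10.x is correctly treated as newer than 1.7.x.

```python
import re

# Minimum Kubernetes version stated in the comment above.
MIN_KUBERNETES_VERSION = (1, 7, 6)


def parse_version(text):
    """Turn '1.6.8' or 'v1.10.6' into a tuple of ints such as (1, 6, 8)."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("unrecognized version: {!r}".format(text))
    return tuple(int(part) for part in match.groups())


def meets_minimum(version):
    """Return True if version is at least MIN_KUBERNETES_VERSION."""
    return parse_version(version) >= MIN_KUBERNETES_VERSION
```

With the versions mentioned in this thread, `meets_minimum("1.6.8")` is false and `meets_minimum("v1.10.6")` is true.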

On a side note, thank you for testing the rc versions of our agent! If you can share more feedback on the issues you had with the 6.0 or getting the logs that would be fantastic too.

ghost commented 6 years ago

Adding to this, as I am running into the same issue with AKS. After some digging and working with Datadog support: AKS has kube-dns running on port 10053, and annotations don't seem to work for updating the agent's kube_dns config. There is currently no fix.

ellieayla commented 6 years ago

I believe that I have run into the same problem as @rltvty, also on AKS, but with Kubernetes v1.10.6.

$ kubectl get pod -l 'k8s-app=kube-dns' -n kube-system -o custom-columns=port:.spec.containers[*].ports
port
[map[protocol:UDP containerPort:10053 name:dns-local] map[containerPort:10053 name:dns-tcp-local protocol:TCP]],[map[containerPort:53 name:dns protocol:UDP] map[name:dns-tcp protocol:TCP containerPort:53]],[map[containerPort:8080 protocol:TCP]]
[map[containerPort:10053 name:dns-local protocol:UDP] map[containerPort:10053 name:dns-tcp-local protocol:TCP]],[map[containerPort:53 name:dns protocol:UDP] map[name:dns-tcp protocol:TCP containerPort:53]],[map[containerPort:8080 protocol:TCP]]

Azure AKS' stock kube-dns doesn't resemble https://github.com/kelseyhightower/kubernetes-the-hard-way/blob/master/deployments/kube-dns.yaml. Container port 10053 is the highest numbered listening port, and it's handling DNS requests. Container port 8080 is the HTTP /healthz endpoint. I believe there is no /metrics endpoint on this kube-dns.
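
To make the port selection concrete, here is a Python sketch that applies the "highest port" rule described earlier in this thread to the container ports listed above. It assumes that `%%port%%` resolves to the highest port and `%%port_N%%` to the N-th port in ascending order, and that ports are pooled across the pod's containers; the agent's actual resolver may differ. The function names are hypothetical.

```python
# Container ports from the kubectl output above, one list per container.
AKS_KUBE_DNS_PORTS = [
    [10053, 10053],  # dns-local (UDP), dns-tcp-local (TCP)
    [53, 53],        # dns (UDP), dns-tcp (TCP)
    [8080],          # unnamed TCP port (the /healthz endpoint)
]


def sorted_ports(containers):
    """Return the distinct ports of all containers in ascending order."""
    return sorted({port for ports in containers for port in ports})


def resolve_port(containers, index=None):
    """Mimic %%port%% (index=None) or %%port_N%% (index=N). Assumed rules."""
    ports = sorted_ports(containers)
    if not ports:
        raise ValueError("no ports exposed")
    return ports[-1] if index is None else ports[index]


if __name__ == "__main__":
    print(resolve_port(AKS_KUBE_DNS_PORTS))     # the highest port
    print(resolve_port(AKS_KUBE_DNS_PORTS, 1))  # the second-lowest port
```

Under these assumptions the highest port is 10053, which serves DNS rather than HTTP, so a check pointed at it could not scrape /metrics.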

describe-deployment-kube-dns-v20-aks.txt

describe-pod-kube-dns-v20-aks.txt

I wonder whether the Kubernetes server version matters here, compared to the revision of deployment.apps/kube-dns-v20. Or whether Azure has tweaked something different from what datadog-agent expects.