DataDog / docker-dd-agent

Datadog Agent Dockerfile for Trusted Builds.
https://registry.hub.docker.com/u/datadog/docker-dd-agent/
MIT License

[kubernetes] Collector cannot verify tls: CERTIFICATE_VERIFY_FAILED #288

Open poswald opened 6 years ago

poswald commented 6 years ago

When deploying to an IBM Cloud cluster (Kubernetes 1.9), the Datadog collector does not work. This has been reported to Datadog support as ticket #129722.

It is still not clear to me if this is an issue with IBM or with Datadog.

There is a workaround: after disabling TLS verification (kubelet_tls_verify) in /etc/dd-agent/conf.d/kubernetes.yaml it connects fine; however, we cannot run this way in production. The bearer token path /var/run/secrets/kubernetes.io/serviceaccount/token appears to be populated correctly:

root@dd-agent-mbshp:/# ls -al /var/run/secrets/kubernetes.io/serviceaccount/token
lrwxrwxrwx 1 root root 12 Feb 20 03:01 /var/run/secrets/kubernetes.io/serviceaccount/token -> ..data/token
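
For reference, a minimal sketch of the workaround mentioned above in /etc/dd-agent/conf.d/kubernetes.yaml (only kubelet_tls_verify comes from this report; the surrounding layout follows the standard Agent 5 check format and is an assumption):

    init_config:

    instances:
      # Disables verification of the kubelet's TLS certificate.
      # Usable as a stopgap, but not acceptable for production.
      - kubelet_tls_verify: false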

Output of the info page:

# /etc/init.d/datadog-agent info
2018-02-20 04:02:23,039 | DEBUG | dd.collector | utils.service_discovery.config(config.py:31) | No configuration backend provided for service discovery. Only auto config templates will be used.
2018-02-20 04:02:23,345 | DEBUG | dd.collector | utils.cloud_metadata(cloud_metadata.py:77) | Collecting GCE Metadata failed HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /computeMetadata/v1/?recursive=true (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x2b39fa0afa50>, 'Connection to 169.254.169.254 timed out. (connect timeout=0.3)'))
2018-02-20 04:02:23,348 | DEBUG | dd.collector | docker.auth.auth(auth.py:227) | Trying paths: ['/root/.docker/config.json', '/root/.dockercfg']
2018-02-20 04:02:23,348 | DEBUG | dd.collector | docker.auth.auth(auth.py:234) | No config file found
====================
Collector (v 5.22.0)
====================

  Status date: 2018-02-20 04:02:21 (2s ago)
  Pid: 38
  Platform: Linux-4.4.0-109-generic-x86_64-with-debian-8.10
  Python Version: 2.7.14, 64bit
  Logs: <stderr>, /var/log/datadog/collector.log

  Clocks
  ======

    NTP offset: -0.0047 s
    System UTC time: 2018-02-20 04:02:23.861592

  Paths
  =====

    conf.d: /etc/dd-agent/conf.d
    checks.d: Not found

  Hostnames
  =========

    socket-hostname: dd-agent-mbshp
    hostname: kube-tok02-crbf29e27a18ff4db58ff3873f3c748f61-w1.cloud.ibm
    socket-fqdn: dd-agent-mbshp

  Checks
  ======

    ntp (1.0.0)
    -----------
      - Collected 0 metrics, 0 events & 0 service checks

    disk (1.1.0)
    ------------
      - instance #0 [OK]
      - Collected 52 metrics, 0 events & 0 service checks

    network (1.4.0)
    ---------------
      - instance #0 [OK]
      - Collected 50 metrics, 0 events & 0 service checks

    docker_daemon (1.8.0)
    ---------------------
      - instance #0 [OK]
      - Collected 216 metrics, 0 events & 1 service check

    kubernetes (1.5.0)
    ------------------
      - initialize check class [ERROR]: Exception('Unable to initialize Kubelet client. Try setting the host parameter. The Kubernetes check failed permanently.',)

  Emitters
  ========

    - http_emitter [OK]

2018-02-20 04:02:26,458 | DEBUG | dd.dogstatsd | utils.service_discovery.config(config.py:31) | No configuration backend provided for service discovery. Only auto config templates will be used.
====================
Dogstatsd (v 5.22.0)
====================

  Status date: 2018-02-20 04:02:20 (6s ago)
  Pid: 27
  Platform: Linux-4.4.0-109-generic-x86_64-with-debian-8.10
  Python Version: 2.7.14, 64bit
  Logs: <stderr>, /var/log/datadog/dogstatsd.log

  Flush count: 358
  Packet Count: 1980
  Packets per second: 2.0
  Metric count: 20
  Event count: 0
  Service check count: 0

====================
Forwarder (v 5.22.0)
====================

  Status date: 2018-02-20 04:02:30 (0s ago)
  Pid: 20
  Platform: Linux-4.4.0-109-generic-x86_64-with-debian-8.10
  Python Version: 2.7.14, 64bit
  Logs: <stderr>, /var/log/datadog/forwarder.log

  Queue Size: 611 bytes
  Queue Length: 1
  Flush Count: 1110
  Transactions received: 883
  Transactions flushed: 882
  Transactions rejected: 0
  API Key Status: API Key is valid

======================
Trace Agent (v 5.22.0)
======================

  Pid: 18
  Uptime: 3642 seconds
  Mem alloc: 958344 bytes

  Hostname: dd-agent-mbshp
  Receiver: 0.0.0.0:8126
  API Endpoint: https://trace.agent.datadoghq.com

  --- Receiver stats (1 min) ---

  --- Writer stats (1 min) ---

  Traces: 0 payloads, 0 traces, 0 bytes
  Stats: 0 payloads, 0 stats buckets, 0 bytes
  Services: 0 payloads, 0 services, 0 bytes

Output of the collector log:

root@dd-agent-mbshp:/# head -100 /var/log/datadog/collector.log
2018-02-20 03:02:15 UTC | DEBUG | dd.collector | utils.service_discovery.config(config.py:31) | No configuration backend provided for service discovery. Only auto config templates will be used.
2018-02-20 03:02:15 UTC | DEBUG | dd.collector | utils.cloud_metadata(cloud_metadata.py:77) | Collecting GCE Metadata failed HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /computeMetadata/v1/?recursive=true (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x2ab4b4a3aad0>, 'Connection to 169.254.169.254 timed out. (connect timeout=0.3)'))
2018-02-20 03:02:15 UTC | DEBUG | dd.collector | docker.auth.auth(auth.py:227) | Trying paths: ['/root/.docker/config.json', '/root/.dockercfg']
2018-02-20 03:02:15 UTC | DEBUG | dd.collector | docker.auth.auth(auth.py:234) | No config file found
2018-02-20 03:02:15 UTC | INFO | dd.collector | utils.pidfile(pidfile.py:35) | Pid file is: /opt/datadog-agent/run/dd-agent.pid
2018-02-20 03:02:15 UTC | INFO | dd.collector | collector(agent.py:559) | Agent version 5.22.0
2018-02-20 03:02:15 UTC | INFO | dd.collector | daemon(daemon.py:234) | Starting
2018-02-20 03:02:15 UTC | DEBUG | dd.collector | checks.check_status(check_status.py:163) | Persisting status to /opt/datadog-agent/run/CollectorStatus.pickle
2018-02-20 03:02:15 UTC | DEBUG | dd.collector | utils.service_discovery.config(config.py:31) | No configuration backend provided for service discovery. Only auto config templates will be used.
2018-02-20 03:02:15 UTC | DEBUG | dd.collector | utils.subprocess_output(subprocess_output.py:54) | Popen(['grep', 'model name', '/host/proc/cpuinfo'], stderr = <open file '<fdopen>', mode 'w+b' at 0x2ab4b4a09780>, stdout = <open file '<fdopen>', mode 'w+b' at 0x2ab4b4a096f0>) called
2018-02-20 03:02:16 UTC | DEBUG | dd.collector | collector(kubeutil.py:264) | Couldn't query kubelet over HTTP, assuming it's not in no_auth mode.
2018-02-20 03:02:16 UTC | WARNING | dd.collector | collector(kubeutil.py:273) | Couldn't query kubelet over HTTP, assuming it's not in no_auth mode.
2018-02-20 03:02:16 UTC | ERROR | dd.collector | collector(kubeutil.py:209) | Failed to initialize kubelet connection. Will retry 0 time(s). Error: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:661)
2018-02-20 03:02:16 UTC | INFO | dd.collector | config(config.py:998) | no bundled checks.d path (checks provided as wheels): /opt/datadog-agent/agent/checks.d
2018-02-20 03:02:16 UTC | DEBUG | dd.collector | config(config.py:1012) | No sdk integrations path found
2018-02-20 03:02:16 UTC | ERROR | dd.collector | config(config.py:1076) | Unable to initialize check kubernetes
Traceback (most recent call last):
  File "/opt/datadog-agent/agent/config.py", line 1060, in _initialize_check
    agentConfig=agentConfig, instances=instances)
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubernetes/kubernetes.py", line 106, in __init__
    raise Exception('Unable to initialize Kubelet client. Try setting the host parameter. The Kubernetes check failed permanently.')
Exception: Unable to initialize Kubelet client. Try setting the host parameter. The Kubernetes check failed permanently.
2018-02-20 03:02:16 UTC | DEBUG | dd.collector | config(config.py:1177) | Loaded kubernetes
2018-02-20 03:02:16 UTC | DEBUG | dd.collector | config(config.py:1177) | Loaded docker_daemon
2018-02-20 03:02:16 UTC | DEBUG | dd.collector | config(config.py:1177) | Loaded ntp
2018-02-20 03:02:16 UTC | DEBUG | dd.collector | config(config.py:1177) | Loaded disk
2018-02-20 03:02:16 UTC | DEBUG | dd.collector | config(config.py:1177) | Loaded network
2018-02-20 03:02:16 UTC | DEBUG | dd.collector | config(config.py:1177) | Loaded agent_metrics
2018-02-20 03:02:16 UTC | INFO | dd.collector | config(config.py:973) | Fetching service discovery check configurations.
2018-02-20 03:02:16 UTC | ERROR | dd.collector | utils.service_discovery.sd_docker_backend(sd_docker_backend.py:123) | kubelet client not initialized, cannot retrieve pod list.

Steps to reproduce the issue:

  1. Create a new cluster
  2. Deploy the dd-agent.yml as documented

Describe the results you received:

The Infrastructure host info in the Datadog console reports:

Datadog's kubernetes integration is reporting:
Instance #initialization[ERROR]:Exception('Unable to initialize Kubelet client. Try setting the host parameter. The Kubernetes check failed permanently.',)

antoinepouille commented 6 years ago

Hi,

Thanks for reaching out about this issue! It seems we are already investigating this in one of our support tickets, so we will continue the investigation there and get back to you through that ticket.

Regards

poswald commented 6 years ago

Based on your documentation I'm trying to configure the certificates.

This error message is quite misleading because this seems to be the real error:

2018-02-20 03:02:16 UTC | ERROR | dd.collector | collector(kubeutil.py:209) | Failed to initialize kubelet connection. Will retry 0 time(s). Error: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:661)

But that's not the one you end up showing to the user.

Doing something like this from the agent's pod gives a warning about self-signed certificates...

root@dd-agent-5rp7j:/# curl -v  --cacert /host/etc/kubernetes/cert/kubelet.pem     https://10.132.135.141:10250/healthz

I think that's basically the same as the error message being reported by the python/requests library.

I have no idea what pem file I could pass to it to make it happy.

antoinepouille commented 6 years ago

This may be caused by the way Bluemix signs certificates: they may not be signed in a way that allows access to the kubelet from the agent's pod. We will have to reproduce the setup and check the validity of the certificates there. Please bear with us while we do so. If we can confirm this, we will then contact them.

We will continue to update you about this in the support ticket we have been using.

irabinovitch commented 6 years ago

We've shared this issue with the IBM Cloud team. Hopefully they'll chime in soon as well.

xvello commented 6 years ago

If running inside a pod, the kubelet's certificate is validated against the cluster root CA, mounted in /var/run/secrets/kubernetes.io/serviceaccount/ca.crt. Could you try to validate this with curl?

irabinovitch commented 6 years ago

@poswald Did @xvello 's suggestion work?

JulienBalestra commented 6 years ago

Thanks for your patience while we look into this!

The problem can come from several sources:

The Pod's /run/secrets/kubernetes.io/serviceaccount/ca.crt is provided by the hyperkube controller-manager's --root-ca-file flag.

It's not mandatory for the kubelet to use certificates from the same PKI as the controller-manager.

The kubelet can be started with different configurations:

1. TLS: By setting the kubelet's appropriate flags --tls-cert-file and --tls-private-key-file to certs issued from any PKI (see the sketch after the note below). If that PKI is the same as the --root-ca-file one, everything is fine. Otherwise you must provide an additional certificate authority to establish an HTTPS connection with the kubelet.

2. Self-signed: If neither --tls-cert-file nor --tls-private-key-file is set, the kubelet generates self-signed certificates in --cert-dir (/var/lib/kubelet/pki by default).

As a note, there is also the Kubernetes CSR flow: set the kubelet's --bootstrap-kubeconfig flag to request a client certificate from the API server and store it in --cert-dir. The kubelet submits a CSR and, once it is approved, the certificate is issued by the controller-manager. The controller-manager uses the --cluster-signing-cert-file and --cluster-signing-key-file flags, which can be different from --root-ca-file.

This is only a note because it covers the communication between the kubelet client and the API server, not between the kubelet's server API (e.g. /pods) and client pods.
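
A sketch of case 1, expressed as the KubeletConfiguration-file equivalent of the --tls-cert-file / --tls-private-key-file flags (the file paths here are placeholders, not taken from this cluster):

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    # Equivalent of --tls-cert-file / --tls-private-key-file.
    # If both are omitted, the kubelet falls back to a self-signed
    # certificate generated under --cert-dir (case 2 above).
    tlsCertFile: /etc/kubernetes/pki/kubelet-server.crt
    tlsPrivateKeyFile: /etc/kubernetes/pki/kubelet-server.key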


In order to check whether your configuration matches one of those two cases, you can run the following (replacing ${status.nodeIP} with the node's IP):

curl -v --cacert /run/secrets/kubernetes.io/serviceaccount/ca.crt \
  https://${status.nodeIP}:10250 \
  -H "Authorization: Bearer $(< /run/secrets/kubernetes.io/serviceaccount/token)"

If this doesn't work, then most probably the kubelet configuration does not allow access from the agent pod.

Aside from these configuration matters, during our investigation we stumbled upon an issue with Python requests version 2.11.1:

import requests

verify = "/path/to/issuing_ca"  # CA bundle used to verify the kubelet's certificate
r = requests.get("https://192.168.1.1:10250/pods", verify=verify, cert=None)

will produce the following stack trace:

...
requests.exceptions.SSLError: HTTPSConnectionPool(host='192.168.1.1', port=10250): Max retries exceeded with url: /pods (Caused by SSLError(CertificateError("hostname '192.168.1.1' doesn't match either of 'e2e', '192.168.254.1', '127.0.0.1', '192.168.1.1'",),))

(With the following details in the CA: CN=e2e and X509v3 Subject Alternative Name: DNS:e2e)

Upgrading the library solves this issue, so we will at least work to provide a fix for this. This could actually be part of the issue you are encountering.

We will keep looking into this and keep you updated.

antoinepouille commented 6 years ago

After investigating the different certificates provided by Bluemix clusters, it seems the certificate you need in order to reach the kubelet is located on the node at /var/lib/kubelet/pki/kubelet.crt. It is a self-signed certificate (as in case 2 of the previous message), so you will have to mount it into the agent to authenticate with the kubelet. You can check that it is the correct one by running, on the host, curl https://127.0.0.1:10250/healthz --cacert /var/lib/kubelet/pki/kubelet.crt: you should get an "unauthorized" response, which shows the certificate is accepted.

However, testing this with the agent fails because of the already mentioned issue with the requests library bundled with the agent. The issue lies in how it handles the Subject Alternative Name that has to be used here. We are currently working on updating this library in the agent to fix this. Hopefully, that will finally allow this setup to work. Don't hesitate to ask if you have further questions regarding this matter!
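
For the mount itself, something along these lines in the dd-agent DaemonSet spec should work (a minimal sketch: only the host path /var/lib/kubelet/pki/kubelet.crt comes from the message above; the volume name and mount path are arbitrary):

    # In the agent container spec:
    volumeMounts:
      - name: kubelet-cert
        mountPath: /etc/kubelet-cert
        readOnly: true
    # At the pod spec level:
    volumes:
      - name: kubelet-cert
        hostPath:
          path: /var/lib/kubelet/pki/kubelet.crt

The check's CA option would then be pointed at the mounted file inside the container.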

poswald commented 6 years ago

I'll be honest, I haven't tried it; I had basically given up because I didn't have enough visibility into how the IBM servers are set up to figure it out. Thank you for following up on this.

I'll give this another shot.

I was looking around and realized there is a Helm chart, so perhaps I'll give that a go as well, although I suspect I'll have to use my own hand-crafted YAML file to get the host mounts. You might want to make sure the chart knows how to mount the certs into the agent pod as well.

I'll keep an eye out for a release that closes this issue.

antoinepouille commented 6 years ago

@poswald Indeed, even if you move to the Helm chart, you would need to manually specify the mount for the kubelet certificate. This should be pretty standard configuration, so please ask if you need help with that.

Actually, given that requests issue, I would advise you to wait for the next release of the agent, which will update the library, before trying again, as it is bound to fail until then.

irabinovitch commented 6 years ago

@antoinepouille Were you referring to 6.1?

JulienBalestra commented 6 years ago

@irabinovitch The datadog/agent:6.1.0 release solved the SSL issue with the upgrade of python requests (2.18.4).

antoinepouille commented 6 years ago

@poswald Agent 6.1.0 is now out. Could you test out the kubelet check with the new details we provided you? Please tell us if you still encounter issues then.

mhulscher commented 6 years ago

I have the same issue. I just deployed agent 6.1.0, but the check breaks because it cannot verify the CA of the kubelet's secure-port certificate. This makes sense, because the kubelet's server certificate is self-signed. Is there an option (envvar) I can set to ignore CA verification?

[ AGENT ] 2018-03-27 12:34:15 UTC | ERROR | (autoconfig.go:446 in collect) | Unable to collect configurations from provider Kubernetes pod annotation: temporary failure in kubeutil, will retry later: cannot connect: https: "Get https://10.77.0.54:10250/pods: x509: cannot validate certificate for 10.77.0.54 because it doesn't contain any IP SANs", http: "Get http://10.77.0.54:10255/pods: dial tcp 10.77.0.54:10255: getsockopt: connection refused"
antoinepouille commented 6 years ago

@mhulscher Are you using IBM Cloud? The certificate setup differs depending on your Kubernetes distribution. You should be able to disable verification using that option, but we are working on a bug related to it, so it may not work until a future version of the agent. If you still have issues with that, please open a case with our support, where we will be able to help you using your logs and config.
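
A minimal sketch of that option set as an environment variable on the agent container (assuming it is exposed as the Agent 6 DD_KUBELET_TLS_VERIFY setting):

    env:
      # Skip verification of the kubelet's TLS certificate
      # (not recommended for production).
      - name: DD_KUBELET_TLS_VERIFY
        value: "false"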

Cheers

antoinepouille commented 6 years ago

@poswald Closing the issue since it should now be all set. Don't hesitate to reach back if you need more help!

msvechla commented 6 years ago

FYI: this is still an issue; however, we are in contact with IBM Cloud container specialists to resolve it. A fix will be deployed this week to enable webhook authentication to the kubelet.

JulienBalestra commented 6 years ago

@msvechla thanks for the update, let us know how it goes 👍

irabinovitch commented 6 years ago

@msvechla Do you have a ticket # or other details we could follow up with IBM on?

kerberos5 commented 4 years ago

Hi, I have the same problem. I resolved the certificate issue by mounting the certificates into the agent pod, via values.yaml (https://github.com/helm/charts/tree/master/stable/datadog):

    # agents.volumes -- Specify additional volumes to mount in the dd-agent container
    volumes:
      - hostPath:
          path:
        name:
      - hostPath:
          path: /home/docker/.minikube
        name: cert

    # agents.volumeMounts -- Specify additional volumes to mount in the dd-agent container
    volumeMounts:
      - name:
        mountPath:
        readOnly: true
      - name: cert
        mountPath: /opt/datadog-agent/cert
        readOnly: true

I also added this env:

    # agents.containers.agent.env -- Additional environment variables for the agent container
    env:

Finally, I recreated the minikube cluster listening on port 6443.

Now I receive this error in the agent log:

kube_controller_manager (1.7.0)
-------------------------------

  Instance ID: kube_controller_manager:1476f03dc31e9882 [ERROR]
  Configuration Source: file:/etc/datadog-agent/conf.d/kube_controller_manager.d/auto_conf.yaml
  Total Runs: 378
  Metric Samples: Last Run: 0, Total: 0
  Events: Last Run: 0, Total: 0
  Service Checks: Last Run: 1, Total: 378
  Average Execution Time : 8ms
  Last Execution Date : 2020-10-27 13:34:23.000000 UTC
  Last Successful Execution Date : Never
  Error: HTTPConnectionPool(host='192.168.71.129', port=10252): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f902bacf220>: Failed to establish a new connection: [Errno 111] Connection refused'))
  Traceback (most recent call last):
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 159, in _new_conn
      conn = connection.create_connection(
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py", line 84, in create_connection
      raise err
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py", line 74, in create_connection
      sock.connect(sa)
  ConnectionRefusedError: [Errno 111] Connection refused

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
      httplib_response = self._make_request(
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 392, in _make_request
      conn.request(method, url, **httplib_request_kw)
    File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1255, in request
      self._send_request(method, url, body, headers, encode_chunked)
    File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1301, in _send_request
      self.endheaders(body, encode_chunked=encode_chunked)
    File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1250, in endheaders
      self._send_output(message_body, encode_chunked=encode_chunked)
    File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1010, in _send_output
      self.send(msg)
    File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 950, in send
      self.connect()
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 187, in connect
      conn = self._new_conn()
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 171, in _new_conn
      raise NewConnectionError(
  urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f902bacf220>: Failed to establish a new connection: [Errno 111] Connection refused

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
      resp = conn.urlopen(
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen
      retries = retries.increment(
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/retry.py", line 439, in increment
      raise MaxRetryError(_pool, url, error or ResponseError(cause))
  urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='192.168.71.129', port=10252): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f902bacf220>: Failed to establish a new connection: [Errno 111] Connection refused'))

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 828, in run
      self.check(instance)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kube_controller_manager/kube_controller_manager.py", line 148, in check
      self.process(scraper_config, metric_transformers=transformers)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 507, in process
      for metric in self.scrape_metrics(scraper_config):
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 447, in scrape_metrics
      response = self.poll(scraper_config)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 713, in poll
      response = self.send_request(endpoint, scraper_config, headers)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 739, in send_request
      return http_handler.get(endpoint, stream=True, **kwargs)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 283, in get
      return self._request('get', url, options)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 332, in _request
      return getattr(requests, method)(url, **new_options)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 75, in get
      return request('get', url, params=params, **kwargs)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 60, in request
      return session.request(method=method, url=url, **kwargs)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 533, in request
      resp = self.send(prep, **send_kwargs)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 646, in send
      r = adapter.send(request, **kwargs)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 516, in send
      raise ConnectionError(e, request=request)
  requests.exceptions.ConnectionError: HTTPConnectionPool(host='192.168.71.129', port=10252): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f902bacf220>: Failed to establish a new connection: [Errno 111] Connection refused'))

kube_scheduler (1.5.0)
----------------------
  Instance ID: kube_scheduler:f948e5430c3c100b [ERROR]
  Configuration Source: file:/etc/datadog-agent/conf.d/kube_scheduler.d/auto_conf.yaml
  Total Runs: 378
  Metric Samples: Last Run: 0, Total: 0
  Events: Last Run: 0, Total: 0
  Service Checks: Last Run: 1, Total: 378
  Average Execution Time : 7ms
  Last Execution Date : 2020-10-27 13:34:30.000000 UTC
  Last Successful Execution Date : Never
  Error: HTTPConnectionPool(host='192.168.71.129', port=10251): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f902bacfe80>: Failed to establish a new connection: [Errno 111] Connection refused'))
  Traceback (most recent call last):
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 159, in _new_conn
      conn = connection.create_connection(
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py", line 84, in create_connection
      raise err
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py", line 74, in create_connection
      sock.connect(sa)
  ConnectionRefusedError: [Errno 111] Connection refused

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
      httplib_response = self._make_request(
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 392, in _make_request
      conn.request(method, url, **httplib_request_kw)
    File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1255, in request
      self._send_request(method, url, body, headers, encode_chunked)
    File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1301, in _send_request
      self.endheaders(body, encode_chunked=encode_chunked)
    File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1250, in endheaders
      self._send_output(message_body, encode_chunked=encode_chunked)
    File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1010, in _send_output
      self.send(msg)
    File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 950, in send
      self.connect()
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 187, in connect
      conn = self._new_conn()
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 171, in _new_conn
      raise NewConnectionError(
  urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f902bacfe80>: Failed to establish a new connection: [Errno 111] Connection refused

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
      resp = conn.urlopen(
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen
      retries = retries.increment(
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/retry.py", line 439, in increment
      raise MaxRetryError(_pool, url, error or ResponseError(cause))
  urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='192.168.71.129', port=10251): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f902bacfe80>: Failed to establish a new connection: [Errno 111] Connection refused'))

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 828, in run
      self.check(instance)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kube_scheduler/kube_scheduler.py", line 139, in check
      self.process(scraper_config, metric_transformers=transformers)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 507, in process
      for metric in self.scrape_metrics(scraper_config):
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 447, in scrape_metrics
      response = self.poll(scraper_config)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 713, in poll
      response = self.send_request(endpoint, scraper_config, headers)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 739, in send_request
      return http_handler.get(endpoint, stream=True, **kwargs)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 283, in get
      return self._request('get', url, options)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 332, in _request
      return getattr(requests, method)(url, **new_options)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 75, in get
      return request('get', url, params=params, **kwargs)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 60, in request
      return session.request(method=method, url=url, **kwargs)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 533, in request
      resp = self.send(prep, **send_kwargs)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 646, in send
      r = adapter.send(request, **kwargs)
    File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 516, in send
      raise ConnectionError(e, request=request)
  requests.exceptions.ConnectionError: HTTPConnectionPool(host='192.168.71.129', port=10251): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f902bacfe80>: Failed to establish a new connection: [Errno 111] Connection refused'))

Is it normal that it tries to connect on ports other than 10250?

omrishilton commented 3 years ago

@kerberos5 I am also receiving exactly the same errors. Did you end up solving this issue?