Open poswald opened 6 years ago
Hi,
Thanks for reaching out about this issue! It seems we are already investigating this in one of our support tickets, so we will resume the investigation and get back to you through this ticket.
Regards
Based on your documentation I'm trying to configure the certificates.
This error message is quite misleading because this seems to be the real error:
2018-02-20 03:02:16 UTC | ERROR | dd.collector | collector(kubeutil.py:209) | Failed to initialize kubelet connection. Will retry 0 time(s). Error: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:661)
But that's not the one you end up showing to the user.
Doing something like this from the agent's pod gives a warning about self-signed certificates...
root@dd-agent-5rp7j:/# curl -v --cacert /host/etc/kubernetes/cert/kubelet.pem https://10.132.135.141:10250/healthz
I think that's basically the same as the error message being reported by the python/requests library.
I have no idea what pem file I could pass to it to make it happy.
This may be caused by the way Bluemix signs certificates. They may not be signed in a way that allows access to the kubelet from the agent's pod. We will have to reproduce the setup and check the validity of the certificates there. Please bear with us while we do so. If we can confirm this, we will then contact them.
We will continue to update you about this in the support ticket we have been using.
We've shared this issue with the IBM Cloud team. Hopefully they'll chime in soon as well.
If running inside a pod, the kubelet's certificate is validated against the cluster root CA, mounted in /var/run/secrets/kubernetes.io/serviceaccount/ca.crt. Could you try to validate this with curl?
@poswald Did @xvello 's suggestion work?
Thanks for your patience while we are looking into this!
The problem can come from different sources:
The Pod /run/secrets/kubernetes.io/serviceaccount/ca.crt is provided by the hyperkube controller-manager --root-ca-file flag.
It's not mandatory for the kubelet to use certificates from the same PKI as the controller-manager.
The kubelet can be started with different configurations:
1. TLS:
By setting the kubelet's appropriate flags --tls-cert-file and --tls-private-key-file to certs issued from various PKI.
If the PKI is the same as the --root-ca-file everything is fine.
Otherwise you must provide an additional certificate authority to establish an HTTPS connection with the kubelet.
2. Self-signed:
If neither --tls-cert-file nor --tls-private-key-file is set, the kubelet will generate self-signed certificates into the --cert-dir: /var/lib/kubelet/pki by default.
As a note: Kubernetes CSR:
Set the kubelet --bootstrap-kubeconfig flag to request a client certificate from the API server and then store it in the --cert-dir.
The kubelet will submit a CSR and, when it's approved, the certificates will be issued by the controller-manager. The controller-manager will use the flags --cluster-signing-cert-file and --cluster-signing-key-file, and they can be different from the --root-ca-file.
This is added as a note because it only manages communication between the kubelet client and the API server, not between the kubelet server API (e.g. /pods) and the pods client.
In order to check if your configuration correctly matches one of those two, you can run this:
curl -v --cacert /run/secrets/kubernetes.io/serviceaccount/ca.crt \
https://${status.nodeIP}:10250 \
-H "Authorization: Bearer $(< /run/secrets/kubernetes.io/serviceaccount/token)"
If this doesn't work, then most probably the kubelet configuration is not correct to allow access from the agent pod.
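The same check can be scripted from inside the agent pod. Below is a minimal Python sketch using only the standard library. The function names `build_kubelet_request` and `kubelet_status`, the `/healthz` default path, and the 5-second timeout are illustrative assumptions, not part of the agent.

```python
import ssl
import urllib.error
import urllib.request

# Service account directory mounted into the pod (same paths as the curl command).
SA_DIR = "/run/secrets/kubernetes.io/serviceaccount"


def build_kubelet_request(node_ip, token, port=10250, path="/healthz"):
    """Build an HTTPS request to the kubelet carrying the bearer token."""
    url = f"https://{node_ip}:{port}{path}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def kubelet_status(node_ip, ca_path=f"{SA_DIR}/ca.crt", token_path=f"{SA_DIR}/token"):
    """Return the HTTP status code answered by the kubelet.

    A certificate that does not verify against ca_path surfaces as
    urllib.error.URLError, which is deliberately not caught here.
    """
    with open(token_path) as f:
        token = f.read().strip()
    # Trust only the given CA file, as curl does with --cacert.
    context = ssl.create_default_context(cafile=ca_path)
    request = build_kubelet_request(node_ip, token)
    try:
        with urllib.request.urlopen(request, context=context, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The TLS handshake succeeded; the kubelet answered with an error status.
        return err.code
```

Calling `kubelet_status(node_ip)` with the node's IP either returns a status code (the certificate was accepted) or raises, which points at the certificate setup rather than at the token.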
Aside from these configuration matters, during our investigation we stumbled on an issue with Python requests version 2.11.1:
import requests

verify = "/path/to/issuing_ca"  # CA bundle used to verify the kubelet's certificate
r = requests.get("https://192.168.1.1:10250/pods", verify=verify, cert=None)
will produce the following stack trace:
...
requests.exceptions.SSLError: HTTPSConnectionPool(host='192.168.1.1', port=10250): Max retries exceeded with url: /pods (Caused by SSLError(CertificateError("hostname '192.168.1.1' doesn't match either of 'e2e', '192.168.254.1', '127.0.0.1', '192.168.1.1'",),))
(With the following details in the CA: CN=e2e and X509v3 Subject Alternative Name: DNS:e2e)
Upgrading the library solves this issue, so we will at least work to provide a fix for this. This could actually be part of the issue you are encountering.
We will keep looking into this and keep you updated.
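What makes the failure above surprising is that the rejected hostname, 192.168.1.1, appears in the list of names the certificate is valid for. The sketch below shows how a hostname is normally compared against Subject Alternative Name entries, using the `subjectAltName` tuple format returned by Python's `ssl.SSLSocket.getpeercert()`. `host_matches_san` is a hypothetical helper, it ignores wildcards, and it is not the code used by requests. The split of the example entries into DNS and IP types is also an assumption.

```python
import ipaddress


def host_matches_san(host, subject_alt_names):
    """Return True if host matches a DNS or IP entry of the given SAN list.

    subject_alt_names is a sequence of (kind, value) tuples, for example
    (("DNS", "e2e"), ("IP Address", "192.168.1.1")).
    """
    try:
        host_ip = ipaddress.ip_address(host)
    except ValueError:
        host_ip = None  # not an IP literal, so treat it as a DNS name

    for kind, value in subject_alt_names:
        if host_ip is not None and kind == "IP Address":
            # IP hosts are compared against IP entries only.
            if ipaddress.ip_address(value) == host_ip:
                return True
        elif host_ip is None and kind == "DNS":
            # DNS names are compared case-insensitively (no wildcard support here).
            if value.lower() == host.lower():
                return True
    return False


# Names taken from the error message above; entry types are assumed.
example_san = (
    ("DNS", "e2e"),
    ("IP Address", "192.168.254.1"),
    ("IP Address", "127.0.0.1"),
    ("IP Address", "192.168.1.1"),
)
print(host_matches_san("192.168.1.1", example_san))
```

With entries typed this way, an IP host is expected to match, which is consistent with the upgraded library accepting the certificate.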
After investigating the different certificates provided by Bluemix clusters, it seems like the certificate you need to access the kubelet is located on the node in /var/lib/kubelet/pki/kubelet.crt. It is a self-signed certificate (as in case 2 of the previous message), so you will have to mount it in the agent to authenticate with the kubelet.
You can check that it's the correct one by running on the host curl https://127.0.0.1:10250/healthz --cacert /var/lib/kubelet/pki/kubelet.crt: you should get an "unauthorized", which shows the certificate is accepted.
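An "unauthorized" answer means the TLS handshake succeeded, whereas a rejected certificate fails before any HTTP status is returned. A minimal Python sketch of that distinction follows, using only the standard library. The helper names `classify` and `probe_kubelet`, the verdict strings, and the timeout are assumptions made for illustration.

```python
import ssl
import urllib.error
import urllib.request


def classify(status=None, error=None):
    """Map an HTTP status or a connection error to a short verdict."""
    if isinstance(error, ssl.SSLCertVerificationError):
        return "certificate rejected"
    if error is not None:
        return "connection failed"
    if status in (401, 403):
        return "certificate accepted, request unauthorized"
    return "certificate accepted"


def probe_kubelet(cafile, url="https://127.0.0.1:10250/healthz"):
    """Query the kubelet, trusting only cafile, and classify the outcome."""
    context = ssl.create_default_context(cafile=cafile)
    try:
        with urllib.request.urlopen(url, context=context, timeout=5) as response:
            return classify(status=response.status)
    except urllib.error.HTTPError as err:
        # An HTTP error status still implies a completed TLS handshake.
        return classify(status=err.code)
    except urllib.error.URLError as err:
        # urllib wraps handshake and socket errors; the cause is in err.reason.
        return classify(error=err.reason)
```

On the node, `probe_kubelet("/var/lib/kubelet/pki/kubelet.crt")` would be the equivalent of the curl command above.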
However, testing this with the agent fails because of the already mentioned issue with the requests library bundled with the agent. The issue lies in an error handling the Subject Alternative Name that has to be used here. We are currently working on updating this library in the agent to fix this issue. Hopefully, this will finally allow this setup to work. Don't hesitate to ask if you have further questions regarding this matter!
I'll be honest, I haven't tried as I had basically given up on it as I just didn't have enough visibility into how the IBM servers were set up to figure it out. Thank you for following up on this.
I'll give this another shot.
I was looking and I realized that there is a helm chart so perhaps I'll give that a go as well, although I suspect I'll have to use my own hand-created yaml file to get the host mounts. You might want to look into making sure that thing knows how to mount the certs into the agent pod as well.
I'll keep my eyes peeled for a release that closes this issue.
@poswald Indeed, even if you move to the helm chart, you would need to manually specify the mount for the kubelet certificate. This should be pretty standard configuration, so please ask if you need help with that.
Actually, given that requests issue, I would advise you to wait for the next release of the agent, which will update the library, before trying again, as it is bound to fail until then.
@antoinepouille Were you referring to 6.1?
@irabinovitch The datadog/agent:6.1.0 release solved the SSL issue with the upgrade of python requests (2.18.4).
@poswald Agent 6.1.0 is now out. Could you test out the kubelet check with the new details we provided you? Please tell us if you still encounter issues then.
I have the same issue. I just deployed agent 6.1.0 but the check breaks on not being able to verify the secure port's CA cert. This makes sense because the kubelet's server certificate is self-signed. Is there an option (envvar) I can set to ignore CA verification?
[ AGENT ] 2018-03-27 12:34:15 UTC | ERROR | (autoconfig.go:446 in collect) | Unable to collect configurations from provider Kubernetes pod annotation: temporary failure in kubeutil, will retry later: cannot connect: https: "Get https://10.77.0.54:10250/pods: x509: cannot validate certificate for 10.77.0.54 because it doesn't contain any IP SANs", http: "Get http://10.77.0.54:10255/pods: dial tcp 10.77.0.54:10255: getsockopt: connection refused"
@mhulscher Are you using IBM Cloud? The certificate setup differs depending on your Kubernetes distribution. You should be able to disable it using this option, but we are working on a bug related to that option, so it may not work until a later version of the agent. If you still have issues with that, please open a case with our support, where we will be able to help you using your logs and config.
Cheers
@poswald Closing the issue since it should now be all set. Don't hesitate to reach out again if you need more help!
fyi: This is still an issue, however we are in contact with IBM Cloud Container specialists to resolve this. A fix will be deployed this week to enable webhook authentication to kubelet.
@msvechla thanks for the update, let us know how it goes 👍
@msvechla Do you have a ticket # or other details we could follow up with IBM on?
Hi, I had the same problem. I solved the certificate issue by mounting the certificates into the agent pod, in values.yaml (https://github.com/helm/charts/tree/master/stable/datadog):
`# agents.volumes -- Specify additional volumes to mount in the dd-agent container
volumes:
  - hostPath:
      path: /home/docker/.minikube
    name: cert
volumeMounts:
  - name: cert
    mountPath: /opt/datadog-agent/cert
    readOnly: true`
I also added this env: ` # agents.containers.agent.env -- Additional environment variables for the agent container env:
Finally, I recreated the minikube cluster listening on port 6443.
Now I receive this error in the agent log:
Instance ID: kube_controller_manager:1476f03dc31e9882 [ERROR]
Configuration Source: file:/etc/datadog-agent/conf.d/kube_controller_manager.d/auto_conf.yaml
Total Runs: 378
Metric Samples: Last Run: 0, Total: 0
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 1, Total: 378
Average Execution Time : 8ms
Last Execution Date : 2020-10-27 13:34:23.000000 UTC
Last Successful Execution Date : Never
Error: HTTPConnectionPool(host='192.168.71.129', port=**10252**): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f902bacf220>: Failed to establish a new connection: [Errno 111] Connection refused'))
Traceback (most recent call last):
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 159, in _new_conn
conn = connection.create_connection(
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py", line 84, in create_connection
raise err
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py", line 74, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 392, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1255, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1301, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1250, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1010, in _send_output
self.send(msg)
File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 950, in send
self.connect()
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 187, in connect
conn = self._new_conn()
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 171, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f902bacf220>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen
retries = retries.increment(
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/retry.py", line 439, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='192.168.71.129', port=**10252**): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f902bacf220>: Failed to establish a new connection: [Errno 111] Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 828, in run
self.check(instance)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kube_controller_manager/kube_controller_manager.py", line 148, in check
self.process(scraper_config, metric_transformers=transformers)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 507, in process
for metric in self.scrape_metrics(scraper_config):
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 447, in scrape_metrics
response = self.poll(scraper_config)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 713, in poll
response = self.send_request(endpoint, scraper_config, headers)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 739, in send_request
return http_handler.get(endpoint, stream=True, **kwargs)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 283, in get
return self._request('get', url, options)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 332, in _request
return getattr(requests, method)(url, **new_options)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='192.168.71.129', port=10252): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f902bacf220>: Failed to establish a new connection: [Errno 111] Connection refused'))
kube_scheduler (1.5.0)
----------------------
Instance ID: kube_scheduler:f948e5430c3c100b [ERROR]
Configuration Source: file:/etc/datadog-agent/conf.d/kube_scheduler.d/auto_conf.yaml
Total Runs: 378
Metric Samples: Last Run: 0, Total: 0
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 1, Total: 378
Average Execution Time : 7ms
Last Execution Date : 2020-10-27 13:34:30.000000 UTC
Last Successful Execution Date : Never
Error: HTTPConnectionPool(host='192.168.71.129', port=**10251**): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f902bacfe80>: Failed to establish a new connection: [Errno 111] Connection refused'))
Traceback (most recent call last):
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 159, in _new_conn
conn = connection.create_connection(
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py", line 84, in create_connection
raise err
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py", line 74, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 392, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1255, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1301, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1250, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1010, in _send_output
self.send(msg)
File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 950, in send
self.connect()
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 187, in connect
conn = self._new_conn()
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 171, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f902bacfe80>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen
retries = retries.increment(
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/retry.py", line 439, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='192.168.71.129', port=10251): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f902bacfe80>: Failed to establish a new connection: [Errno 111] Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 828, in run
self.check(instance)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kube_scheduler/kube_scheduler.py", line 139, in check
self.process(scraper_config, metric_transformers=transformers)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 507, in process
for metric in self.scrape_metrics(scraper_config):
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 447, in scrape_metrics
response = self.poll(scraper_config)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 713, in poll
response = self.send_request(endpoint, scraper_config, headers)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 739, in send_request
return http_handler.get(endpoint, stream=True, **kwargs)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 283, in get
return self._request('get', url, options)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 332, in _request
return getattr(requests, method)(url, **new_options)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='192.168.71.129', port=10251): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f902bacfe80>: Failed to establish a new connection: [Errno 111] Connection refused'))`
Is it normal that it tries to connect on ports other than 10250?
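In the status output above, the kube_controller_manager check targets port 10252 and the kube_scheduler check targets port 10251, both over plain HTTP, and both fail with "Connection refused". One quick way to see which of these ports accept TCP connections from the agent pod is sketched below in Python. `port_is_open` and `probe_ports` are hypothetical helper names, and the 2-second timeout is an arbitrary choice.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def probe_ports(host, ports=(10250, 10251, 10252)):
    """Return a {port: reachable} mapping for the ports seen in this thread."""
    return {port: port_is_open(host, port) for port in ports}
```

For example, `probe_ports("192.168.71.129")` reports which of the three ports from the logs are listening on that host. A refused port means nothing is bound there, which is a different problem from the certificate errors discussed earlier.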
@kerberos5 I am also receiving the exact same errors, did you end up solving this issue?
When deploying to an IBM Cloud cluster (Kubernetes 1.9) the Datadog collector does not work. This has been reported to Datadog support as ticket #129722.
It is still not clear to me if this is an issue with IBM or with Datadog.
There is a workaround: after disabling TLS verification (kubelet_tls_verify in /etc/dd-agent/conf.d/kubernetes.yaml) it connects OK; however, we cannot run this way in production. The bearer token path /var/run/secrets/kubernetes.io/serviceaccount/token appears to be populated OK.
Output of the info page:
Output of the collector log:
Steps to reproduce the issue:
Describe the results you received:
Infrastructure host info in Datadog console reports: