Closed: OurFriendIrony closed this issue 1 year ago
From the logs and the initial hostname error, it seems the Agent is unable to connect to the kubelet. Could you please check whether these troubleshooting steps resolve your issue: https://docs.datadoghq.com/agent/troubleshooting/hostname_containers/?tab=awsecsonec2#kubernetes-hostname-errors
Hi @levan-m,
Thanks for the response. I posted two configurations; the first, I believe, returns the hostname error you referenced. I used one of the troubleshooting steps to design the second configuration, which produces different errors.
Below I have introduced "tlsVerify: false" (another of the troubleshooting suggestions), which produces the issues below.
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  global:
    kubelet:
      tlsVerify: false
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
      appSecret:
        secretName: datadog-secret
        keyName: app-key
  features:
    apm:
      enabled: true
    logCollection:
      enabled: true
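For context, and this is my own understanding rather than something from the guide: the Operator translates `global.kubelet.tlsVerify: false` into the `DD_KUBELET_TLS_VERIFY` environment variable on the agent containers. If you were deploying the Agent DaemonSet directly instead of via the Operator, the equivalent snippet would look roughly like:

```yaml
# Hypothetical equivalent in a hand-written Agent DaemonSet container spec
# (not needed when the Operator manages the pods):
env:
  - name: DD_KUBELET_TLS_VERIFY
    value: "false"
```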
The cluster-agent reports no ERROR lines; its last log entry is:
2023-06-05 11:29:30 UTC | CLUSTER | INFO | (pkg/clusteragent/clusterchecks/handler.go:198 in leaderWatch) | Found leadership status after 14 tries
The regular agents are in CrashLoopBackOff; I've dumped the non-INFO logs below:
Defaulted container "agent" out of: agent, trace-agent, process-agent, init-volume (init), init-config (init)
2023-06-05 11:28:15 UTC | CORE | WARN | (pkg/util/log/log.go:618 in func1) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec
2023-06-05 11:28:19 UTC | CORE | WARN | (pkg/autodiscovery/providers/config_reader.go:174 in read) | Skipping, open /opt/datadog-agent/bin/agent/dist/conf.d: no such file or directory
2023-06-05 11:28:19 UTC | CORE | WARN | (pkg/autodiscovery/providers/config_reader.go:174 in read) | Skipping, open : no such file or directory
2023-06-05 11:28:19 UTC | CORE | WARN | (pkg/secrets/secrets.go:50 in Init) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec
2023-06-05 11:28:19 UTC | CORE | ERROR | (pkg/util/version_history.go:103 in logVersionHistoryToFile) | Cannot write json file: /opt/datadog-agent/run/version-history.json open /opt/datadog-agent/run/version-history.json: permission denied
2023-06-05 11:28:19 UTC | CORE | ERROR | (pkg/dogstatsd/server.go:270 in NewServer) | can't listen: listen unixgram /var/run/datadog/statsd/dsd.socket: bind: permission denied
2023-06-05 11:28:20 UTC | CORE | WARN | (cmd/agent/common/misconfig/global.go:15 in ToLog) | misconfig: proc mount: failed to open /host/proc/1/mounts - proc fs inspection may not work: open /host/proc/1/mounts: permission denied
2023-06-05 11:28:21 UTC | CORE | WARN | (pkg/logs/auditor/auditor.go:184 in func2) | open /opt/datadog-agent/run/registry.json: permission denied
2023-06-05 11:28:24 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:69 in Error) | check:datadog_cluster_agent | Error running check: [{"message": "HTTPConnectionPool(host='10.128.2.229', port=5000): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdbe916e040>: Failed to establish a new connection: [Errno 113] No route to host'))", "traceback": "Traceback (most recent call last):\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py\", line 174, in _new_conn\n conn = connection.create_connection(\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py\", line 95, in create_connection\n raise err\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py\", line 85, in create_connection\n sock.connect(sa)\nOSError: [Errno 113] No route to host\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py\", line 703, in urlopen\n httplib_response = self._make_request(\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py\", line 398, in _make_request\n conn.request(method, url, **httplib_request_kw)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py\", line 239, in request\n super(HTTPConnection, self).request(method, url, body=body, headers=headers)\n File \"/opt/datadog-agent/embedded/lib/python3.8/http/client.py\", line 1256, in request\n self._send_request(method, url, body, headers, encode_chunked)\n File \"/opt/datadog-agent/embedded/lib/python3.8/http/client.py\", line 1302, in _send_request\n self.endheaders(body, encode_chunked=encode_chunked)\n File \"/opt/datadog-agent/embedded/lib/python3.8/http/client.py\", line 1251, in endheaders\n self._send_output(message_body, 
encode_chunked=encode_chunked)\n File \"/opt/datadog-agent/embedded/lib/python3.8/http/client.py\", line 1011, in _send_output\n self.send(msg)\n File \"/opt/datadog-agent/embedded/lib/python3.8/http/client.py\", line 951, in send\n self.connect()\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py\", line 205, in connect\n conn = self._new_conn()\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py\", line 186, in _new_conn\n raise NewConnectionError(\nurllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fdbe916e040>: Failed to establish a new connection: [Errno 113] No route to host\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py\", line 489, in send\n resp = conn.urlopen(\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py\", line 787, in urlopen\n retries = retries.increment(\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/retry.py\", line 592, in increment\n raise MaxRetryError(_pool, url, error or ResponseError(cause))\nurllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.128.2.229', port=5000): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdbe916e040>: Failed to establish a new connection: [Errno 113] No route to host'))\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 1122, in run\n self.check(instance)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/base_check.py\", line 142, in check\n self.process(scraper_config)\n File 
\"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 573, in process\n for metric in self.scrape_metrics(scraper_config):\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 500, in scrape_metrics\n response = self.poll(scraper_config)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 837, in poll\n response = self.send_request(endpoint, scraper_config, headers)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 863, in send_request\n return http_handler.get(endpoint, stream=True, **kwargs)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py\", line 355, in get\n return self._request('get', url, options)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py\", line 419, in _request\n response = self.make_request_aia_chasing(request_method, method, url, new_options, persist)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py\", line 425, in make_request_aia_chasing\n response = request_method(url, **new_options)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py\", line 73, in get\n return request(\"get\", url, params=params, **kwargs)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py\", line 59, in request\n return session.request(method=method, url=url, **kwargs)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py\", line 587, in request\n resp = self.send(prep, **send_kwargs)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py\", line 701, in send\n r = adapter.send(request, **kwargs)\n File 
\"/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py\", line 565, in send\n raise ConnectionError(e, request=request)\nrequests.exceptions.ConnectionError: HTTPConnectionPool(host='10.128.2.229', port=5000): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdbe916e040>: Failed to establish a new connection: [Errno 113] No route to host'))\n"}]
2023-06-05 11:28:26 UTC | CORE | WARN | (pkg/util/cloudproviders/gce/gce_tags.go:50 in getCachedTags) | unable to get tags from gce and cache is empty: GCE metadata API error: Get "http://169.254.169.254/computeMetadata/v1/?recursive=true": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
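For reference, a non-INFO dump like the one above can be produced by filtering on the log-level column. A minimal sketch (the sample here stands in for the real log stream, which in practice would come from `kubectl logs <agent-pod> -c agent`):

```shell
# Filter out INFO lines from Datadog Agent logs, keeping WARN/ERROR.
# The sample below is a stand-in; pipe real agent logs through the same grep.
sample='2023-06-05 11:28:19 UTC | CORE | INFO | all good
2023-06-05 11:28:19 UTC | CORE | WARN | something odd
2023-06-05 11:28:24 UTC | CORE | ERROR | something broken'

# -E: extended regex, -v: invert match, so only non-INFO lines survive
printf '%s\n' "$sample" | grep -Ev '\| INFO \|'
```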
Any ideas for things to look into would be much appreciated.
The issue was not in the operator but in the installation method. On OpenShift, the operator needs to be installed using their Subscription method rather than via Helm. This resulted in a clean deployment.
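For anyone hitting the same wall: an OLM Subscription for the operator would look roughly like the sketch below. The channel, source, and namespace values are assumptions on my part; verify the correct ones against OperatorHub in your cluster.

```yaml
# Hypothetical OLM Subscription installing the Datadog Operator on OpenShift.
# channel/source/namespace values are assumptions; check OperatorHub.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: datadog-operator
  namespace: openshift-operators
spec:
  channel: stable
  name: datadog-operator
  source: certified-operators
  sourceNamespace: openshift-marketplace
```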
Describe what happened: I'm implementing a ROSA (OpenShift on AWS) cluster, which is essentially Kubernetes with various additional deployments.
As this is a fresh install, I am following the Getting Started documentation, but I am currently unable to get the agent/cluster-agent to run successfully. The steps seem incredibly simple and I have not deviated from the guide, so I'm at a loss as to the cause of the issue.
Describe what you expected: Following the "Getting Started" documentation, I expect 1 operator, 1 cluster-agent, and 3 agent pods with status Running. The intention is to have a single Datadog operator with an agent in each of 3 different namespaces, but so far I have been unable to get it running in even a single namespace. Any suggestions would be appreciated.
Steps to reproduce the issue: In an OpenShift cluster, in the default namespace, using the following agent configuration:
Which produces an agent error:
I have then adjusted the spec to be the below, and reapplied
Pod status
I can see the following in the cluster-agent logs, which looks somewhat promising
agent logs are as follows:
Additional environment details (Operating System, Cloud provider, etc):
Provider: ROSA (RedHat OpenShift on AWS)
Kubernetes: v1.25.8+37a9a08
Namespace: default
Datadog Plan: Datadog Pro (app.datadoghq.com)