DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Agent cannot connect to kubelet #6621

Closed by Gowiem 3 years ago

Gowiem commented 3 years ago

Output of the info page (if this is a bug)

Getting the status from the agent.

===============
Agent (v7.23.1)
===============

  Status date: 2020-10-23 00:47:05.200537 UTC
  Agent start: 2020-10-22 23:53:34.100610 UTC
  Pid: 435
  Go Version: go1.14.7
  Python Version: 3.8.5
  Build arch: amd64
  Agent flavor: agent
  Check Runners: 4
  Log Level: INFO

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    System UTC time: 2020-10-23 00:47:05.200537 UTC

  Host Info
  =========
    bootTime: 2020-10-22 23:40:26.000000 UTC
    kernelArch: x86_64
    kernelVersion: 4.14.193-149.317.amzn2.x86_64
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: bullseye/sid
    procs: 15
    uptime: 13m25s
    virtualizationRole: guest
    virtualizationSystem: xen

  Hostnames
  =========
    socket-fqdn: REDACTED-7db7bfc879-s9df7
    socket-hostname: REDACTED-7db7bfc879-s9df7
    host tags:
      cluster_name:REDACTED
      env:REDACTED
    hostname provider: 
    unused hostname providers:
      configuration/environment: hostname is empty

  Metadata
  ========

=========
Collector
=========

  Running Checks
  ==============

    eks_fargate (1.1.1)
    -------------------
      Instance ID: eks_fargate:d884b5186b651429 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/eks_fargate.d/conf.yaml.default
      Total Runs: 214
      Metric Samples: Last Run: 1, Total: 214
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2020-10-23 00:47:04.000000 UTC
      Last Successful Execution Date : 2020-10-23 00:47:04.000000 UTC

    kubelet (5.0.0)
    ---------------
      Instance ID: kubelet:d884b5186b651429 [ERROR]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubelet.d/conf.yaml.default
      Total Runs: 213
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2020-10-23 00:46:56.000000 UTC
      Last Successful Execution Date : Never
      Error: Unable to detect the kubelet URL automatically: cannot connect: https: "Get \"https://:10250/pods\": dial tcp :10250: connect: connection refused", http: "Get \"http://:10255/pods\": dial tcp :10255: connect: connection refused"
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 828, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kubelet/kubelet.py", line 295, in check
          raise CheckException("Unable to detect the kubelet URL automatically: " + kubelet_conn_info.get('err', ''))
      datadog_checks.base.errors.CheckException: Unable to detect the kubelet URL automatically: cannot connect: https: "Get \"https://:10250/pods\": dial tcp :10250: connect: connection refused", http: "Get \"http://:10255/pods\": dial tcp :10255: connect: connection refused"

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  Transactions
  ============
    CheckRunsV1: 213
    Connections: 0
    Containers: 0
    Deployments: 0
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 22
    Metadata: 0
    Nodes: 0
    Pods: 0
    Processes: 0
    RTContainers: 0
    RTProcesses: 0
    ReplicaSets: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Series: 0
    ServiceChecks: 0
    Services: 0
    SketchSeries: 0
    Success: 448
    TimeseriesV1: 213

  API Keys status
  ===============
    API key ending with 42793: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - 42793

Describe what happened:

I'm trying to run the Datadog Agent as a sidecar in my EKS Fargate pods, but I keep getting the seemingly common "cannot connect to kubelet" errors. This is the latest iteration:

2020-10-23T01:27:58.521589856Z starting agent
2020-10-23T01:27:58.522683161Z starting system-probe
2020-10-23T01:27:59.224556973Z [services.d] done.
2020-10-23T01:28:01.423422387Z 2020-10-23 01:28:01 UTC | PROCESS | INFO | (pkg/util/log/log.go:465 in func1) | EKS on Fargate mode detected, will proxy calls to the Kubelet through the APIServer at https://172.20.0.1:443/api/v1/nodes/fargate-ip-10-9-22-178.ec2.internal/proxy/
2020-10-23T01:28:01.423628412Z 2020-10-23 01:28:01 UTC | PROCESS | INFO | (pkg/util/log/log.go:460 in func1) | Skipping TLS verification
2020-10-23T01:28:01.423877368Z 2020-10-23 01:28:01 UTC | PROCESS | WARN | (pkg/util/log/log.go:480 in func1) | Failed to securely reach the kubelet over HTTPS, received a status 403. Trying a non secure connection over HTTP. We highly recommend configuring TLS to access the kubelet
2020-10-23T01:28:01.424106681Z 2020-10-23 01:28:01 UTC | PROCESS | INFO | (pkg/util/log/log.go:465 in func1) | overriding API key from env DD_API_KEY value
2020-10-23T01:28:01.424163083Z 2020-10-23 01:28:01 UTC | PROCESS | INFO | (pkg/process/config/config.go:295 in mergeConfigIfExists) | no config exists at /etc/datadog-agent/system-probe.yaml, ignoring...
2020-10-23T01:28:01.424191510Z 2020-10-23 01:28:01 UTC | PROCESS | INFO | (pkg/process/config/config.go:454 in loadEnvVariables) | overriding API key from env DD_API_KEY value
2020-10-23T01:28:01.424647590Z 2020-10-23 01:28:01 UTC | PROCESS | INFO | (pkg/process/config/yaml_config.go:189 in loadSysProbeYamlConfig) | network_config not found, enabling network check by default
2020-10-23T01:28:02.321248893Z 2020-10-23 01:28:02 UTC | CORE | INFO | (cmd/agent/app/run.go:183 in StartAgent) | Starting Datadog Agent v7.23.1
2020-10-23T01:28:02.321272392Z 2020-10-23 01:28:02 UTC | CORE | INFO | (cmd/agent/app/run.go:227 in StartAgent) | Hostname is: 
2020-10-23T01:28:04.321989049Z 2020-10-23 01:28:04 UTC | CORE | INFO | (pkg/api/security/security.go:145 in fetchAuthToken) | Saved a new authentication token to /etc/datadog-agent/auth_token
2020-10-23T01:28:04.424047973Z 2020-10-23 01:28:04 UTC | CORE | INFO | (cmd/agent/app/run.go:254 in StartAgent) | GUI server port -1 specified: not starting the GUI.
2020-10-23T01:28:04.424569167Z 2020-10-23 01:28:04 UTC | CORE | INFO | (pkg/forwarder/forwarder.go:270 in Start) | Forwarder started, sending to 1 endpoint(s) with 1 worker(s) each: "https://7-23-1-app.agent.datadoghq.com" (1 api key(s))
2020-10-23T01:28:04.424723548Z 2020-10-23 01:28:04 UTC | CORE | INFO | (pkg/logs/client/http/destination.go:176 in CheckConnectivity) | Checking HTTP connectivity...
2020-10-23T01:28:04.524575498Z 2020-10-23 01:28:04 UTC | TRACE | INFO | (pkg/util/log/log.go:465 in func1) | Loaded configuration: /etc/datadog-agent/datadog.yaml
2020-10-23T01:28:04.619969109Z 2020-10-23 01:28:04 UTC | CORE | INFO | (pkg/logs/client/http/destination.go:182 in CheckConnectivity) | Sending HTTP connectivity request to https://agent-http-intake.logs.datadoghq.com/v1/input/***************************42793...
2020-10-23T01:28:04.620174238Z 2020-10-23 01:28:04 UTC | CORE | INFO | (pkg/dogstatsd/listeners/udp.go:97 in Listen) | dogstatsd-udp: starting to listen on 127.0.0.1:8125
2020-10-23T01:28:05.723968645Z 2020-10-23 01:28:05 UTC | CORE | INFO | (pkg/logs/client/http/destination.go:187 in CheckConnectivity) | HTTP connectivity successful
2020-10-23T01:28:05.723991173Z 2020-10-23 01:28:05 UTC | CORE | INFO | (pkg/logs/input/container/launcher.go:55 in NewLauncher) | Could not setup the docker launcher: temporary failure in dockerutil, will retry later: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2020-10-23T01:28:05.724000970Z 2020-10-23 01:28:05 UTC | CORE | INFO | (pkg/logs/input/container/launcher.go:62 in NewLauncher) | Could not setup the kubernetes launcher: /var/log/pods not found
2020-10-23T01:28:05.724059580Z 2020-10-23 01:28:05 UTC | CORE | INFO | (pkg/logs/input/container/launcher.go:71 in NewLauncher) | Container logs won't be collected unless a docker daemon is eventually started
2020-10-23T01:28:05.724072482Z 2020-10-23 01:28:05 UTC | CORE | INFO | (pkg/logs/logs.go:85 in Start) | Starting logs-agent...
2020-10-23T01:28:05.724098372Z 2020-10-23 01:28:05 UTC | CORE | INFO | (pkg/logs/logs.go:88 in Start) | logs-agent started
2020-10-23T01:28:05.724107432Z 2020-10-23 01:28:05 UTC | CORE | INFO | (cmd/agent/app/run.go:312 in StartAgent) | System probe config not found, disabling pulling system probe info in the status page: open /etc/datadog-agent/system-probe.yaml: no such file or directory
2020-10-23T01:28:05.724114782Z 2020-10-23 01:28:05 UTC | CORE | INFO | (pkg/util/version_history.go:43 in logVersionHistoryToFile) | Cannot read file: /opt/datadog-agent/run/version-history.json, will create a new one. open /opt/datadog-agent/run/version-history.json: no such file or directory
2020-10-23T01:28:05.724122054Z 2020-10-23 01:28:05 UTC | CORE | INFO | (pkg/util/kubernetes/kubelet/kubelet.go:714 in setKubeletHost) | EKS on Fargate mode detected, will proxy calls to the Kubelet through the APIServer at https://172.20.0.1:443/api/v1/nodes/fargate-ip-10-9-22-178.ec2.internal/proxy/
2020-10-23T01:28:05.724138193Z 2020-10-23 01:28:05 UTC | CORE | INFO | (pkg/util/kubernetes/kubelet/init.go:30 in buildTLSConfig) | Skipping TLS verification
2020-10-23T01:28:05.729744533Z 2020-10-23 01:28:05 UTC | CORE | WARN | (pkg/util/kubernetes/kubelet/kubelet.go:526 in setupKubeletAPIEndpoint) | Failed to securely reach the kubelet over HTTPS, received a status 403. Trying a non secure connection over HTTP. We highly recommend configuring TLS to access the kubelet
2020-10-23T01:28:05.730304078Z 2020-10-23 01:28:05 UTC | CORE | INFO | (pkg/tagger/tagger.go:158 in tryCollectors) | static tag collector successfully started
2020-10-23T01:28:06.623134486Z 2020-10-23 01:28:06 UTC | TRACE | INFO | (pkg/util/kubernetes/kubelet/kubelet.go:714 in setKubeletHost) | EKS on Fargate mode detected, will proxy calls to the Kubelet through the APIServer at https://172.20.0.1:443/api/v1/nodes/fargate-ip-10-9-22-178.ec2.internal/proxy/
2020-10-23T01:28:06.623158307Z 2020-10-23 01:28:06 UTC | TRACE | INFO | (pkg/util/kubernetes/kubelet/init.go:30 in buildTLSConfig) | Skipping TLS verification
2020-10-23T01:28:06.628984833Z 2020-10-23 01:28:06 UTC | TRACE | WARN | (pkg/util/kubernetes/kubelet/kubelet.go:526 in setupKubeletAPIEndpoint) | Failed to securely reach the kubelet over HTTPS, received a status 403. Trying a non secure connection over HTTP. We highly recommend configuring TLS to access the kubelet
2020-10-23T01:28:06.724218223Z 2020-10-23 01:28:06 UTC | TRACE | INFO | (pkg/tagger/tagger.go:158 in tryCollectors) | static tag collector successfully started
2020-10-23T01:28:06.724338855Z 2020-10-23 01:28:06 UTC | TRACE | INFO | (pkg/trace/agent/run.go:131 in Run) | Trace agent running on host oc-location-7db7bfc879-9zjw5
2020-10-23T01:28:06.724352273Z 2020-10-23 01:28:06 UTC | TRACE | INFO | (pkg/trace/api/api.go:125 in Start) | Listening for traces at http://0.0.0.0:8126
2020-10-23T01:28:07.226389221Z 2020-10-23 01:28:07 UTC | PROCESS | INFO | (main_common.go:107 in runAgent) | running on platform: linux-4.14.193-149.317.amzn2.x86_64-x86_64-with-glibc2.2.5
2020-10-23T01:28:07.226479677Z 2020-10-23 01:28:07 UTC | PROCESS | INFO | (main_common.go:110 in runAgent) | running version: Version: 7.23.1, Git hash: 8099db1, Git branch: HEAD, Build date: 2020-10-20T22:32:54, Go Version: go version go1.14.7 linux/amd64, 
2020-10-23T01:28:07.821651248Z 2020-10-23 01:28:07 UTC | CORE | INFO | (pkg/util/kubernetes/kubelet/kubelet.go:714 in setKubeletHost) | EKS on Fargate mode detected, will proxy calls to the Kubelet through the APIServer at https://172.20.0.1:443/api/v1/nodes/fargate-ip-10-9-22-178.ec2.internal/proxy/
2020-10-23T01:28:07.821673758Z 2020-10-23 01:28:07 UTC | CORE | INFO | (pkg/util/kubernetes/kubelet/init.go:30 in buildTLSConfig) | Skipping TLS verification
2020-10-23T01:28:07.826900130Z 2020-10-23 01:28:07 UTC | CORE | WARN | (pkg/util/kubernetes/kubelet/kubelet.go:526 in setupKubeletAPIEndpoint) | Failed to securely reach the kubelet over HTTPS, received a status 403. Trying a non secure connection over HTTP. We highly recommend configuring TLS to access the kubelet
2020-10-23T01:28:07.828126859Z 2020-10-23 01:28:07 UTC | CORE | INFO | (pkg/collector/runner/runner.go:92 in NewRunner) | Runner started with 4 workers.
2020-10-23T01:28:07.919867043Z 2020-10-23 01:28:07 UTC | CORE | INFO | (pkg/collector/python/init.go:311 in Initialize) | Initializing rtloader with python3 /opt/datadog-agent/embedded
2020-10-23T01:28:09.227917701Z 2020-10-23 01:28:09 UTC | PROCESS | INFO | (pkg/util/kubernetes/kubelet/kubelet.go:714 in setKubeletHost) | EKS on Fargate mode detected, will proxy calls to the Kubelet through the APIServer at https://172.20.0.1:443/api/v1/nodes/fargate-ip-10-9-22-178.ec2.internal/proxy/
2020-10-23T01:28:09.227951706Z 2020-10-23 01:28:09 UTC | PROCESS | INFO | (pkg/util/kubernetes/kubelet/init.go:30 in buildTLSConfig) | Skipping TLS verification
2020-10-23T01:28:09.321818958Z 2020-10-23 01:28:09 UTC | PROCESS | WARN | (pkg/util/kubernetes/kubelet/kubelet.go:526 in setupKubeletAPIEndpoint) | Failed to securely reach the kubelet over HTTPS, received a status 403. Trying a non secure connection over HTTP. We highly recommend configuring TLS to access the kubelet
2020-10-23T01:28:09.323896283Z 2020-10-23 01:28:09 UTC | PROCESS | INFO | (pkg/tagger/tagger.go:158 in tryCollectors) | static tag collector successfully started
2020-10-23T01:28:09.626237584Z 2020-10-23 01:28:09 UTC | CORE | INFO | (pkg/util/cloudprovider.go:54 in DetectCloudProvider) | No cloud provider detected
2020-10-23T01:28:10.725954689Z 2020-10-23 01:28:10 UTC | PROCESS | INFO | (pkg/process/checks/process.go:48 in Init) | no network ID detected: could not detect network ID
2020-10-23T01:28:10.726097973Z 2020-10-23 01:28:10 UTC | PROCESS | INFO | (collector.go:175 in run) | Starting process-agent for host=fargate-ip-10-9-22-178.ec2.internal, endpoints=[https://process.datadoghq.com], orchestrator endpoints=[https://orchestrator.datadoghq.com], enabled checks=[process rtprocess Network]
2020-10-23T01:28:10.726363579Z 2020-10-23 01:28:10 UTC | PROCESS | INFO | (pkg/forwarder/forwarder.go:270 in Start) | Forwarder started, sending to 1 endpoint(s) with 1 worker(s) each: "https://process.datadoghq.com" (1 api key(s))
2020-10-23T01:28:10.726497573Z 2020-10-23 01:28:10 UTC | PROCESS | INFO | (pkg/forwarder/forwarder.go:270 in Start) | Forwarder started, sending to 1 endpoint(s) with 1 worker(s) each: "https://orchestrator.datadoghq.com" (1 api key(s))
2020-10-23T01:28:10.821284195Z 2020-10-23 01:28:10 UTC | PROCESS | INFO | (collector.go:157 in runCheck) | Finished process check #1 in 94.543378ms
2020-10-23T01:28:10.922702914Z 2020-10-23 01:28:10 UTC | SYS-PROBE | INFO | (pkg/util/log/log.go:465 in func1) | no config exists at /etc/datadog-agent/system-probe.yaml, ignoring...
2020-10-23T01:28:10.922771328Z 2020-10-23 01:28:10 UTC | SYS-PROBE | INFO | (pkg/util/log/log.go:465 in func1) | overriding API key from env DD_API_KEY value
2020-10-23T01:28:10.922784613Z 2020-10-23 01:28:10 UTC | SYS-PROBE | INFO | (pkg/util/log/log.go:460 in func1) | network_config not found, enabling network check by default
2020-10-23T01:28:10.922871375Z 2020-10-23 01:28:10 UTC | SYS-PROBE | INFO | (cmd/system-probe/main.go:84 in runAgent) | system probe not enabled. exiting.
2020-10-23T01:28:11.426357142Z 2020-10-23 01:28:11 UTC | SECURITY | INFO | (app/app.go:165 in start) | All security-agent components are deactivated, exiting
2020-10-23T01:28:14.925534798Z 2020-10-23 01:28:14 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:122 in LogMessage) | - | (ddyaml.py:123) | monkey patching yaml.load...
2020-10-23T01:28:14.925694867Z 2020-10-23 01:28:14 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:122 in LogMessage) | - | (ddyaml.py:127) | monkey patching yaml.load_all...
2020-10-23T01:28:14.925905775Z 2020-10-23 01:28:14 UTC | CORE | INFO | (pkg/collector/python/datadog_agent.go:122 in LogMessage) | - | (ddyaml.py:131) | monkey patching yaml.dump_all... (affects all yaml dump operations)
2020-10-23T01:28:15.321453333Z 2020-10-23 01:28:15 UTC | CORE | INFO | (pkg/collector/collector.go:57 in NewCollector) | Embedding Python 3.8.5 (default, Oct 20 2020, 22:31:39) [GCC 4.7.2]
2020-10-23T01:28:15.525722651Z 2020-10-23 01:28:15 UTC | CORE | INFO | (cmd/agent/common/autoconfig.go:72 in SetupAutoConfig) | Registering kubelet config provider polled every 10s
2020-10-23T01:28:15.526204380Z 2020-10-23 01:28:15 UTC | CORE | INFO | (pkg/util/kubernetes/kubelet/kubelet.go:714 in setKubeletHost) | EKS on Fargate mode detected, will proxy calls to the Kubelet through the APIServer at https://172.20.0.1:443/api/v1/nodes/fargate-ip-10-9-22-178.ec2.internal/proxy/
2020-10-23T01:28:15.526332972Z 2020-10-23 01:28:15 UTC | CORE | INFO | (pkg/util/kubernetes/kubelet/init.go:30 in buildTLSConfig) | Skipping TLS verification
2020-10-23T01:28:15.531728062Z 2020-10-23 01:28:15 UTC | CORE | WARN | (pkg/util/kubernetes/kubelet/kubelet.go:526 in setupKubeletAPIEndpoint) | Failed to securely reach the kubelet over HTTPS, received a status 403. Trying a non secure connection over HTTP. We highly recommend configuring TLS to access the kubelet
2020-10-23T01:28:15.535856068Z 2020-10-23 01:28:15 UTC | CORE | INFO | (pkg/autodiscovery/autoconfig.go:363 in initListenerCandidates) | kubelet listener cannot start, will retry: temporary failure in kubeutil, will retry later: cannot connect: https: "Get \"https://:10250/pods\": dial tcp :10250: connect: connection refused", http: "Get \"http://:10255/pods\": dial tcp :10255: connect: connection refused"
2020-10-23T01:28:15.536002068Z 2020-10-23 01:28:15 UTC | CORE | INFO | (pkg/autodiscovery/providers/file.go:74 in Collect) | file: searching for configuration files at: /etc/datadog-agent/conf.d
2020-10-23T01:28:15.922179191Z 2020-10-23 01:28:15 UTC | CORE | INFO | (pkg/autodiscovery/providers/file.go:74 in Collect) | file: searching for configuration files at: /opt/datadog-agent/bin/agent/dist/conf.d
2020-10-23T01:28:15.922327117Z 2020-10-23 01:28:15 UTC | CORE | WARN | (pkg/autodiscovery/providers/file.go:78 in Collect) | Skipping, open /opt/datadog-agent/bin/agent/dist/conf.d: no such file or directory
2020-10-23T01:28:15.922478553Z 2020-10-23 01:28:15 UTC | CORE | INFO | (pkg/autodiscovery/providers/file.go:74 in Collect) | file: searching for configuration files at: 
2020-10-23T01:28:15.922568319Z 2020-10-23 01:28:15 UTC | CORE | WARN | (pkg/autodiscovery/providers/file.go:78 in Collect) | Skipping, open : no such file or directory
2020-10-23T01:28:16.020076169Z system-probe exited with code 0, disabling
2020-10-23T01:28:16.039140785Z 2020-10-23 01:28:16 UTC | CORE | INFO | (pkg/collector/scheduler/scheduler.go:85 in Enter) | Scheduling check eks_fargate with an interval of 15s
2020-10-23T01:28:16.039247025Z 2020-10-23 01:28:16 UTC | CORE | INFO | (pkg/collector/scheduler/scheduler.go:85 in Enter) | Scheduling check kubelet with an interval of 15s
2020-10-23T01:28:16.039383659Z 2020-10-23 01:28:16 UTC | CORE | INFO | (pkg/logs/scheduler/scheduler.go:66 in Schedule) | Received a new logs config: custom_log_collection
2020-10-23T01:28:16.120692131Z 2020-10-23 01:28:16 UTC | CORE | INFO | (pkg/logs/input/file/scanner.go:248 in handleTailingModeChange) | Tailing mode changed for file:/var/log/containers/application.log. Was: end: Now: beginning
2020-10-23T01:28:16.120876105Z 2020-10-23 01:28:16 UTC | CORE | INFO | (pkg/logs/input/file/scanner.go:225 in startNewTailer) | Starting a new tailer for: /var/log/containers/application.log (offset: 0, whence: 0) for tailer key /var/log/containers/application.log
2020-10-23T01:28:16.120974288Z 2020-10-23 01:28:16 UTC | CORE | INFO | (pkg/logs/input/file/tailer_nix.go:29 in setup) | Opening /var/log/containers/application.log for tailer key /var/log/containers/application.log
2020-10-23T01:28:16.430223728Z security-agent exited with code 0, disabling
2020-10-23T01:28:16.725359315Z 2020-10-23 01:28:16 UTC | TRACE | INFO | (pkg/trace/info/stats.go:101 in LogStats) | No data received
2020-10-23T01:28:17.039473708Z 2020-10-23 01:28:17 UTC | CORE | INFO | (pkg/collector/runner/runner.go:261 in work) | check:eks_fargate | Running check
2020-10-23T01:28:17.040322890Z 2020-10-23 01:28:17 UTC | CORE | INFO | (pkg/collector/runner/runner.go:327 in work) | check:eks_fargate | Done running check
2020-10-23T01:28:19.447889419Z 2020-10-23 01:28:19 UTC | CORE | INFO | (pkg/forwarder/transaction.go:293 in internalProcess) | Successfully posted payload to "https://7-23-1-app.agent.datadoghq.com/api/v1/check_run?api_key=*************************42793", the agent will only log transaction success every 500 transactions
2020-10-23T01:28:20.824332895Z 2020-10-23 01:28:20 UTC | PROCESS | INFO | (pkg/util/kubernetes/kubelet/kubelet.go:714 in setKubeletHost) | EKS on Fargate mode detected, will proxy calls to the Kubelet through the APIServer at https://172.20.0.1:443/api/v1/nodes/fargate-ip-10-9-22-178.ec2.internal/proxy/
2020-10-23T01:28:20.824514815Z 2020-10-23 01:28:20 UTC | PROCESS | INFO | (pkg/util/kubernetes/kubelet/init.go:30 in buildTLSConfig) | Skipping TLS verification
2020-10-23T01:28:20.829773592Z 2020-10-23 01:28:20 UTC | PROCESS | WARN | (pkg/util/kubernetes/kubelet/kubelet.go:526 in setupKubeletAPIEndpoint) | Failed to securely reach the kubelet over HTTPS, received a status 403. Trying a non secure connection over HTTP. We highly recommend configuring TLS to access the kubelet
2020-10-23T01:28:20.831014266Z 2020-10-23 01:28:20 UTC | PROCESS | INFO | (collector.go:157 in runCheck) | Finished process check #2 in 9.506448ms
2020-10-23T01:28:20.924494655Z 2020-10-23 01:28:20 UTC | CORE | WARN | (pkg/util/ec2/ec2_tags.go:90 in GetTags) | unable to get tags from aws and cache is empty: unable to fetch EC2 API, Get "http://169.254.169.254/latest/dynamic/instance-identity/document/": dial tcp 169.254.169.254:80: i/o timeout (Client.Timeout exceeded while awaiting headers)
2020-10-23T01:28:20.924685111Z 2020-10-23 01:28:20 UTC | CORE | INFO | (pkg/util/kubernetes/kubelet/kubelet.go:714 in setKubeletHost) | EKS on Fargate mode detected, will proxy calls to the Kubelet through the APIServer at https://172.20.0.1:443/api/v1/nodes/fargate-ip-10-9-22-178.ec2.internal/proxy/
2020-10-23T01:28:20.924829655Z 2020-10-23 01:28:20 UTC | CORE | INFO | (pkg/util/kubernetes/kubelet/init.go:30 in buildTLSConfig) | Skipping TLS verification
2020-10-23T01:28:21.021496382Z 2020-10-23 01:28:21 UTC | CORE | WARN | (pkg/util/kubernetes/kubelet/kubelet.go:526 in setupKubeletAPIEndpoint) | Failed to securely reach the kubelet over HTTPS, received a status 403. Trying a non secure connection over HTTP. We highly recommend configuring TLS to access the kubelet
2020-10-23T01:28:21.034212379Z 2020-10-23 01:28:21 UTC | PROCESS | INFO | (pkg/forwarder/transaction.go:293 in internalProcess) | Successfully posted payload to "https://process.datadoghq.com/api/v1/collector", the agent will only log transaction success every 500 transactions
2020-10-23T01:28:22.022345546Z 2020-10-23 01:28:22 UTC | CORE | WARN | (pkg/util/gce/gce_tags.go:48 in getCachedTags) | unable to get tags from gce and cache is empty: Get "http://169.254.169.254/computeMetadata/v1/?recursive=true": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2020-10-23T01:28:23.323157843Z 2020-10-23 01:28:23 UTC | CORE | INFO | (pkg/metadata/host/host.go:187 in getNetworkMeta) | could not get network metadata: could not detect network ID
2020-10-23T01:28:23.340037660Z 2020-10-23 01:28:23 UTC | CORE | INFO | (pkg/serializer/serializer.go:356 in sendMetadata) | Sent metadata payload, size (raw/compressed): 2839/1257 bytes.
2020-10-23T01:28:24.039319627Z 2020-10-23 01:28:24 UTC | CORE | INFO | (pkg/collector/runner/runner.go:261 in work) | check:kubelet | Running check
2020-10-23T01:28:24.039763654Z 2020-10-23 01:28:24 UTC | CORE | ERROR | (pkg/collector/python/kubeutil.go:40 in getConnections) | connection to kubelet failed: temporary failure in kubeutil, will retry later: try delay not elapsed yet
2020-10-23T01:28:24.041136843Z 2020-10-23 01:28:24 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:292 in work) | Error running check kubelet: [{"message": "Unable to detect the kubelet URL automatically: cannot connect: https: \"Get \\\"https://:10250/pods\\\": dial tcp :10250: connect: connection refused\", http: \"Get \\\"http://:10255/pods\\\": dial tcp :10255: connect: connection refused\"", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 828, in run\n    self.check(instance)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kubelet/kubelet.py\", line 295, in check\n    raise CheckException(\"Unable to detect the kubelet URL automatically: \" + kubelet_conn_info.get('err', ''))\ndatadog_checks.base.errors.CheckException: Unable to detect the kubelet URL automatically: cannot connect: https: \"Get \\\"https://:10250/pods\\\": dial tcp :10250: connect: connection refused\", http: \"Get \\\"http://:10255/pods\\\": dial tcp :10255: connect: connection refused\"\n"}]
2020-10-23T01:28:24.041290328Z 2020-10-23 01:28:24 UTC | CORE | INFO | (pkg/collector/runner/runner.go:327 in work) | check:kubelet | Done running check
2020-10-23T01:28:25.526248759Z 2020-10-23 01:28:25 UTC | CORE | ERROR | (pkg/autodiscovery/config_poller.go:123 in collect) | Unable to collect configurations from provider kubernetes: temporary failure in kubeutil, will retry later: try delay not elapsed yet
2020-10-23T01:28:30.824412748Z 2020-10-23 01:28:30 UTC | PROCESS | INFO | (pkg/util/kubernetes/kubelet/kubelet.go:714 in setKubeletHost) | EKS on Fargate mode detected, will proxy calls to the Kubelet through the APIServer at https://172.20.0.1:443/api/v1/nodes/fargate-ip-10-9-22-178.ec2.internal/proxy/
2020-10-23T01:28:30.824612150Z 2020-10-23 01:28:30 UTC | PROCESS | INFO | (pkg/util/kubernetes/kubelet/init.go:30 in buildTLSConfig) | Skipping TLS verification
2020-10-23T01:28:30.829810716Z 2020-10-23 01:28:30 UTC | PROCESS | WARN | (pkg/util/kubernetes/kubelet/kubelet.go:526 in setupKubeletAPIEndpoint) | Failed to securely reach the kubelet over HTTPS, received a status 403. Trying a non secure connection over HTTP. We highly recommend configuring TLS to access the kubelet
2020-10-23T01:28:30.830633075Z 2020-10-23 01:28:30 UTC | PROCESS | INFO | (collector.go:157 in runCheck) | Finished process check #3 in 9.154047ms
2020-10-23T01:28:31.523832204Z 2020-10-23 01:28:31 UTC | TRACE | INFO | (pkg/util/kubernetes/kubelet/kubelet.go:714 in setKubeletHost) | EKS on Fargate mode detected, will proxy calls to the Kubelet through the APIServer at https://172.20.0.1:443/api/v1/nodes/fargate-ip-10-9-22-178.ec2.internal/proxy/
2020-10-23T01:28:31.523950319Z 2020-10-23 01:28:31 UTC | TRACE | INFO | (pkg/util/kubernetes/kubelet/init.go:30 in buildTLSConfig) | Skipping TLS verification
2020-10-23T01:28:31.528839712Z 2020-10-23 01:28:31 UTC | TRACE | WARN | (pkg/util/kubernetes/kubelet/kubelet.go:526 in setupKubeletAPIEndpoint) | Failed to securely reach the kubelet over HTTPS, received a status 403. Trying a non secure connection over HTTP. We highly recommend configuring TLS to access the kubelet
2020-10-23T01:28:32.039495149Z 2020-10-23 01:28:32 UTC | CORE | INFO | (pkg/collector/runner/runner.go:261 in work) | check:eks_fargate | Running check
2020-10-23T01:28:32.039518121Z 2020-10-23 01:28:32 UTC | CORE | INFO | (pkg/collector/runner/runner.go:327 in work) | check:eks_fargate | Done running check
2020-10-23T01:28:33.627787576Z 2020-10-23 01:28:33 UTC | CORE | INFO | (pkg/util/kubernetes/kubelet/kubelet.go:714 in setKubeletHost) | EKS on Fargate mode detected, will proxy calls to the Kubelet through the APIServer at https://172.20.0.1:443/api/v1/nodes/fargate-ip-10-9-22-178.ec2.internal/proxy/
2020-10-23T01:28:33.627878933Z 2020-10-23 01:28:33 UTC | CORE | INFO | (pkg/util/kubernetes/kubelet/init.go:30 in buildTLSConfig) | Skipping TLS verification
2020-10-23T01:28:33.632979356Z 2020-10-23 01:28:33 UTC | CORE | WARN | (pkg/util/kubernetes/kubelet/kubelet.go:526 in setupKubeletAPIEndpoint) | Failed to securely reach the kubelet over HTTPS, received a status 403. Trying a non secure connection over HTTP. We highly recommend configuring TLS to access the kubelet
2020-10-23T01:28:35.526102059Z 2020-10-23 01:28:35 UTC | CORE | ERROR | (pkg/autodiscovery/config_poller.go:123 in collect) | Unable to collect configurations from provider kubernetes: temporary failure in kubeutil, will retry later: try delay not elapsed yet
2020-10-23T01:28:39.039470107Z 2020-10-23 01:28:39 UTC | CORE | INFO | (pkg/collector/runner/runner.go:261 in work) | check:kubelet | Running check
2020-10-23T01:28:39.040448300Z 2020-10-23 01:28:39 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:292 in work) | Error running check kubelet: [{"message": "Unable to detect the kubelet URL automatically: cannot connect: https: \"Get \\\"https://:10250/pods\\\": dial tcp :10250: connect: connection refused\", http: \"Get \\\"http://:10255/pods\\\": dial tcp :10255: connect: connection refused\"", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 828, in run\n    self.check(instance)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kubelet/kubelet.py\", line 295, in check\n    raise CheckException(\"Unable to detect the kubelet URL automatically: \" + kubelet_conn_info.get('err', ''))\ndatadog_checks.base.errors.CheckException: Unable to detect the kubelet URL automatically: cannot connect: https: \"Get \\\"https://:10250/pods\\\": dial tcp :10250: connect: connection refused\", http: \"Get \\\"http://:10255/pods\\\": dial tcp :10255: connect: connection refused\"\n"}]
2020-10-23T01:28:39.040604712Z 2020-10-23 01:28:39 UTC | CORE | INFO | (pkg/collector/runner/runner.go:327 in work) | check:kubelet | Done running check
2020-10-23T01:28:40.824446620Z 2020-10-23 01:28:40 UTC | PROCESS | INFO | (pkg/util/kubernetes/kubelet/kubelet.go:714 in setKubeletHost) | EKS on Fargate mode detected, will proxy calls to the Kubelet through the APIServer at https://172.20.0.1:443/api/v1/nodes/fargate-ip-10-9-22-178.ec2.internal/proxy/
2020-10-23T01:28:40.824470177Z 2020-10-23 01:28:40 UTC | PROCESS | INFO | (pkg/util/kubernetes/kubelet/init.go:30 in buildTLSConfig) | Skipping TLS verification
2020-10-23T01:28:40.829379036Z 2020-10-23 01:28:40 UTC | PROCESS | WARN | (pkg/util/kubernetes/kubelet/kubelet.go:526 in setupKubeletAPIEndpoint) | Failed to securely reach the kubelet over HTTPS, received a status 403. Trying a non secure connection over HTTP. We highly recommend configuring TLS to access the kubelet
2020-10-23T01:28:40.830267214Z 2020-10-23 01:28:40 UTC | PROCESS | INFO | (collector.go:157 in runCheck) | Finished process check #4 in 8.786437ms
2020-10-23T01:28:45.526222028Z 2020-10-23 01:28:45 UTC | CORE | ERROR | (pkg/autodiscovery/config_poller.go:123 in collect) | Unable to collect configurations from provider kubernetes: temporary failure in kubeutil, will retry later: try delay not elapsed yet
2020-10-23T01:28:45.536641144Z 2020-10-23 01:28:45 UTC | CORE | INFO | (pkg/autodiscovery/autoconfig.go:363 in initListenerCandidates) | kubelet listener cannot start, will retry: temporary failure in kubeutil, will retry later: try delay not elapsed yet
2020-10-23T01:28:47.039415952Z 2020-10-23 01:28:47 UTC | CORE | INFO | (pkg/collector/runner/runner.go:261 in work) | check:eks_fargate | Running check
2020-10-23T01:28:47.039750169Z 2020-10-23 01:28:47 UTC | CORE | INFO | (pkg/collector/runner/runner.go:327 in work) | check:eks_fargate | Done running check
2020-10-23T01:28:50.824650192Z 2020-10-23 01:28:50 UTC | PROCESS | INFO | (collector.go:159 in runCheck) | Finished process check #5 in 3.081572ms. First 5 check runs finished, next runs will be logged every 20 runs.
2020-10-23T01:28:54.039344258Z 2020-10-23 01:28:54 UTC | CORE | INFO | (pkg/collector/runner/runner.go:261 in work) | check:kubelet | Running check
2020-10-23T01:28:54.040321386Z 2020-10-23 01:28:54 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:292 in work) | Error running check kubelet: [{"message": "Unable to detect the kubelet URL automatically: cannot connect: https: \"Get \\\"https://:10250/pods\\\": dial tcp :10250: connect: connection refused\", http: \"Get \\\"http://:10255/pods\\\": dial tcp :10255: connect: connection refused\"", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 828, in run\n    self.check(instance)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kubelet/kubelet.py\", line 295, in check\n    raise CheckException(\"Unable to detect the kubelet URL automatically: \" + kubelet_conn_info.get('err', ''))\ndatadog_checks.base.errors.CheckException: Unable to detect the kubelet URL automatically: cannot connect: https: \"Get \\\"https://:10250/pods\\\": dial tcp :10250: connect: connection refused\", http: \"Get \\\"http://:10255/pods\\\": dial tcp :10255: connect: connection refused\"\n"}]
2020-10-23T01:28:54.040404723Z 2020-10-23 01:28:54 UTC | CORE | INFO | (pkg/collector/runner/runner.go:327 in work) | check:kubelet | Done running check
2020-10-23T01:28:55.526610792Z 2020-10-23 01:28:55 UTC | CORE | INFO | (pkg/util/kubernetes/kubelet/kubelet.go:714 in setKubeletHost) | EKS on Fargate mode detected, will proxy calls to the Kubelet through the APIServer at https://172.20.0.1:443/api/v1/nodes/fargate-ip-10-9-22-178.ec2.internal/proxy/
2020-10-23T01:28:55.526759965Z 2020-10-23 01:28:55 UTC | CORE | INFO | (pkg/util/kubernetes/kubelet/init.go:30 in buildTLSConfig) | Skipping TLS verification

The important bit and the one that continues to repeat itself is Get \\\"http://:10255/pods\\\": dial tcp :10255: connect: connection refused.

I followed this tutorial and the documentation to get this set up, but there is very little documentation on EKS + Fargate.

This is a similar issue to datadog/integrations#2582 && datadog/datadog-agent#2582 (and a bunch of others).

It is worth noting that I do have the datadog agent running successfully on my normal EKS worker nodes, but I have yet to have any success with Fargate. I would appreciate a pointer in the right direction or suggestions for how to debug this further. For example, I believe I have RBAC set up correctly (yaml below), but how can I test that? Thanks!

Describe what you expected:

I expected the pod to run without errors and be able to reach the kubelet.

Steps to reproduce the issue:

Here is my datadog agent sidecar helm template:

- image: datadog/agent:7
  name: datadog-agent

  ## Enabling port 8125 for DogStatsD metric collection
  ports:
  - containerPort: 8125
    name: dogstatsdport
    protocol: UDP

  - containerPort: 8126
    name: traceport
    protocol: TCP

  env:

  - name: DD_API_KEY
    valueFrom:
      secretKeyRef:
        name: datadog-secrets
        key: api-key

  - name: DD_APP_KEY
    valueFrom:
      secretKeyRef:
        name: datadog-secrets
        key: app-key

  - name: DD_ENV
    value: {{ include "REDACTED" . }}

  - name: DD_TAGS
    value: 'cluster_name:{{ include "REDACTED" . }}'

  - name: DD_KUBERNETES_POD_LABELS_AS_TAGS
    value: '{"app.kubernetes.io/name": "kube_app_name","app.kubernetes.io/version": "kube_app_version"}'

  - name: DD_COLLECT_KUBERNETES_EVENTS
    value: "true"

  - name: DD_LEADER_ELECTION
    value: "true"

  - name: DD_PROCESS_AGENT_ENABLED
    value: "true"

  - name: DD_LOG_LEVEL
    value: "INFO"

  - name: DD_LOGS_ENABLED
    value: "true"

  - name: DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL
    value: "true"

  - name: DD_CONTAINER_EXCLUDE
    value: "name:datadog-agent"

  - name: DD_APM_ENABLED
    value: "true"

  - name: DD_EKS_FARGATE
    value: "true"

  - name: DD_KUBELET_TLS_VERIFY
    value: "false"

  - name: DD_KUBERNETES_KUBELET_NODENAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName

  resources:
    requests:
      memory: "256Mi"
      cpu: "200m"
    limits:
      memory: "256Mi"
      cpu: "200m"

  volumeMounts:
    - name: app-logs
      mountPath: /var/log/containers/

    - name: {{ include "oc-lib.ddConfigMapName" . }}
      mountPath: /etc/datadog-agent/conf.d/custom_log_collection.d/

The underlying app's service account has the following RBAC permissions bound to it and the service account directory is mounted:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: datadog-agent
rules:
  - apiGroups:
      - ""
    resources:
      - nodes/metrics
      - nodes/spec
      - nodes/stats
      - nodes/proxy
      - nodes/pods
      - nodes/healthz
    verbs:
      - get

Additional environment details (Operating System, Cloud provider, etc):

Kubernetes Version: 1.17
EKS Platform: eks.3

dogewithit commented 3 years ago

I have the same issue

nmadmon commented 3 years ago

I have the same issue:

root@datadog-rx29v:/# env | grep DD_KUBERNETES_KUBELET_HOST
DD_KUBERNETES_KUBELET_HOST=172.50.0.90
root@datadog-rx29v:/# curl $DD_KUBERNETES_KUBELET_HOST:10255/healthz
curl: (7) Failed to connect to 172.50.0.90 port 10255: Connection refused

Gowiem commented 3 years ago

@assinnata @nmadmon I'm being told by DD support that this is likely related to permissions, which I suspected but didn't have a good way to test. I'll be testing that out today, and if that ends up being the case I'll let you folks know.

nmadmon commented 3 years ago

@Gowiem, did you manage to find the root cause?

Gowiem commented 3 years ago

@nmadmon @assinnata I did. It did end up being an RBAC permissions issue. Here are the notes from DD support that helped me figure that out:

Could you please run the following command inside the Datadog Pod?

TOKEN=$(</var/run/secrets/kubernetes.io/serviceaccount/token) && curl https://$DD_KUBERNETES_KUBELET_HOST:10250/pods -v -k -H "Authorization: Bearer $TOKEN"

If this works, let's try adding the SSL certificate:

TOKEN=$(</var/run/secrets/kubernetes.io/serviceaccount/token) && curl https://$DD_KUBERNETES_KUBELET_HOST:10250/pods -v --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt -H "Authorization: Bearer $TOKEN"

If this doesn't work, this issue might be an authorization issue.
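
As a rough aid for reading the result of the curl tests above, the HTTP status code can be mapped to a likely cause. This is only a sketch and not part of the support notes: the function name `explain_kubelet_status` is hypothetical, and the mapping reflects general HTTP semantics (401 for a rejected token, 403 for a denied authorization, as in the `received a status 403` warning in the logs) rather than anything specific to the Agent.

```shell
#!/bin/sh
# Hypothetical helper: map the HTTP status returned by a kubelet /pods
# request to a likely cause. The wording of the hints is an assumption.
explain_kubelet_status() {
  case "$1" in
    200) echo "ok: token accepted and request authorized" ;;
    401) echo "unauthenticated: token missing or rejected" ;;
    403) echo "forbidden: token accepted but RBAC denies the request" ;;
    000) echo "no response: connection refused or TLS failure" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# Usage inside the Agent container (commented out: needs a live cluster).
# curl prints 000 for %{http_code} when no HTTP response was received.
# TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
# code=$(curl -s -k -o /dev/null -w '%{http_code}' \
#   -H "Authorization: Bearer $TOKEN" \
#   "https://$DD_KUBERNETES_KUBELET_HOST:10250/pods")
# explain_kubelet_status "$code"
```

A 403 here points at the ClusterRole or its binding rather than at the network, which matches how this issue was eventually resolved.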

Use the following Agent RBAC when deploying the Agent as a sidecar in AWS EKS Fargate:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: datadog-agent
rules:
  - apiGroups:
      - ""
    resources:
      - nodes/metrics
      - nodes/spec
      - nodes/stats
      - nodes/proxy
      - nodes/pods
      - nodes/healthz
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: datadog-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: datadog-agent
subjects:
  - kind: ServiceAccount
    name: datadog-agent
    namespace: default
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: datadog-agent
  namespace: default
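
One possible way to check that the binding above actually applies to the service account the pod runs as is `kubectl auth can-i` with impersonation. The sketch below is an assumption-laden illustration, not something from the support notes: the helper names `sa_subject` and `check_agent_rbac` are hypothetical, the `default` namespace and `datadog-agent` account are taken from the manifest and may differ in your cluster, and the caller needs permission to impersonate service accounts.

```shell
#!/bin/sh
# Hypothetical helper: build the impersonation subject for a service account.
# $1 = namespace, $2 = service account name
sa_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Hypothetical helper: ask the API server whether the service account may
# "get" each node subresource listed in the ClusterRole. kubectl prints
# "yes" or "no" for every query.
check_agent_rbac() {
  subject=$(sa_subject "$1" "$2")
  for sub in metrics spec stats proxy pods healthz; do
    printf 'nodes/%s: ' "$sub"
    kubectl auth can-i get nodes --subresource="$sub" --as="$subject"
  done
}

# Usage (commented out: requires kubectl access to the cluster):
# check_agent_rbac default datadog-agent
```

If any line reports `no`, the pod's service account is probably not the subject of the ClusterRoleBinding, for example because the application pod uses its own service account rather than `datadog-agent`.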

Could you connect directly to the host and run the following?

ps aux | grep kubelet | grep -v grep

Is --authentication-token-webhook set?

Good luck with it!

bhanu8824 commented 1 year ago

"Unable to detect the kubelet URL automatically: impossible to reach Kubelet with host: 172.31.33.128. Please check if your setup requires kubelet_tls_verify = false. Activate debug logs to see all attempts made"

I am getting this error.

diogobaeder commented 1 week ago

Setting the DD_KUBELET_TLS_VERIFY env var to "false" in the agent did the trick for me.