DataDog / helm-charts

Helm charts for Datadog products

AKS cluster using `datadog.kubelet.hostCAPath` has to set `tlsVerify: false` when upgrading #636

Closed · bamarch closed this issue 2 years ago

bamarch commented 2 years ago

Describe what happened:

We are using AKS, configured as per https://docs.datadoghq.com/agent/kubernetes/distributions/?tab=helm#AKS.

This is our config snippet:

  kubelet:
    # datadog.kubelet.host -- Override kubelet IP
    host:
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    # AKS workaround recommended by Datadog https://github.com/DataDog/helm-charts/issues/114#issuecomment-768413178
    # ca.crt mounted by the service account is not the one used to target Kubelet
    # Kubelet cert does not have SAN for node IP, nor the reverse-lookup of the node IP
    hostCAPath: /etc/kubernetes/certs/kubeletserver.crt

Upgrading the chart from 2.27.8 to 2.32.6 while keeping the configuration the same results in a status failure for the "Kubelet" check.


    kubelet (7.2.1)
    ---------------
      Instance ID: kubelet:5bbc63f3938c02f4 [ERROR]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubelet.d/conf.yaml.default
      Total Runs: 1
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 2ms
      Last Execution Date : 2022-05-23 12:48:43 UTC (1653310123000)
      Last Successful Execution Date : Never
      Error: Unable to detect the kubelet URL automatically: impossible to reach Kubelet with host: aks-usrhighmem02-12168730-vmss000002. Please check if your setup requires kubelet_tls_verify = false. Activate debug logs to see all attempts made
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 1071, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kubelet/kubelet.py", line 295, in check
          raise CheckException("Unable to detect the kubelet URL automatically: " + kubelet_conn_info.get('err', ''))
      datadog_checks.base.errors.CheckException: Unable to detect the kubelet URL automatically: impossible to reach Kubelet with host: aks-usrhighmem02-12168730-vmss000002. Please check if your setup requires kubelet_tls_verify = false. Activate debug logs to see all attempts made

We no longer get logs in the Datadog app for this cluster.

It looks like the Agent was upgraded to 7.35.0 along with the chart; I've tested with 7.34.0 and this issue isn't present.

Describe what you expected:

The "Kubelet" status check would remain working and logs would continue being visible in the Datadog app

Steps to reproduce the issue:

1: Use AKS

2: Configure the Helm chart as per the docs at https://docs.datadoghq.com/agent/kubernetes/distributions/?tab=helm#AKS, using the support added in https://github.com/DataDog/helm-charts/pull/195:

  kubelet:
    # datadog.kubelet.host -- Override kubelet IP
    host:
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    # AKS workaround recommended by Datadog https://github.com/DataDog/helm-charts/issues/114#issuecomment-768413178
    # ca.crt mounted by the service account is not the one used to target Kubelet
    # Kubelet cert does not have SAN for node IP, nor the reverse-lookup of the node IP
    hostCAPath: /etc/kubernetes/certs/kubeletserver.crt

Additional environment details (Operating System, Cloud provider, etc):

AKS cluster using private networking with public DNS.


The error message itself states "Error: Unable to detect the kubelet URL automatically: impossible to reach Kubelet with host: aks-usrhighmem02-12168730-vmss000002. Please check if your setup requires kubelet_tls_verify = false. Activate debug logs to see all attempts made"

Adding `kubelet.tlsVerify: false` to the chart values does fix the issue, so we aren't blocked and there is a workaround (i.e. accepting the slightly weakened security posture).
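
For reference, this is a minimal sketch of the value we added (the key path is `datadog.kubelet.tlsVerify`, sitting alongside the existing `kubelet` block shown above):

  kubelet:
    # Accept the kubelet serving certificate even though it has no SAN;
    # this is the workaround discussed in this issue
    tlsVerify: false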

I'm mainly wondering whether support for the `hostCAPath` approach has been dropped and, if so, whether the documentation could be updated, or a note added to the changelog to warn people upgrading. Otherwise, maybe it is an unintentional regression in the underlying kubelet.py core integration.

bamarch commented 2 years ago

Some extra context:

===============
Agent (v7.35.2)
===============

  Status date: 2022-05-23 13:33:27.242 UTC (1653312807242)
  Agent start: 2022-05-23 13:19:43.72 UTC (1653311983720)
  Pid: 31381
  Go Version: go1.17.6
  Python Version: 3.8.11
  Build arch: amd64
  Agent flavor: agent
  Check Runners: 4
  Log Level: DEBUG

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: 2.017ms
    System time: 2022-05-23 13:33:27.242 UTC (1653312807242)

  Host Info
  =========
    bootTime: 2022-05-17 21:24:26 UTC (1652822666000)
    kernelArch: x86_64
    kernelVersion: 5.4.0-1077-azure
    os: linux
    platform: ubuntu
    platformFamily: debian
    platformVersion: 21.10
    procs: 223
    uptime: 135h55m24s
    virtualizationRole: host
    virtualizationSystem: kvm

  Hostnames
  =========
    cluster-name: aks-my-cluster-name
    host_aliases: [f4ee4ced-9ddb-459c-b5c9-bf61221abfd9]
    hostname: datadog-8xfg9
    socket-fqdn: datadog-8xfg9
    socket-hostname: datadog-8xfg9
    host tags:
      cluster_name:aks-my-cluster-name
      kube_cluster_name:aks-my-cluster-name
    hostname provider: os
    unused hostname providers:
      aws: not retrieving hostname from AWS: the host is not an ECS instance and other providers already retrieve non-default hostnames
      azure: azure_hostname_style is set to 'os'
      configuration/environment: hostname is empty
      container: Unable to get hostname from container API
      gce: unable to retrieve hostname from GCE: GCE metadata API error: status code 400 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

  Metadata
  ========
    agent_version: 7.35.2
    cloud_provider: Azure
    config_apm_dd_url:
    config_dd_url:
    config_logs_dd_url:
    config_logs_socks5_proxy_address:
    config_no_proxy: []
    config_process_dd_url:
    config_proxy_http:
    config_proxy_https:
    config_site:
    feature_apm_enabled: false
    feature_cspm_enabled: false
    feature_cws_enabled: false
    feature_logs_enabled: true
    feature_networks_enabled: false
    feature_networks_http_enabled: false
    feature_networks_https_enabled: false
    feature_otlp_enabled: false
    feature_process_enabled: false
    feature_processes_container_enabled: true
    flavor: agent
    hostname_source: os
    install_method_installer_version: datadog-2.33.7
    install_method_tool: helm
    install_method_tool_version: Helm
    logs_transport: HTTP
=========
APM Agent
=========
  Status: Running
  Pid: 31443
  Uptime: 832 seconds
  Mem alloc: 21,049,136 bytes
  Hostname: datadog-8xfg9
  Receiver: 0.0.0.0:8126
  Endpoints:
    https://trace.agent.us3.datadoghq.com

  Receiver (previous minute)
  ==========================
    From .NET 6.0.5 (.NET), client 2.1.0.0
      Traces received: 73 (393,491 bytes)
      Spans received: 264

    From .NET 6.0.4 (.NET), client 2.4.4.0
      Traces received: 21 (23,047 bytes)
      Spans received: 42

  Writer (previous minute)
  ========================
    Traces: 0 payloads, 0 traces, 0 events, 0 bytes
    Stats: 0 payloads, 0 stats buckets, 0 bytes

=========
Aggregator
=========
  Checks Metric Sample: 136,468
  Dogstatsd Metric Sample: 16,075
  Event: 10
  Events Flushed: 10
  Number Of Flushes: 54
  Series Flushed: 94,401
  Service Check: 696
  Service Checks Flushed: 741
=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 16,074
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Bytes: 1,798,394
  Udp Packet Reading Errors: 0
  Udp Packets: 7,430
  Uds Bytes: 373,541
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 922
  Unterminated Metric Errors: 0

=====================
Datadog Cluster Agent
=====================

  - Datadog Cluster Agent endpoint detected: https://10.200.119.172:5005
  Successfully connected to the Datadog Cluster Agent.
  - Running: 1.19.0+commit.083a221

=============
Autodiscovery
=============
  Enabled Features
  ================
    containerd
    cri
    kubernetes

vboulineau commented 2 years ago

Hello,

Yes, that's because newer Agent versions are built with Go 1.17, which dropped support for certificates without a SAN entirely (previously we were relying on the x509ignoreCN=0 workaround). We need to update the documentation to reflect that.

Unfortunately, it means we need to force `tlsVerify: false` on AKS. There's not much we can do on our side; we already reported this to Azure, but it is still not fixed on their side.
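
In practice that means the AKS `datadog.kubelet` values from the top of this issue end up looking roughly like the sketch below (keeping `hostCAPath` is harmless, but it no longer avoids the SAN problem):

  kubelet:
    # Reach the kubelet via the node name, as in the snippet at the top of this issue
    host:
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    # Agents built with Go 1.17 (7.35+) reject certificates without a SAN outright,
    # so verification has to be disabled regardless of hostCAPath
    tlsVerify: false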

bamarch commented 2 years ago

Understood, thanks for getting back to me about this.

It will be great when AKS finally updates their certificates.

Cheers!

vboulineau commented 2 years ago

Documentation at https://docs.datadoghq.com/agent/kubernetes/distributions/?tab=helm#AKS has been updated.