DataDog / helm-charts

Helm charts for Datadog products
Apache License 2.0

HELP error installing datadog agent on kubernetes 1.27.3 on premise cluster with helm charts #1097

Open llyons opened 1 year ago

llyons commented 1 year ago

Describe what happened:

We are in a trial with Datadog to determine if we should move forward with Datadog as a monitoring solution, and are trying to install the Datadog Kubernetes agent on an on-premise k8s cluster.

We are running Kubernetes 1.27.3 on CentOS 7 Linux machines.

We have tried both the Datadog Operator (manifest) approach and the Helm chart.

Focusing on using the helm chart approach

Our values.yaml file is this:

registry: gcr.io/datadoghq

datadog:

  apiKey:  # <DATADOG_API_KEY>
  apiKeyExistingSecret:   datadog-secret
  appKeyExistingSecret:   datadog-secret
  site:  us3.datadoghq.com
  kubeStateMetricsEnabled: true
  kubeStateMetricsNetworkPolicy:
    create: false

  kubeStateMetricsCore:
    enabled: true

  clusterName: olh-k8s-upper

  kubelet:
    host:
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
    tlsVerify:  false

  logs:

    enabled: true
    containerCollectAll: true

  apm:

    socketEnabled: true
    portEnabled: true
    enabled: false
    port: 8126

  env:
    - name: DD_HOSTNAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName

  serviceMonitoring:

    enabled: true

  prometheusScrape:

    enabled: true
    serviceEndpoints: true

  processAgent:
    enabled: true
  criSocketPath:  /var/run/containerd/containerd.sock

We set up a Datadog secret as follows:

kubectl create secret generic datadog-secret --from-literal api-key=5d299a1b5a9e758e0b3.......... --from-literal app-key=e43c096eed3adfa18...........
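As a quick, generic sanity check (not chart-specific), you can confirm that the secret's data keys are named as the chart expects, i.e. api-key and app-key:

kubectl describe secret datadog-secret
# The Data section should list both keys, e.g.:
#   api-key:  ... bytes
#   app-key:  ... bytes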

We executed the helm install like this:

helm install datadog -f values.yaml --set datadog.apiKey=5d299a1b5a9e758e0b3............. datadog/datadog --set targetSystem=linux

kubectl get po -A

NAMESPACE   NAME                                          READY   STATUS             RESTARTS        AGE
default     datadog-5dsrp                                 3/4     CrashLoopBackOff   6 (77s ago)     7m21s
default     datadog-cluster-agent-586d86b7d6-f5252        1/1     Running            0               9m16s
default     datadog-hx7qb                                 3/4     CrashLoopBackOff   6 (3m21s ago)   9m16s
default     datadog-kube-state-metrics-5c77dcd6d5-97gvq   1/1     Running            0               9m16s

We are getting a number of errors, with none of the agents coming up. (Sorry, the logs generated from doing k logs datadog-2rgn9 -c agent are very large.)

2023-06-27 14:31:37 UTC | CORE | INFO | (pkg/util/kubernetes/kubelet/kubelet_client.go:285 in checkKubeletConnection) | Successful configuration found for Kubelet, using URL: https://172.29.4.71:10250
2023-06-27 14:31:38 UTC | CORE | ERROR | (pkg/workloadmeta/collectors/internal/kubemetadata/kubemetadata.go:73 in Start) | Could not initialise the communication with the cluster agent: temporary failure in clusterAgentClient, will retry later: "https://10.105.173.154:5005/version" is unavailable: timeout calling "https://10.105.173.154:5005/version": Get "https://10.105.173.154:5005/version": dial tcp 10.105.173.154:5005: i/o timeout

2023-06-27 14:31:40 UTC | CORE | ERROR | (pkg/workloadmeta/collectors/internal/kubemetadata/kubemetadata.go:73 in Start) | Could not initialise the communication with the cluster agent: temporary failure in clusterAgentClient, will retry later: "https://10.105.173.154:5005/version" is unavailable: timeout calling "https://10.105.173.154:5005/version": Get "https://10.105.173.154:5005/version": dial tcp 10.105.173.154:5005: i/o timeout
2023-06-27 14:31:41 UTC | CORE | ERROR | (pkg/workloadmeta/collectors/internal/kubemetadata/kubemetadata.go:73 in Start) | Could not initialise the communication with the cluster agent: temporary failure in clusterAgentClient, will retry later: try delay not elapsed yet
2023-06-27 14:31:42 UTC | CORE | ERROR | (pkg/workloadmeta/collectors/internal/kubemetadata/kubemetadata.go:73 in Start) | Could not initialise the communication with the cluster agent: temporary failure in clusterAgentClient, will retry later: try delay not elapsed yet
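Those i/o timeouts indicate the node agent cannot reach the cluster-agent Service on port 5005. A generic way to check that Service and its endpoints, assuming the default names for a release called datadog:

kubectl get svc datadog-cluster-agent
kubectl get endpoints datadog-cluster-agent
# An existing Service with empty endpoints, or a ClusterIP unreachable from the nodes,
# would point at a CNI / network-policy / firewall problem rather than the chart itself.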

Describe what you expected:

Expected all the agents to start up and run. Here is the output of the helm install:

NAME: datadog
LAST DEPLOYED: Tue Jun 27 08:40:35 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Datadog agents are spinning up on each node in your cluster. After a few
minutes, you should see your agents starting in your event stream:
    https://app.datadoghq.com/event/explorer
You disabled creation of Secret containing API key, therefore it is expected
that you create Secret named 'datadog-secret' which includes a key called 'api-key' containing the API key.

The Datadog Agent is listening on port 8126 for APM service.

###################################################################################
####   WARNING: Cluster-Agent should be deployed in high availability mode     ####
###################################################################################

The Cluster-Agent should be in high availability mode because the following features
are enabled:
* Admission Controller

To run in high availability mode, our recommendation is to update the chart
configuration with:
* set `clusterAgent.replicas` value to `2` replicas .
* set `clusterAgent.createPodDisruptionBudget` to `true`.
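Expressed in values.yaml, that recommendation would look roughly like this (keys taken from the chart notes above):

clusterAgent:
  replicas: 2
  createPodDisruptionBudget: true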

Steps to reproduce the issue:

Additional environment details (Operating System, Cloud provider, etc):

on prem k8s cluster

CentOS 7 machines, Kubernetes 1.27.3, 1 control plane, 2 Linux workers

levan-m commented 1 year ago

Hello, thanks for submitting the issue. A few questions which would help us debug the issue:

llyons commented 1 year ago

The api key and app key are both in us3.datadoghq.com

We do have some data showing up in US3; however, 2 of the pods are not running.

NAMESPACE   NAME                                          READY   STATUS             RESTARTS      AGE
default     datadog-cluster-agent-586d86b7d6-f5252        1/1     Running            0             66m
default     datadog-frc6x                                 3/4     CrashLoopBackOff   4 (25s ago)   2m4s
default     datadog-kube-state-metrics-5c77dcd6d5-97gvq   1/1     Running            0             66m
default     datadog-tlnq4                                 3/4     CrashLoopBackOff   4 (31s ago)   2m4s

Getting logs for the agent container, I see no errors.

kubectl logs datadog-frc6x -c agent ---> no errors
kubectl logs datadog-tlnq4 -c agent ---> no errors

The system-probe logs show some errors.

kubectl logs datadog-frc6x -c system-probe

faccessat2 seems blocked by the seccomp profile of an old version of docker.
clone3 seems blocked by the seccomp profile of an old version of docker.
load a seccomp profile to force ENOSYS.
2023-06-27 15:42:11 UTC | SYS-PROBE | WARN | (pkg/util/log/log.go:618 in func1) | Unknown key in config file: runtime_security_config.syscall_monitor.enabled
2023-06-27 15:42:11 UTC | SYS-PROBE | WARN | (pkg/util/log/log.go:618 in func1) | Unknown key in config file: runtime_security_config.activity_dump.cgroup_wait_list_size
2023-06-27 15:42:11 UTC | SYS-PROBE | WARN | (pkg/util/log/log.go:618 in func1) | Unknown key in config file: runtime_security_config.network.enabled
2023-06-27 15:42:11 UTC | SYS-PROBE | WARN | (pkg/util/log/log.go:618 in func1) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec
2023-06-27 15:42:11 UTC | SYS-PROBE | INFO | (pkg/config/environment_detection.go:123 in detectFeatures) | 3 Features detected from environment: kubernetes,cri,containerd
2023-06-27 15:42:11 UTC | SYS-PROBE | INFO | (pkg/runtime/runtime.go:27 in func1) | runtime: final GOMAXPROCS value is: 4
2023-06-27 15:42:11 UTC | SYS-PROBE | INFO | (comp/core/log/logger.go:87 in Infof) | starting system-probe v7.45.0
2023-06-27 15:42:11 UTC | SYS-PROBE | INFO | (pkg/network/tracer/utils_linux.go:34 in IsTracerSupportedByOS) | running on platform: centos
2023-06-27 15:42:11 UTC | SYS-PROBE | INFO | (cmd/system-probe/modules/network_tracer.go:60 in func3) | enabling universal service monitoring (USM)
2023-06-27 15:42:11 UTC | SYS-PROBE | INFO | (pkg/network/tracer/tracer.go:126 in newTracer) | detected kernel version 3.10.0, will use kprobes from kernel version < 4.1.0
2023-06-27 15:42:11 UTC | SYS-PROBE | ERROR | (cmd/system-probe/api/module/loader.go:65 in Register) | error creating module network_tracer: Universal Service Monitoring (USM) requires a Linux kernel version of 4.14.0 or higher. We detected 3.10.0
2023-06-27 15:42:11 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:55 in Register) | module tcp_queue_length_tracer disabled
2023-06-27 15:42:11 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:55 in Register) | module oom_kill_probe disabled
2023-06-27 15:42:11 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:55 in Register) | module event_monitor disabled
2023-06-27 15:42:11 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:55 in Register) | module process disabled
2023-06-27 15:42:11 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:55 in Register) | module dynamic_instrumentation disabled
2023-06-27 15:42:11 UTC | SYS-PROBE | CRITICAL | (comp/core/log/logger.go:108 in Criticalf) | error while starting api server, exiting: failed to create system probe: no module could be loaded
Error: error while starting api server, exiting: failed to create system probe: no module could be loaded

BTW, here are the exact versions of CentOS we have.

Operating System: CentOS Linux 7 (Core)
CPE OS Name: cpe:/o:centos:centos:7
Kernel: Linux 3.10.0-1160.90.1.el7.x86_64
Architecture: x86-64

It looks like I might need to downgrade.... Can you help me with what or how I might need to change our values.yaml to make this happen?

thanks

I tried to add this to the values.yaml and it didn't change the agent version. I am not sure which agent version will work anyway, but I'm still here trying.

clusterAgent:
  enabled: true
  image:
    name: cluster-agent
    tag: 7.21.1
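Note that this snippet only pins the cluster-agent image. If the goal is to pin the node agent version instead, the chart exposes a separate agents.image block; a minimal sketch, assuming the standard chart keys and using 7.45.0 purely as an illustrative tag:

agents:
  image:
    name: agent
    tag: 7.45.0   # illustrative tag only; pick the agent version you actually need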
llyons commented 1 year ago

So I was told that I should disable serviceMonitoring in values.yaml:

serviceMonitoring:
  enabled: false
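For context, in the values.yaml posted earlier this key sits under the top-level datadog block, so the change looks like:

datadog:
  serviceMonitoring:
    enabled: false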

and now the pods are all running

not sure what we lose by turning this off.

levan-m commented 1 year ago

Hello, sorry for the delay in responding to the issue. I suppose you already answered your question based on the above findings.

Universal Service Monitoring (USM), controlled by the serviceMonitoring.enabled property, isn't compatible with your current environment running Linux kernel 3.10.0 / CentOS Linux 7.

These are the prerequisites from the USM doc:

Your service must be running on one of the following supported platforms
  Linux Kernel 4.14 and greater
  CentOS or RHEL 8.0 and greater

Hence the error log:

2023-06-27 15:42:11 UTC | SYS-PROBE | ERROR | (cmd/system-probe/api/module/loader.go:65 in Register) | error creating module network_tracer: Universal Service Monitoring (USM) requires a Linux kernel version of 4.14.0 or higher. We detected 3.10.0
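A quick, chart-independent way to confirm each node's kernel against that 4.14 requirement:

uname -r                    # run on a node; 3.10.0-1160.90.1.el7.x86_64 is below the 4.14 minimum
kubectl get nodes -o wide   # the KERNEL-VERSION column shows the same information for every node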

Regarding what you lose, this doc provides a good overview of USM. In a nutshell, with USM you gain visibility into your stacks without instrumenting code.

Please let me know if you have any questions.