DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

datadog agent fails liveness checks continuously on k8s #4606

Open roy-work opened 4 years ago

roy-work commented 4 years ago

Output of the info page (if this is a bug)

Getting the status from the agent.

===============
Agent (v6.14.1)
===============

  Status date: 2019-12-13 16:31:48.728820 UTC
  Agent start: 2019-12-13 16:31:30.122193 UTC
  Pid: 335
  Go Version: go1.12.9
  Python Version: 3.7.4
  Check Runners: 4
  Log Level: DEBUG

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    System UTC time: 2019-12-13 16:31:48.728820 UTC

  Host Info
  =========
    bootTime: 2019-10-04 18:22:49.000000 UTC
    kernelVersion: 4.15.0-1059-azure
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: 10.1
    procs: 70
    uptime: 1678h8m59s

  Hostnames
  =========
    host_aliases: [d7916b3a-7a76-4d01-8394-2fae1cba2306 aks-agentpool-26862639-2-aksdev-v2]
    hostname: aks-agentpool-26862639-2
    socket-fqdn: aks-agentpool-26862639-2
    socket-hostname: aks-agentpool-26862639-2
    host tags:
      x-cm-cluster-name:aksdev
    hostname provider: container
    unused hostname providers:
      aws: not retrieving hostname from AWS: the host is not an ECS instance, and other providers already retrieve non-default hostnames
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

=========
Collector
=========

  Running Checks
  ==============
    No checks have run yet

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  Transactions
  ============
    CheckRunsV1: 0
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 0
    Metadata: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Series: 0
    ServiceChecks: 0
    SketchSeries: 0
    Success: 0
    TimeseriesV1: 0

  API Keys status
  ===============
    API key ending with c122b: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - c122b

==========
Logs Agent
==========

  Logs Agent is not running

=========
Aggregator
=========
  Dogstatsd Metric Sample: 1
  Event: 1

=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 0
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Bytes: 0
  Udp Packet Reading Errors: 0
  Udp Packets: 1
  Uds Bytes: 0
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0

=====================
Datadog Cluster Agent
=====================

  - Datadog Cluster Agent endpoint detected: https://10.120.110.225:5005
  Successfully connected to the Datadog Cluster Agent.
  - Running: 1.3.2+commit.e3f5101

Describe what happened: 2 of our 5 DD pods keep failing liveness probes, and subsequently get restarted by Kubernetes. The output:

Events:
  Type     Reason     Age                     From                               Message
  ----     ------     ----                    ----                               -------
  Warning  Unhealthy  54m (x1734 over 6d14h)  kubelet, aks-agentpool-26862639-2  Liveness probe failed: Agent health: FAIL
=== 1 healthy components ===
healthcheck
=== 5 unhealthy components ===
aggregator, dogstatsd-main, forwarder, tagger, tagger-docker
Error: found 5 unhealthy components
  Warning  Unhealthy  14m (x2526 over 6d14h)  kubelet, aks-agentpool-26862639-2  Liveness probe failed: Agent health: FAIL
=== 1 healthy components ===
healthcheck
=== 10 unhealthy components ===
ad-config-provider-docker, ad-config-provider-kubernetes, ad-dockerprovider, ad-kubeletlistener, ad-servicelistening, aggregator, dogstatsd-main, forwarder, tagger, tagger-docker
Error: found 10 unhealthy components
  Warning  BackOff  4m12s (x32871 over 6d14h)  kubelet, aks-agentpool-26862639-2  Back-off restarting failed container
NAME                                            READY   STATUS             RESTARTS   AGE   IP             NODE                       NOMINATED NODE   READINESS GATES
<snip>-datadog-hgt57                            0/1     CrashLoopBackOff   2716       38d   10.120.0.66    aks-agentpool-26862639-2   <none>           <none>

Describe what you expected: No crashes. :)

Steps to reproduce the issue: We're not sure; in particular, we don't know why it's only these two pods that have so much trouble.

Additional environment details (Operating System, Cloud provider, etc): This is running on an Azure AKS cluster.

DylanLovesCoffee commented 4 years ago

@roy-work Would you be able to upgrade the agent to the latest version, and also take advantage of the HTTP livenessProbe we've added (example here, which will require the DD_HEALTH_PORT env var as well)? If this doesn't solve the CrashLoopBackOff could you open a support ticket with us for us to investigate?
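
For anyone landing here, a minimal sketch of what that HTTP-based probe looks like on the agent container (port, path, and timing values are illustrative assumptions based on typical chart defaults; check the manifests for your chart version):

env:
  - name: DD_HEALTH_PORT
    value: "5555"            # port the agent's health endpoint listens on (assumed value)
livenessProbe:
  httpGet:
    path: /live              # agent health endpoint (assumed path; some manifests use /health)
    port: 5555
  initialDelaySeconds: 15
  periodSeconds: 15
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 6

The env var is what makes the agent serve the health endpoint on that port, so the probe and DD_HEALTH_PORT have to agree.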

tonyffrench commented 4 years ago

Working through a support ticket at the moment; we'll report back here if any progress is made. We upgraded to 7.17.1 and then to 7.18.0 without much success.

Sheepux commented 4 years ago

@tonyffrench I'm interested in the results, as we're experiencing the same issue (agent v6.14 -> liveness probe returning HTTP 500). We need to investigate more, but any information you have would help.

mswezey23 commented 4 years ago

Interested on the resolution here as well.

mswezey23 commented 4 years ago

Update.

For me, the release name was not appearing in the helm ls --namespace <namespace> output. For S's & G's, I ran the delete command: helm3 del <release-name> --namespace <namespace>, and it deleted successfully.

Then I was able to proceed.

li-adrienloiseau commented 4 years ago

Hello, version 2.3.41, same error: Readiness probe failed: HTTP probe failed with statuscode: 500

matt-dalton commented 3 years ago

Anyone managed to resolve this?

elainaRenee commented 3 years ago

I was having this same issue. Increasing initialDelaySeconds to 30 in the livenessProbe configuration seems to have helped.
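
In the pod spec that amounts to something like the following (a sketch assuming the HTTP probe layout from the earlier comment; the other values shown are typical defaults and may differ in your chart version):

livenessProbe:
  httpGet:
    path: /live
    port: 5555
  initialDelaySeconds: 30    # raised from a lower default to give the agent more time to start
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 6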

echung808 commented 2 years ago

Seeing this issue on version 2.37.9. I increased initialDelaySeconds to 30 as suggested above but the issue came back after a 2 minute delay. Despite the liveness and readiness probe failing, the cluster agent seems to be running ok as I can see my cluster's metrics in Datadog.

Some people seem to be "disabling" the probes: https://github.com/DataDog/datadog-agent/issues/5908#issuecomment-659533683
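
For chart-based installs, the probes can usually be loosened (or effectively disabled) from values.yaml instead of editing the rendered manifests. A hedged sketch for the cluster agent (key names assume the datadog/datadog chart's clusterAgent.livenessProbe / clusterAgent.readinessProbe overrides; verify against the values.yaml of the chart version you actually run):

clusterAgent:
  livenessProbe:
    initialDelaySeconds: 60   # loosen rather than remove the probe
    periodSeconds: 30
    failureThreshold: 10
  readinessProbe:
    initialDelaySeconds: 60
    periodSeconds: 30
    failureThreshold: 10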

Logs for cluster agent

2022-10-21 17:56:26 UTC | CLUSTER | INFO | (pkg/api/healthprobe/healthprobe.go:73 in healthHandler) | Healthcheck failed on: [clusterchecks-dispatch]
2022-10-21 17:56:26 UTC | CLUSTER | INFO | (pkg/api/healthprobe/healthprobe.go:73 in healthHandler) | Healthcheck failed on: [clusterchecks-dispatch]

I have not yet tried the latest version of the chart, which at the time of writing is 3.1.10.