DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

/ready and /live return HTTP 200 but readiness and liveness probes are failing #6046

Open cp-jennifer opened 4 years ago

cp-jennifer commented 4 years ago

Output of the info page (if this is a bug)

Getting the status from the agent.
==============================
Datadog Cluster Agent (v1.7.0)
==============================

  Status date: 2020-07-24 00:06:35.235643 UTC
  Agent start: 2020-07-24 00:03:40.233598 UTC
  Pid: 1
  Go Version: go1.13.11
  Build arch: amd64
  Agent flavor: cluster_agent
  Check Runners: 4
  Log Level: DEBUG

  Paths
  =====
    Config File: /etc/datadog-agent/datadog-cluster.yaml
    conf.d: /etc/datadog-agent/conf.d

  Clocks
  ======
    System UTC time: 2020-07-24 00:06:35.235643 UTC

  Hostnames
  =========
    ec2-hostname: [X]
    hostname: [X]
    instance-id: [X]
    socket-fqdn: datadog-cluster-agent-[X]
    socket-hostname: datadog-cluster-agent-[X]
    hostname provider: aws
    unused hostname providers:
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET [X]

  Metadata
  ========

Leader Election
===============
  Leader Election Status:  Running
  Leader Name is: datadog-cluster-agent-[X]
  Last Acquisition of the lease: Fri, 24 Jul 2020 00:04:01 UTC
  Renewed leadership: Fri, 24 Jul 2020 00:06:32 UTC
  Number of leader transitions: 26 transitions

Custom Metrics Server
=====================
  Disabled: The external metrics provider is not enabled on the Cluster Agent

Cluster Checks Dispatching
==========================
  Status: Leader, serving requests
  Active nodes: 2
  Check Configurations: 0
    - Dispatched: 0
    - Unassigned: 0

Admission Controller
====================
  Disabled: The admission controller is not enabled on the Cluster Agent

=========
Collector
=========

  Running Checks
  ==============

    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_apiserver.d/conf.yaml.default
      Total Runs: 12
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 3, Total: 19
      Service Checks: Last Run: 3, Total: 30
      Average Execution Time : 1.768s
      Last Execution Date : 2020-07-24 00:06:28.000000 UTC
      Last Successful Execution Date : 2020-07-24 00:06:28.000000 UTC

=========
Forwarder
=========

  Transactions
  ============
    CheckRunsV1: 11
    Connections: 0
    Containers: 0
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 4
    Metadata: 0
    Pods: 0
    Processes: 0
    RTContainers: 0
    RTProcesses: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Series: 0
    ServiceChecks: 0
    SketchSeries: 0
    Success: 26
    TimeseriesV1: 11

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with: [X]

Describe what happened: The cluster-agent events show that the readiness and liveness probes fail with status code 500 even though a manual request to each endpoint reports every health check as passing.

Describe what you expected: The liveness and readiness probes should be passing.

Steps to reproduce the issue: Start the cluster-agent and make a request to the /ready and /live endpoints.

curl -v 127.0.0.1:5555/ready
*   Trying 127.0.0.1:5555...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 5555 (#0)
> GET /ready HTTP/1.1
> Host: 127.0.0.1:5555
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Date: Thu, 23 Jul 2020 23:29:34 GMT
< Content-Length: 241
< Content-Type: text/plain; charset=utf-8
< 
* Connection #0 to host 127.0.0.1 left intact
{"Healthy":["healthcheck","collector-queue","clusterchecks-leadership","clusterchecks-dispatch","aggregator","tagger","ad-servicelistening","ad-config-provider-kubernetes-endpoints","ad-config-provider-kubernetes-services"],"Unhealthy":null}

curl -v 127.0.0.1:5555/live 
*   Trying 127.0.0.1:5555...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 5555 (#0)
> GET /live HTTP/1.1
> Host: 127.0.0.1:5555
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Date: Thu, 23 Jul 2020 23:30:37 GMT
< Content-Length: 241
< Content-Type: text/plain; charset=utf-8
< 
* Connection #0 to host 127.0.0.1 left intact
{"Healthy":["healthcheck","clusterchecks-dispatch","aggregator","tagger","ad-servicelistening","ad-config-provider-kubernetes-endpoints","ad-config-provider-kubernetes-services","collector-queue","clusterchecks-leadership"],"Unhealthy":null}
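The two response bodies above list the same components in a different order, which makes them awkward to compare by eye. A minimal Python sketch for normalizing such a body is below. The function name `summarize_health` is hypothetical; the only assumption about the format is what the output above shows: a JSON object with `Healthy` and `Unhealthy` keys, either of which may be `null`.

```python
import json


def summarize_health(body: str) -> tuple[list[str], list[str]]:
    """Return (healthy, unhealthy) component names, each sorted.

    Assumes the body is a JSON object with "Healthy" and "Unhealthy"
    keys, as in the /ready and /live output above. A null or missing
    value is treated as an empty list.
    """
    data = json.loads(body)
    healthy = data.get("Healthy") or []
    unhealthy = data.get("Unhealthy") or []
    return sorted(healthy), sorted(unhealthy)
```

Run against the two bodies above, this should yield the same nine healthy components for /ready and /live and an empty unhealthy list for both.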

As seen above, both endpoints return HTTP 200, but running kubectl describe pod datadog-cluster-agent shows the following under Events:

Readiness probe failed: HTTP probe failed with statuscode: 500
Liveness probe failed: HTTP probe failed with statuscode: 500
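A single manual request returning 200 does not rule out intermittent failures between requests. One way to check, sketched below under that assumption, is to poll the endpoint repeatedly and record every non-200 response. The names `fetch_status` and `poll` are hypothetical helpers, not part of the agent; the URL in the usage note is the one from the curl commands above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 2.0) -> tuple[int, str]:
    """GET the URL and return (status code, body), including for 4xx/5xx.

    Connection-level errors (refused, timeout) are not caught here and
    will propagate to the caller.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # urllib raises on non-2xx responses; the body is still readable.
        return err.code, err.read().decode("utf-8", "replace")


def poll(
    fetch: Callable[[], tuple[int, str]],
    attempts: int,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[tuple[int, int, str]]:
    """Call fetch() `attempts` times; return (index, status, body) for non-200s."""
    failures = []
    for i in range(attempts):
        status, body = fetch()
        if status != 200:
            failures.append((i, status, body))
        if i < attempts - 1:
            sleep(interval)
    return failures
```

For example, `poll(lambda: fetch_status("http://127.0.0.1:5555/ready"), attempts=60)` would sample the readiness endpoint once a second for about a minute. An empty result would suggest the failures are not visible from where the script runs.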

Running kubectl logs datadog-cluster-agent also shows this log line:

2020-07-23 23:22:16 UTC | CLUSTER | DEBUG | (pkg/api/healthprobe/healthprobe.go:72 in healthHandler) | Healthcheck failed on: [clusterchecks-dispatch]
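To see which components fail most often over a longer log capture, lines like the one above can be tallied. The sketch below is a hypothetical helper, not an agent feature. It assumes the message always has the form `Healthcheck failed on: [...]` as shown, and that multiple component names inside the brackets would be separated by whitespace, which the single-component line above does not confirm.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the bracketed component list in the healthHandler debug message.
FAILED_RE = re.compile(r"Healthcheck failed on: \[([^\]]*)\]")


def count_failed_components(lines: Iterable[str]) -> Counter:
    """Count how often each component appears in 'Healthcheck failed on' lines."""
    counts: Counter = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            # Assumption: names inside the brackets are whitespace-separated.
            counts.update(match.group(1).split())
    return counts
```

Feeding it the output of kubectl logs (for instance via `sys.stdin`) would show whether clusterchecks-dispatch is the only component reported as failing.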

Additional environment details (Operating System, Cloud provider, etc):

haritsE commented 2 years ago

I am facing the same issue. Is there any update on this?

guitarrapc commented 2 years ago

I also see this issue with the latest Helm chart, 2.30.16. Any clues?