Open · cp-jennifer opened this issue 4 years ago
Output of the info page (if this is a bug)
Getting the status from the agent.

```
==============================
Datadog Cluster Agent (v1.7.0)
==============================

  Status date: 2020-07-24 00:06:35.235643 UTC
  Agent start: 2020-07-24 00:03:40.233598 UTC
  Pid: 1
  Go Version: go1.13.11
  Build arch: amd64
  Agent flavor: cluster_agent
  Check Runners: 4
  Log Level: DEBUG

  Paths
  =====
    Config File: /etc/datadog-agent/datadog-cluster.yaml
    conf.d: /etc/datadog-agent/conf.d

  Clocks
  ======
    System UTC time: 2020-07-24 00:06:35.235643 UTC

  Hostnames
  =========
    ec2-hostname: [X]
    hostname: [X]
    instance-id: [X]
    socket-fqdn: datadog-cluster-agent-[X]
    socket-hostname: datadog-cluster-agent-[X]
    hostname provider: aws
    unused hostname providers:
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET [X]

  Metadata
  ========

Leader Election
===============
  Leader Election Status: Running
  Leader Name is: datadog-cluster-agent-[X]
  Last Acquisition of the lease: Fri, 24 Jul 2020 00:04:01 UTC
  Renewed leadership: Fri, 24 Jul 2020 00:06:32 UTC
  Number of leader transitions: 26 transitions

Custom Metrics Server
=====================
  Disabled: The external metrics provider is not enabled on the Cluster Agent

Cluster Checks Dispatching
==========================
  Status: Leader, serving requests
  Active nodes: 2
  Check Configurations: 0
    - Dispatched: 0
    - Unassigned: 0

Admission Controller
====================
  Disabled: The admission controller is not enabled on the Cluster Agent

=========
Collector
=========

  Running Checks
  ==============

    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_apiserver.d/conf.yaml.default
      Total Runs: 12
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 3, Total: 19
      Service Checks: Last Run: 3, Total: 30
      Average Execution Time : 1.768s
      Last Execution Date : 2020-07-24 00:06:28.000000 UTC
      Last Successful Execution Date : 2020-07-24 00:06:28.000000 UTC

=========
Forwarder
=========

  Transactions
  ============
    CheckRunsV1: 11
    Connections: 0
    Containers: 0
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 4
    Metadata: 0
    Pods: 0
    Processes: 0
    RTContainers: 0
    RTProcesses: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Series: 0
    ServiceChecks: 0
    SketchSeries: 0
    Success: 26
    TimeseriesV1: 11

  ==========
  Endpoints
  ==========
    https://app.datadoghq.com - API Key ending with: [X]
```
Describe what happened: The cluster-agent events show that the readiness and liveness probes fail with status code 500, despite all health checks passing.
Describe what you expected: The liveness and readiness probes should pass.
Steps to reproduce the issue: Start the cluster-agent and make a request to the /ready and /live endpoints:
```
curl -v 127.0.0.1:5555/ready
*   Trying 127.0.0.1:5555...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 5555 (#0)
> GET /ready HTTP/1.1
> Host: 127.0.0.1:5555
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Date: Thu, 23 Jul 2020 23:29:34 GMT
< Content-Length: 241
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host 127.0.0.1 left intact
{"Healthy":["healthcheck","collector-queue","clusterchecks-leadership","clusterchecks-dispatch","aggregator","tagger","ad-servicelistening","ad-config-provider-kubernetes-endpoints","ad-config-provider-kubernetes-services"],"Unhealthy":null}

curl -v 127.0.0.1:5555/live
*   Trying 127.0.0.1:5555...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 5555 (#0)
> GET /live HTTP/1.1
> Host: 127.0.0.1:5555
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Date: Thu, 23 Jul 2020 23:30:37 GMT
< Content-Length: 241
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host 127.0.0.1 left intact
{"Healthy":["healthcheck","clusterchecks-dispatch","aggregator","tagger","ad-servicelistening","ad-config-provider-kubernetes-endpoints","ad-config-provider-kubernetes-services","collector-queue","clusterchecks-leadership"],"Unhealthy":null}
```
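Note that the curl calls above only sample each endpoint once. A small poll loop like the sketch below (the interval, temp-file path, and wording are my own choices, not anything from the agent) can help confirm whether /live or /ready only fails intermittently, which would explain why manual checks see 200 while the kubelet occasionally sees 500:

```sh
#!/bin/sh
# Poll the cluster agent health endpoints from inside the pod and log any
# non-200 response together with its body. Port 5555 matches the curl output
# above; the 2-second interval is arbitrary.
while true; do
  for path in /live /ready; do
    code=$(curl -s -o /tmp/health_body -w '%{http_code}' "127.0.0.1:5555${path}")
    if [ "$code" != "200" ]; then
      echo "$(date -u '+%Y-%m-%d %H:%M:%S') ${path} returned ${code}: $(cat /tmp/health_body)"
    fi
  done
  sleep 2
done
```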
As seen above, both endpoints return a 200 status code, but running kubectl describe pod datadog-cluster-agent yields the following under Events:
```
kubectl describe pod datadog-cluster-agent
```
```
Readiness probe failed: HTTP probe failed with statuscode: 500
Liveness probe failed: HTTP probe failed with statuscode: 500
```
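To rule out a mismatch between what the kubelet probes and what was curled above, the probe definitions can be dumped straight from the pod spec. A minimal sketch, assuming the pod is still named datadog-cluster-agent as in the describe output:

```sh
# Print the liveness and readiness probe configuration of the first container
# (path, port, timeoutSeconds, failureThreshold) for comparison with the
# manual curl results above.
kubectl get pod datadog-cluster-agent \
  -o jsonpath='{.spec.containers[0].livenessProbe}{"\n"}{.spec.containers[0].readinessProbe}{"\n"}'
```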
And running kubectl logs datadog-cluster-agent shows this log line:
```
kubectl logs datadog-cluster-agent
```
```
2020-07-23 23:22:16 UTC | CLUSTER | DEBUG | (pkg/api/healthprobe/healthprobe.go:72 in healthHandler) | Healthcheck failed on: [clusterchecks-dispatch]
```
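Since that log points at a transient clusterchecks-dispatch failure rather than a persistently unhealthy component, one possible workaround is to make the probes more tolerant so a single bad sample does not flip the pod to unready or restart it. This is only a sketch; the Deployment name, container index, and threshold value are assumptions, not taken from this issue:

```sh
# Raise failureThreshold on both probes so one transient 500 from
# clusterchecks-dispatch is not enough to fail the probe.
# Deployment name and container index are assumptions; adjust as needed.
kubectl patch deployment datadog-cluster-agent --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold",  "value": 6},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/failureThreshold", "value": 6}
]'
```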
Additional environment details (Operating System, Cloud provider, etc):
I am facing the same issue; is there any update on this?
I also see this issue with the latest Helm chart (2.30.16). Any clue what's causing it?
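If the Cluster Agent is deployed through the Helm chart rather than patched by hand, the same relaxation of the probes can probably be expressed as chart values. This is a sketch under the assumption that the chart exposes probe overrides at clusterAgent.livenessProbe / clusterAgent.readinessProbe and that the release is named datadog; verify both against the values.yaml of your chart version:

```sh
# Key names and release name are assumptions; check values.yaml for your
# chart version before applying.
helm upgrade datadog datadog/datadog --reuse-values \
  --set clusterAgent.livenessProbe.failureThreshold=6 \
  --set clusterAgent.readinessProbe.failureThreshold=6
```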