roy-work opened this issue 4 years ago (status: Open)
@roy-work Would you be able to upgrade the agent to the latest version, and also take advantage of the HTTP `livenessProbe` we've added (example here; it will also require the `DD_HEALTH_PORT` env var)? If this doesn't solve the CrashLoopBackOff, could you open a support ticket with us so we can investigate?
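(For anyone landing here later: the probe setup being referred to looks roughly like the snippet below. This is a minimal sketch following the pattern in Datadog's public example manifests; the port 5555, the `/live` path, and the timing values are assumptions to verify against your agent/chart version.)

```yaml
# Sketch of the node agent container spec with the HTTP liveness probe.
# Port 5555 and the /live path follow Datadog's example manifests; verify
# them against your agent version before relying on them.
containers:
  - name: agent
    image: datadog/agent:7
    env:
      - name: DD_HEALTH_PORT   # the agent serves its health endpoints on this port
        value: "5555"
    livenessProbe:
      httpGet:
        path: /live            # the readinessProbe typically uses /ready
        port: 5555
      initialDelaySeconds: 15
      periodSeconds: 15
      timeoutSeconds: 5
      failureThreshold: 6
```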
We're working through a support ticket at the moment; we'll report back here if any progress is made. We upgraded to 7.17.1 and then to 7.18.0 without much success.
@tonyffrench I'm interested in the results, as we're experiencing the same issue (agent v6.14 -> liveness probe returning HTTP 500). We still need to investigate more, but if you have any info that could help, please share.
Interested in the resolution here as well.
Update: for me, the release name was not appearing in the output of `helm ls --namespace <NAMESPACE>`. For S's & G's, I ran the delete command `helm3 del <release-name> --namespace <namespace>`, and it deleted successfully. Then I was able to proceed.
Hello,
Version: 2.3.41
Same error: `Readiness probe failed: HTTP probe failed with statuscode: 500`
Has anyone managed to resolve this?
I was having this same issue. Increasing `initialDelaySeconds` to 30 in the `livenessProbe` configuration seemed to help.
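(If you want to try the same mitigation, the change is just the `initialDelaySeconds` field on the probe. A sketch, assuming the probe layout from the example manifests above:)

```yaml
livenessProbe:
  httpGet:
    path: /live
    port: 5555
  initialDelaySeconds: 30   # raised to give the agent more time to start up
  periodSeconds: 15
  timeoutSeconds: 5
```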
Seeing this issue on version 2.37.9. I increased `initialDelaySeconds` to 30 as suggested above, but the issue came back after a 2-minute delay. Despite the liveness and readiness probes failing, the cluster agent seems to be running OK, as I can see my cluster's metrics in Datadog.
Some people seem to be "disabling" the probes: https://github.com/DataDog/datadog-agent/issues/5908#issuecomment-659533683
Logs for the cluster agent:

```
2022-10-21 17:56:26 UTC | CLUSTER | INFO | (pkg/api/healthprobe/healthprobe.go:73 in healthHandler) | Healthcheck failed on: [clusterchecks-dispatch]
2022-10-21 17:56:26 UTC | CLUSTER | INFO | (pkg/api/healthprobe/healthprobe.go:73 in healthHandler) | Healthcheck failed on: [clusterchecks-dispatch]
```
I have not yet tried the latest version of the chart, which at the time of writing is 3.1.10.
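(For anyone who would rather relax the probes than disable them: a sketch of a `values.yaml` override for the cluster agent. The `clusterAgent.livenessProbe` / `clusterAgent.readinessProbe` keys are an assumption based on recent chart versions; check `helm show values datadog/datadog` for your version before using them.)

```yaml
# values.yaml excerpt (sketch; key names assumed, verify with
# `helm show values datadog/datadog` for your chart version)
clusterAgent:
  livenessProbe:
    initialDelaySeconds: 30
    periodSeconds: 15
    timeoutSeconds: 5
    failureThreshold: 6
  readinessProbe:
    initialDelaySeconds: 30
    periodSeconds: 15
    timeoutSeconds: 5
    failureThreshold: 6
```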
Output of the info page (if this is a bug)
Describe what happened: 2 out of our 5 DD pods keep failing liveness probes, and subsequently get restarted by Kubernetes. The output,
Describe what you expected: No crashes. :)
Steps to reproduce the issue: We're not sure; in particular, we're not sure why it's only these two pods that seem to have so much trouble.
Additional environment details (Operating System, Cloud provider, etc): This is running on an Azure AKS cluster.