DataDog / helm-charts

Helm charts for Datadog products
Apache License 2.0

Health probe failed to restart agent pods #1339

Closed sneako closed 6 months ago

sneako commented 6 months ago

Describe what happened: During this recent incident https://status.datadoghq.com/incidents/q2d98y2qv54j many of our agent pods became unready. The unready state persisted until we manually triggered a rollout restart on our daemonsets.

Describe what you expected: Health probes should have restarted the pods, no manual intervention should have been required.

ZeroDeth commented 6 months ago

We had the same issue and resolved it by updating our Datadog Helm chart to the latest version. https://github.com/DataDog/helm-charts/releases/tag/datadog-3.57.3

sneako commented 6 months ago

Do you mean you were already on the latest version before the incident, and observed the health probes restart the pods during this incident, without any manual intervention? Otherwise, updating the chart would probably also cause the pods to restart, which is what fixed it in our case.

alemuro commented 6 months ago

Hello 👋 we have the same issue.

The /live endpoint returns a 200 whereas the /ready endpoint returns a 500. Because the livenessProbe succeeds, the container is not restarted.

I think the livenessProbe should return an error if there is an issue (like the API validation error), like the readinessProbe endpoint does.

Thanks!
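To illustrate the behaviour described above, here is a minimal sketch in Python of how the kubelet reacts to the two probes. It is a simplified model, not Kubernetes or Datadog code: the names `ProbeResult`, `probe_succeeded` and `kubelet_action` are hypothetical, and it ignores settings such as `failureThreshold` and `periodSeconds`. The one rule it relies on is that Kubernetes counts any HTTP status from 200 up to (but not including) 400 as a probe success.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Hypothetical container for the two HTTP statuses seen by the probes."""
    live_status: int   # HTTP status returned by the /live endpoint
    ready_status: int  # HTTP status returned by the /ready endpoint


def probe_succeeded(status: int) -> bool:
    # Kubernetes treats any HTTP status in [200, 400) as a probe success.
    return 200 <= status < 400


def kubelet_action(result: ProbeResult) -> str:
    """Simplified model of the kubelet's reaction (thresholds are ignored)."""
    if not probe_succeeded(result.live_status):
        # Only a failing liveness probe causes the container to be restarted.
        return "restart container"
    if not probe_succeeded(result.ready_status):
        # A failing readiness probe only marks the pod unready.
        return "mark pod unready, no restart"
    return "no action"


# The situation reported in this issue: /live is 200, /ready is 500.
print(kubelet_action(ProbeResult(live_status=200, ready_status=500)))
```

With /live at 200 and /ready at 500, the model lands in the second branch, which matches what was observed: the pods stay unready until something else, such as a manual rollout restart, replaces them.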

cilindrox commented 6 months ago

Hitting the same issue (readiness returns 500). We've been running the latest version of the chart, 3.57.3, since its release.

tbavelier commented 6 months ago

Hello everyone,

Thank you for the reports! This issue is already tracked in https://github.com/DataDog/datadog-agent/issues/23506#issuecomment-1984610009 which mentions:

We will be revisiting the health logic of the Agent in the future to see what we can do to prevent this from reoccurring.

Closing this in favour of the datadog-agent issue, as revisiting the logic of these /live and /ready endpoints will also most likely impact the liveness and readiness probes.