DataDog / helm-charts

Helm charts for Datadog products
Apache License 2.0
346 stars 1.01k forks source link

Kubernetes State Metrics Core does not work in Datadog 3.30.5 #1056

Closed elldritch closed 1 year ago

elldritch commented 1 year ago

Describe what happened:

We upgraded to Datadog 3.30.5 this morning and experienced an issue where kubernetes_state.* metrics suddenly stopped working. In the kubectl logs of the cluster-agent pod, we found:

root@dev-tools-datadog-cluster-agent-74d44c8545-7nkfc:/# grep ERR /var/log/datadog/cluster-agent.log 
2023-05-23 18:33:57 UTC | CLUSTER | ERROR | (pkg/collector/corechecks/loader.go:72 in Load) | core.loader: could not configure check kubernetes_state_core: resource apiservices does not exist. Available resources: secrets,statefulsets,services,verticalpodautoscalers,certificatesigningrequests,daemonsets,namespaces,replicationcontrollers,pods,storageclasses,endpoints,networkpolicies,nodes,persistentvolumeclaims,replicasets,clusterroles,configmaps,clusterrolebindings,leases,ingresses,poddisruptionbudgets,volumeattachments,rolebindings,validatingwebhookconfigurations,horizontalpodautoscalers,limitranges,persistentvolumes,resourcequotas,mutatingwebhookconfigurations,roles,serviceaccounts,cronjobs,deployments,ingressclasses,jobs
2023-05-23 18:33:57 UTC | CLUSTER | ERROR | (pkg/collector/scheduler.go:201 in getChecks) | Unable to load a check from instance of config 'kubernetes_state_core': Core Check Loader: Could not configure check kubernetes_state_core: resource apiservices does not exist. Available resources: secrets,statefulsets,services,verticalpodautoscalers,certificatesigningrequests,daemonsets,namespaces,replicationcontrollers,pods,storageclasses,endpoints,networkpolicies,nodes,persistentvolumeclaims,replicasets,clusterroles,configmaps,clusterrolebindings,leases,ingresses,poddisruptionbudgets,volumeattachments,rolebindings,validatingwebhookconfigurations,horizontalpodautoscalers,limitranges,persistentvolumes,resourcequotas,mutatingwebhookconfigurations,roles,serviceaccounts,cronjobs,deployments,ingressclasses,jobs
2023-05-23 18:33:57 UTC | CLUSTER | ERROR | (pkg/collector/scheduler.go:248 in GetChecksFromConfigs) | Unable to load the check: unable to load any check from config 'kubernetes_state_core'

We suspect that this is related to a change that occurred in https://github.com/DataDog/helm-charts/compare/datadog-3.30.4...datadog-3.30.5, specifically https://github.com/DataDog/helm-charts/pull/1055.

We were able to resolve this issue by downgrading to 3.30.4, although we weren't able to root cause this error.

Describe what you expected:

I expected the kubernetes_state.* metrics to continue to work.

Steps to reproduce the issue:

From 3.30.4, we ran helm upgrade datadog datadog/datadog --reuse-values.

Additional environment details (Operating System, Cloud provider, etc):

We run a self-hosted Kubernetes cluster on top of AWS EC2 instances.

$ kubectl version --short
Client Version: v1.27.1
Kustomize Version: v5.0.1
Server Version: v1.25.7

$ helm version --short
v3.12.0+gc9f554d
clamoriniere commented 1 year ago

Hi @elldritch

Thanks for opening this issue, we are starting our investigation.

In the meantime, could you share which agent version you are using? 🙇

emmanueljourdan commented 1 year ago

same problem here, our cluster agent's version is:

Cluster Agent 7.44.1 - Commit: 299bdcd - Serialization version: v5.0.76 - Go version: go1.19.8