Mirantis / cri-dockerd

dockerd as a compliant Container Runtime Interface for Kubernetes
Apache License 2.0

Missing old metrics in cadvisor #127

Open orlandoalexandrescu opened 1 year ago

orlandoalexandrescu commented 1 year ago

After upgrading Kubernetes from v1.23 to v1.24, some metric labels have gone missing from a Grafana dashboard. We are using cri-dockerd as the CRI for Docker.

Previously (Kubernetes v1.23.3):

container_memory_swap{container="kube-scheduler",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podf3da42828c928b12552def61111bf21f.slice/docker-194b27fff8dbca592c0d8cbb22bb79bd8fe159ce5329187f4c27b774d46a244b.scope",image="sha256:a4183b88f6e65972c4b176b43ca59de31868635a7e43805f4c6e26203de1742f",name="k8s_kube-scheduler_kube-scheduler-master-worker-07b03d19518164ffb42f_kube-system_f3da42828c928b12552def61111bf21f_0",namespace="kube-system",pod="kube-scheduler-master-worker-07b03d19518164ffb42f"} 0 1669121874089

Currently (Kubernetes v1.24.6):

container_memory_swap{container="",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode52d0cf9b11e9c1a9b69391aa1c91ac3.slice",image="",name="",namespace="kube-system",pod="kube-scheduler-master-worker-91ea15c226fe61b78e7e"} 0 1669122010125

Notice how the "container" label for this metric changed from a value to an empty string.

We are currently running the following component versions:

Kubernetes: v1.24.6
Docker: 20.10.18
cri-dockerd: 0.2.6 (d8accf7)
cadvisor: v0.44.1

The commands we used to get the metrics were:

NODE_NAME=$(kubectl get no --no-headers | cut -f 1 -d ' ')
kubectl get --raw /api/v1/nodes/$NODE_NAME/proxy/metrics/cadvisor | grep container_memory_swap | grep kube-scheduler
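As a sanity check (not part of the original report), the same endpoint can be filtered for series whose container label is empty; assuming NODE_NAME is set as above, this confirms that only the pod/cgroup-level aggregate series remain:

kubectl get --raw /api/v1/nodes/$NODE_NAME/proxy/metrics/cadvisor | grep container_memory_swap | grep 'container=""'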

We found a very similar issue on the rancher boards:

https://github.com/rancher/rancher/issues/38934

If I can help with any further information, please let me know.

evol262 commented 1 year ago

The rationale and workaround here are very clear. We could document this, but kube-prometheus-stack is also a reasonable option.
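For reference, kube-prometheus-stack is typically installed from the prometheus-community Helm chart, roughly as below (the release name and namespace here are arbitrary placeholders):

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace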

For the sake of argument, let's imagine that we reach out to upstream and they choose not to re-enable the plugin in core k8s. What outcome would you prefer?

Starting a separate cadvisor stack (not integrated with the "normal" cadvisor in any way, unless we also wait for k8s to start up and try to inject a ServiceMonitor CRD, assuming kube-prometheus-stack is in use, or provide some other discovery mechanism) is really overkill for a CRI.
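For context, a standalone cadvisor of the kind described above would look roughly like the upstream-documented Docker invocation below (version pinned to the one already in use here; flags may need adjusting per host), and it would still need its own scrape configuration or ServiceMonitor to be discovered by Prometheus:

docker run --detach=true --name=cadvisor \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --publish=8080:8080 \
  --privileged \
  --device=/dev/kmsg \
  gcr.io/cadvisor/cadvisor:v0.44.1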