Getting "Error from server (NotFound): the server could not find the metric DCGM_FI_DEV_GPU_UTIL for pods", I am not getting DCGM_FI_DEV_GPU_UTIL metrics from Prometheus

Vijaygawate commented 3 months ago

Ask your question

I have installed the Prometheus stack, the Prometheus adapter, and the DCGM exporter, but when I try to fetch this metric I get the error below:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/DCGM_FI_DEV_GPU_UTIL" | jq .

Error from server (NotFound): the server could not find the metric DCGM_FI_DEV_GPU_UTIL for pods

My setup: I have two node groups in EKS. One is a regular EC2 instance group without GPUs, where I installed the Prometheus stack and the Prometheus adapter. The other is a GPU node group, where I installed the DCGM exporter.

Could this split be the cause? That is, do I need to install all components on the GPU nodes for this to work?

nvvfedorov commented 3 months ago

The DCGM Exporter reads metrics from the GPU Node where it's installed. Please start troubleshooting from the DCGM exporter by making an HTTP call to the DCGM Exporter's /metrics endpoint.
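One way to run that check from a workstation is sketched below. It is a rough outline, not an official procedure: the namespace `gpu-monitoring` and the service name `dcgm-exporter` are hypothetical placeholders, the helper function names are illustrative, and port 9400 is assumed to be the exporter's listening port (its usual default).

```shell
#!/usr/bin/env bash
# Hypothetical names: replace with the namespace and service of your install.
NAMESPACE="${NAMESPACE:-gpu-monitoring}"
SERVICE="${SERVICE:-dcgm-exporter}"

# Read Prometheus text format on stdin and print only the sample lines
# for the metric named in $1 (comment lines such as "# HELP" are skipped).
filter_metric() {
  grep -E "^$1(\{| )"
}

# Port-forward the exporter service, fetch /metrics, and show the samples
# for the requested metric. Returns non-zero if no sample was found.
check_exporter() {
  kubectl -n "$NAMESPACE" port-forward "svc/$SERVICE" 9400:9400 >/dev/null 2>&1 &
  local pf_pid=$!
  sleep 3  # crude wait for the port-forward to become ready
  curl -s http://localhost:9400/metrics | filter_metric "$1"
  local status=$?
  kill "$pf_pid"
  return "$status"
}
```

Running `check_exporter DCGM_FI_DEV_GPU_UTIL` should print one sample line per GPU if the exporter is producing the metric. If it prints nothing, the problem is on the exporter side rather than in Prometheus or the adapter.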

Vijaygawate commented 3 months ago

Hello @nvvfedorov, I tried the above and the DCGM exporter does return that metric, but when I run the command below:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/DCGM_FI_DEV_GPU_UTIL" | jq .

it says the metric is not available, and the HPA also reports no metrics or invalid metrics.
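Since the exporter itself returns the metric, a possible next check is to list what the custom metrics API serves at all. This is a minimal sketch, assuming `jq` is installed and the adapter is registered under `custom.metrics.k8s.io/v1beta1` as in the command above. The helper function names are illustrative only.

```shell
#!/usr/bin/env bash

# List every resource/metric name the custom metrics API currently serves,
# one per line (for example "pods/some_metric").
list_custom_metrics() {
  kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq -r '.resources[].name'
}

# Read metric names on stdin and succeed only if a pod-scoped entry
# for the metric named in $1 is present.
has_pod_metric() {
  grep -qx "pods/$1"
}
```

For example, `list_custom_metrics | has_pod_metric DCGM_FI_DEV_GPU_UTIL && echo served || echo "not served"`. A "not served" result would suggest that either Prometheus is not scraping the exporter or the adapter is not mapping the series to pods. This thread does not establish which of the two applies here.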

NVIDIA / dcgm-exporter

Getting "Error from server (NotFound): the server could not find the metric DCGM_FI_DEV_GPU_UTIL for pods",I am not getting DCGM_FI_DEV_GPU_UTIL metrics from prometheus #379
