Open Vijaygawate opened 3 months ago
The DCGM Exporter reads metrics from the GPU Node where it's installed. Please start troubleshooting from the DCGM exporter by making an HTTP call to the DCGM Exporter's /metrics endpoint.
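As a minimal sketch of that first check, the helper below reads Prometheus text-format output on stdin and succeeds only if the named metric has at least one sample line. The function name `has_metric` is hypothetical, and the service name, namespace, and port 9400 in the usage comment are assumptions that may differ in your deployment.

```shell
# Hypothetical helper: exit 0 if the metric named in $1 has at least one
# sample line (not just a "# HELP" or "# TYPE" comment) in the text on stdin.
has_metric() {
  grep -Eq "^$1(\{| )"
}

# Usage against a live exporter (service name, namespace, and port are assumptions):
#   kubectl -n <namespace> port-forward svc/<dcgm-exporter-service> 9400:9400 &
#   curl -s http://localhost:9400/metrics | has_metric DCGM_FI_DEV_GPU_UTIL && echo "metric found"
```

If the metric is present here but missing further downstream, the exporter itself is likely fine and the next place to look is the scrape and adapter configuration.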
Hello @nvvfedorov, I tried the above and I can see the metrics in the DCGM exporter, but when I run the command below:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/DCGM_FI_DEV_GPU_UTIL" | jq .
it says the metric is not available, and the HPA also reports no metrics or invalid metrics.
Ask your question
I have installed the Prometheus stack, the Prometheus adapter, and the DCGM exporter, but when I try to fetch this metric I get the error below:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/DCGM_FI_DEV_GPU_UTIL" | jq .
Error from server (NotFound): the server could not find the metric DCGM_FI_DEV_GPU_UTIL for pods
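One way to narrow down a NotFound error like this is to check whether the custom metrics API lists the metric for pods at all. The sketch below is illustrative only: `adapter_exposes` is a hypothetical helper that scans the API's resource list (JSON on stdin) for a `pods/<metric>` entry.

```shell
# Hypothetical helper: exit 0 if the custom metrics API resource list on stdin
# contains an entry named "pods/<metric>" for the metric given in $1.
adapter_exposes() {
  grep -Eq "\"name\": *\"pods/$1\""
}

# Usage against a live cluster:
#   kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | adapter_exposes DCGM_FI_DEV_GPU_UTIL \
#     && echo "exposed for pods" || echo "not exposed for pods"
```

If the metric is not listed, the adapter is not exposing it for pods, which would also explain why the HPA reports no metrics.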
My setup: I have two node groups in EKS. One is a regular EC2 instance group without GPUs, where I have installed the Prometheus stack and the Prometheus adapter. The other is a GPU node group, where I have installed the DCGM exporter.
Could this split be the cause? In other words, do I need to install all components on the GPU nodes for this to work?