Closed by JulesBelveze 3 years ago
Hi, is the dcgm-exporter itself configured to collect that metric? It should be listed in the config csv file. Also, I'd recommend switching to the DCGM_FI_PROF_* group of metrics to monitor utilization. The metric you are using is deprecated.
Hey @nikkon-dev, thanks for your answer and the suggestion. I will switch to this group of metrics.
However, DCGM_FI_DEV_GPU_UTIL is indeed listed in the config csv file, and I can actually observe it on my Grafana dashboard. I feel like the issue is only with accessing it from the HPA.
I actually managed to access the DCGM metrics from the HPA by modifying my dcgm-exporter Service and ServiceMonitor (as suggested here), and this shows up:
>>> kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM_FI_DEV_GPU_UTIL
"name": "namespaces/DCGM_FI_DEV_GPU_UTIL",
"name": "jobs.batch/DCGM_FI_DEV_GPU_UTIL",
"name": "pods/DCGM_FI_DEV_GPU_UTIL",
"name": "services/DCGM_FI_DEV_GPU_UTIL",
Then the metric can be accessed from the HPA as an Object.
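For anyone landing here later, a minimal sketch of what an HPA reading the metric as an Object could look like. This is not the manifest from this thread: the HPA name, the namespace, the Deployment name, the Service name, the replica bounds and the target value are all hypothetical placeholders, and it assumes the autoscaling/v2 API is available on the cluster.

```yaml
# Sketch only: every name and number below is a placeholder, not taken from this issue.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa                 # hypothetical HPA name
  namespace: default            # assumption: same namespace as the described Service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-gpu-app            # hypothetical workload to scale
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Object
      object:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        describedObject:        # the object the metric is attached to
          apiVersion: v1
          kind: Service
          name: dcgm-exporter   # assumption: name of the exporter Service
        target:
          type: Value
          value: "80"           # hypothetical GPU utilization threshold
```

The describedObject is looked up in the HPA's own namespace, so the HPA and the Service the metric is attached to need to live in the same namespace.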
Hi guys,
I have a GKE cluster and I am attempting to perform HPA based on GPU consumption. I have successfully installed the DCGM exporter and I can observe the DCGM metrics from Prometheus, Grafana and Stackdriver. I am now trying to use the DCGM_FI_DEV_GPU_UTIL metric for horizontal autoscaling. I can see it available:
However, the following yaml file:
leads me to the following error:
I have checked the namespace and tried to access the metric as an Object, but without success... Any idea what could have gone wrong?
Cheers!