Closed by laszlocph 2 months ago
@laszlocph I ran into the same issues and raised an issue with the DCGM exporter: https://github.com/NVIDIA/dcgm-exporter/issues/353
@laszlocph maybe you'd like to take a look at my dashboard PR: https://github.com/NVIDIA/dcgm-exporter/pull/355. Depending on the result of https://github.com/NVIDIA/dcgm-exporter/issues/353 (de-duplication of metrics due to e.g. MIG), I might be able to remove some of the aggregations again.
There is now also a dedicated issue about cleaning up the labels: https://github.com/NVIDIA/dcgm-exporter/issues/356
@frittentheke Amazing. This did it for us. Now we have utilization metrics for MIG partitions 🥳
Hello,
I am kinda in a rabbit hole:
`DCGM_FI_DEV_GPU_UTIL` is not supported for MIG devices: https://github.com/NVIDIA/DCGM/issues/80#issuecomment-1537603016
`DCGM_FI_PROF_SM_OCCUPANCY` could be a substitute, but it is disabled by default, as seen in the output of `kubectl exec -it nvidia-dcgm-exporter-rh46x -- cat /etc/dcgm-exporter/dcp-metrics-included.csv | less`
To enable `DCGM_FI_PROF_*` I found this issue, but the piece of documentation it refers to is gone: https://github.com/NVIDIA/gpu-operator/issues/275#issuecomment-1323552018
Has anybody managed to monitor memory utilization of MIG devices? Has anybody managed to configure custom metrics for dcgm-exporter?
Thank you.
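For anyone checking which profiling metrics are switched off in their own exporter, here is a minimal sketch. It assumes that disabled metrics appear in the CSV as lines commented out with a leading `#`, which may not hold for every dcgm-exporter version. The function name `disabled_prof_metrics` is made up for illustration; the pod name and file path are the ones from the post above.

```shell
# Hypothetical helper: reads a dcgm-exporter metrics CSV on stdin and prints
# the names of DCGM_FI_PROF_* fields that are commented out.
# Assumption: disabled metrics are lines of the form "# NAME, type, help".
disabled_prof_metrics() {
  grep -E '^[[:space:]]*#[[:space:]]*DCGM_FI_PROF_' \
    | sed -E 's/^[[:space:]]*#[[:space:]]*([A-Z0-9_]+).*/\1/'
}

# Usage against a running exporter pod (pod name and path taken from the post):
# kubectl exec nvidia-dcgm-exporter-rh46x -- \
#   cat /etc/dcgm-exporter/dcp-metrics-included.csv | disabled_prof_metrics
```

If `DCGM_FI_PROF_SM_OCCUPANCY` shows up in that list, it is present in the file but commented out, so it would need to be uncommented in whatever metrics file the exporter is configured to load.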