NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

How to configure dcgm metrics for MIG? #798

Closed (laszlocph closed 2 months ago)

laszlocph commented 3 months ago

Hello,

I am kinda in a rabbit hole:

Has anybody managed to monitor the memory utilization of MIG devices? Has anybody managed to configure custom metrics for dcgm-exporter?

Thank you.

frittentheke commented 2 months ago

@laszlocph I ran into the same issues and raised an issue with the DCGM exporter: https://github.com/NVIDIA/dcgm-exporter/issues/353

frittentheke commented 2 months ago

@laszlocph maybe you'd like to take a look at my dashboard PR: https://github.com/NVIDIA/dcgm-exporter/pull/355. Depending on the result of https://github.com/NVIDIA/dcgm-exporter/issues/353 (de-duplication of metrics due to e.g. MIG), I might be able to remove some of the aggregations again.

There is now also a dedicated issue about cleaning up the labels: https://github.com/NVIDIA/dcgm-exporter/issues/356

laszlocph commented 2 months ago

@frittentheke Amazing. This did it for us. Now we have utilization metrics for MIG partitions 🥳
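
For anyone landing here later, below is a rough sketch of how per-MIG-instance memory utilization could be derived from the exporter's Prometheus text output. It is not taken from the linked PR. It assumes the exporter is configured to export `DCGM_FI_DEV_FB_USED` and `DCGM_FI_DEV_FB_FREE` and that MIG series carry the `GPU_I_ID` and `GPU_I_PROFILE` labels. The sample payload, the hostname `node-a`, and the helper name `mig_memory_utilization` are made up for illustration.

```python
import re

# Hypothetical sample of dcgm-exporter output; real output carries more labels.
SAMPLE = '''\
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",GPU_I_PROFILE="1g.5gb",GPU_I_ID="7",Hostname="node-a"} 1024
DCGM_FI_DEV_FB_USED{gpu="0",GPU_I_PROFILE="2g.10gb",GPU_I_ID="3",Hostname="node-a"} 6144
DCGM_FI_DEV_FB_FREE{gpu="0",GPU_I_PROFILE="1g.5gb",GPU_I_ID="7",Hostname="node-a"} 3072
DCGM_FI_DEV_FB_FREE{gpu="0",GPU_I_PROFILE="2g.10gb",GPU_I_ID="3",Hostname="node-a"} 2048
'''

# Matches lines of the form: metric_name{label="value",...} number
LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def mig_memory_utilization(text):
    """Return {(hostname, gpu, mig_instance_id): used / (used + free)}."""
    used, free = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels))
        # Assumption: only MIG series have a non-empty GPU_I_ID label.
        if not labels.get('GPU_I_ID'):
            continue
        key = (labels.get('Hostname', ''), labels.get('gpu', ''), labels['GPU_I_ID'])
        if name == 'DCGM_FI_DEV_FB_USED':
            used[key] = float(value)
        elif name == 'DCGM_FI_DEV_FB_FREE':
            free[key] = float(value)

    result = {}
    for key in used.keys() & free.keys():
        total = used[key] + free[key]
        if total > 0:
            result[key] = used[key] / total
    return result


if __name__ == '__main__':
    for key, ratio in sorted(mig_memory_utilization(SAMPLE).items()):
        print(key, f'{ratio:.0%}')
```

Keying on host, GPU index, and MIG instance ID keeps the partitions of one physical GPU apart. If the exporter emits duplicate series per instance (the topic of issue 353 above), this sketch simply keeps the last value seen.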