NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

dcp metrics supports gpu architecture #370

Closed lxzjd closed 1 month ago

lxzjd commented 2 months ago

Ask your question

I am running dcgm-exporter in my environment and it reports this error: "DCP metrics are supported for Volta, Turing or Ampere GPUs architectures only". Does DCGM currently support DCP metrics only on these architectures, and not on Hopper?

nvvfedorov commented 2 months ago

@lxzjd, can you tell us about your environment and share the output of `nvidia-smi`? Please also share the error log. Thank you in advance.

lxzjd commented 2 months ago

@nvvfedorov, This is my nvidia-smi output: image

error log: image

My metrics profile default-counters.csv: image

I know the NVIDIA P40 does not support DCP metrics, but this error confuses me. On an NVIDIA H800 I do not get the error log, yet curl http://localhost:9400/metrics does not show any DCP-related metrics either. I just want to know which architectures DCGM's DCP metrics currently support. image

nvvfedorov commented 2 months ago

@lxzjd, it appears that on the H800 machine, you are using the default configuration for the DCGM-exporter, which does not include DCP metrics. Please update the configuration to use the dcp-metrics-included.csv file instead.
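
To confirm whether a running exporter actually exposes DCP metrics after a configuration change, the endpoint output can be scanned for profiling fields. The following is a minimal sketch, not part of dcgm-exporter. It assumes that DCP (profiling) metrics use the `DCGM_FI_PROF_` name prefix and that the exporter serves the Prometheus text format at the URL used earlier in this thread. The function names `dcp_metric_names` and `fetch_metrics` are hypothetical helpers introduced for illustration.

```python
import urllib.request

# Assumption: DCP (profiling) metrics are exported under this name prefix.
DCP_PREFIX = "DCGM_FI_PROF_"


def dcp_metric_names(metrics_text):
    """Return the set of DCP metric names found in Prometheus text output."""
    names = set()
    for raw in metrics_text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or whitespace.
        head = line.split("{", 1)[0].split()
        if head and head[0].startswith(DCP_PREFIX):
            names.add(head[0])
    return names


def fetch_metrics(url="http://localhost:9400/metrics", timeout=5.0):
    """Download the exporter's metrics page as text (URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `dcp_metric_names(fetch_metrics())` on the exporter host returns an empty set when the active collectors file has no DCP fields enabled, which is the situation described for the H800 machine above.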

lxzjd commented 1 month ago

@nvvfedorov, thank you very much, my problem is solved. The H800 does report DCP metrics; I had confused the default configuration of the k8s startup with the default configuration file of the Docker startup. The error message "DCP metrics are supported for Volta, Turing or Ampere GPUs architectures only" is too easy to misunderstand. Can you change it?
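
Since the confusion here came from two different default collector files, it can help to check which DCP fields a given CSV file actually enables. The sketch below is an illustration only. It assumes the collectors CSV layout in which lines starting with `#` are comments and the first comma-separated column is the DCGM field name, and it again assumes the `DCGM_FI_PROF_` prefix for DCP fields. `enabled_dcp_fields` is a hypothetical helper, not an exporter API.

```python
def enabled_dcp_fields(csv_text):
    """List DCP field names enabled (not commented out) in a collectors CSV."""
    fields = []
    for raw in csv_text.splitlines():
        line = raw.strip()
        # Blank lines and '#' comment lines do not enable any field.
        if not line or line.startswith("#"):
            continue
        # First column is assumed to hold the DCGM field name.
        field = line.split(",", 1)[0].strip()
        if field.startswith("DCGM_FI_PROF_"):
            fields.append(field)
    return fields
```

Reading the file in use with `open(path).read()` and passing the text to this function shows at a glance whether the configuration can produce DCP metrics at all; an empty list means none are enabled.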