Open graywen24 opened 2 years ago
I was not able to find any information when googling this... This version was only updated in Feb 2022, and I guess no one uses this feature for monitoring...
@graywen24,
Unfortunately, the dcgm_prometheus.py is not actively supported and is rather an example. We have the dcgm-exporter project that is meant to provide Prometheus metrics and is actively supported.
Thanks, but we don't use a k8s cluster and only run offline training on a single GPU node. Installing dcgm-exporter would be a very heavy process for the node, while node-exporter does not provide GPU monitoring metrics.
@graywen24,
dcgm-exporter may work outside of a k8s environment, and in general it is just a small binary written in Go. If DCGM is installed on the machine, you do not need the dcgm-exporter Docker image (just the dcgm-exporter binary), because libdcgm.so will already be on the machine.
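For anyone trying the standalone binary on a single node, here is a minimal sketch of how the exporter's output could be checked without a Prometheus server. It assumes the exporter is serving the Prometheus text format at `http://localhost:9400/metrics` (adjust the address to whatever your exporter is configured to listen on). The helper names `fetch_metrics` and `parse_prometheus_text` are hypothetical, and the parser is deliberately simplified: it ignores timestamps and does not handle commas or escaped quotes inside label values.

```python
import urllib.request


def fetch_metrics(url="http://localhost:9400/metrics", timeout=5):
    """Fetch the raw metrics text. The URL is an assumption; adjust as needed."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_prometheus_text(text):
    """Parse Prometheus text lines into (name, labels, value) tuples.

    Simplified: assumes no timestamps and no commas inside label values.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The sample value is the last space-separated token.
        metric, _, value = line.rpartition(" ")
        name, _, rest = metric.partition("{")
        labels = {}
        if rest:
            for pair in rest.rstrip("}").split(","):
                if "=" in pair:
                    key, _, val = pair.partition("=")
                    labels[key.strip()] = val.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples
```

On the GPU node this could be used as `parse_prometheus_text(fetch_metrics())` and then filtered by metric name, for example `DCGM_FI_DEV_GPU_UTIL`, to confirm that GPU metrics are being exported.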
We followed the doc here: https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-user-guide/integrating-with-dcgm.html#starting-prometheus-client
and it looks like the new version of datacenter-gpu-manager has an issue with this script.
datacenter-gpu-manager is already installed. Version table: 1:2.3.4 600