NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

1:2.3.4 version dcgm_prometheus.py error AttributeError: 'DcgmPrometheus' object has no attribute 'm_publishFieldIds' #22

Open graywen24 opened 2 years ago

graywen24 commented 2 years ago

We followed the documentation here: https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-user-guide/integrating-with-dcgm.html#starting-prometheus-client

It looks like the new version of datacenter-gpu-manager has an issue with this script:

python3 dcgm_prometheus.py -e
Traceback (most recent call last):
  File "dcgm_prometheus.py", line 264, in <module>
    main()
  File "dcgm_prometheus.py", line 257, in main
    prometheus_obj.LogBasicInformation()
  File "dcgm_prometheus.py", line 142, in LogBasicInformation
    for fieldId in self.m_publishFieldIds:
AttributeError: 'DcgmPrometheus' object has no attribute 'm_publishFieldIds'

datacenter-gpu-manager is already installed. Version table: 1:2.3.4 600
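
The traceback above shows a general Python failure mode: a method reads an instance attribute that no code path has assigned. The sketch below illustrates that pattern and a defensive guard. It is not the DCGM source. The class `ExampleReader`, its methods, and the field ID used in the usage note are hypothetical, and only the attribute name `m_publishFieldIds` is taken from the traceback. The actual cause in 1:2.3.4 may be different.

```python
class ExampleReader:
    """Hypothetical stand-in; NOT the real DcgmPrometheus class."""

    def __init__(self, publish_field_ids=None):
        # Assumption: the attribute is assigned only on some code paths,
        # which is one way an AttributeError like the one above can arise.
        if publish_field_ids is not None:
            self.m_publishFieldIds = list(publish_field_ids)

    def log_basic_information(self):
        # Raises AttributeError when __init__ skipped the assignment.
        return [f"publishing field {fid}" for fid in self.m_publishFieldIds]

    def log_basic_information_safe(self):
        # Defensive variant: fall back to an empty list when unset.
        field_ids = getattr(self, "m_publishFieldIds", [])
        return [f"publishing field {fid}" for fid in field_ids]


def reproduce():
    """Return the error message raised by the unguarded method, if any."""
    try:
        ExampleReader().log_basic_information()
    except AttributeError as exc:
        return str(exc)
    return None
```

Calling `reproduce()` returns a message naming the missing attribute, while `ExampleReader().log_basic_information_safe()` returns an empty list instead of raising.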

graywen24 commented 2 years ago

I was not able to find any information when googling this... This version was just updated in Feb 2022, and I guess no one uses this feature for monitoring...

nikkon-dev commented 2 years ago

@graywen24,

Unfortunately, dcgm_prometheus.py is not actively supported; it is more of an example. We have the dcgm-exporter project, which is meant to provide Prometheus metrics and is actively supported.

graywen24 commented 2 years ago

Thanks, but we don't use a k8s cluster and only run offline training on a single GPU node. Installing dcgm-exporter would be a very heavy process for the node, while node-exporter does not provide GPU monitoring metrics.

nikkon-dev commented 2 years ago

@graywen24,

dcgm-exporter may work outside of a k8s environment, and in general it is just a small binary written in Go. If DCGM is installed on the machine, you do not need the dcgm-exporter Docker image (just the dcgm-exporter binary), because libdcgm.so will already be on the machine.
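
To check the result on a single node without a full Prometheus stack, a small script can read the exporter's metrics page directly. This is a minimal sketch that assumes the dcgm-exporter binary is already running and serving Prometheus text format at `http://localhost:9400/metrics`. The address, the `DCGM_` name prefix, and the helper names `fetch_metrics` and `parse_metrics` are assumptions for illustration, so adjust them to your setup.

```python
import urllib.request


def fetch_metrics(url="http://localhost:9400/metrics", timeout=5.0):
    """Download the raw metrics text (URL is an assumed default)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text, prefix="DCGM_"):
    """Map each 'name{labels}' series starting with prefix to a float."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labeled sample: everything up to the last '}' is the series.
            cut = line.rindex("}") + 1
            series, rest = line[:cut], line[cut:].split()
        else:
            # Unlabeled sample: 'name value [timestamp]'.
            parts = line.split()
            series, rest = parts[0], parts[1:]
        if not series.startswith(prefix) or not rest:
            continue
        try:
            samples[series] = float(rest[0])
        except ValueError:
            # Ignore lines whose value is not a number.
            continue
    return samples
```

On the GPU node, `parse_metrics(fetch_metrics())` would return a dictionary of the exporter's GPU series, which can then be printed or logged alongside the training job.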