NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Can the dcgm exporter be run in two containers on a physical machine together with other programs that call the dcgm api? #106

Closed: xcode03 closed this issue 11 months ago

xcode03 commented 12 months ago

On a physical GPU machine, I run two containers: dcgm-exporter and a program that calls the DCGM API. When I start the latter, I get the following error:

"error":"failed to watch field: Error watching fields: The third-party Profiling module returned an unrecoverable error"

Is there a way for both to coexist?

glowkey commented 12 months ago

This configuration is documented here: https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html#connecting-to-a-dcgm-standalone-container. It is possible by having the dcgm-exporter container connect to the DCGM nv-hostengine/API container.
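As a rough sketch of that setup: one container runs nv-hostengine and exposes it on port 5555, and dcgm-exporter attaches to it with `-r` (`--remote-hostengine-info`) instead of starting its own embedded engine. The image tags are placeholders and the `start_dcgm_stack` and `run` helpers are hypothetical names for illustration; the linked documentation has the exact images and versions. The script only prints the commands unless `DRY_RUN=0` is set. Presumably the other DCGM API program would also need to connect to the same host engine at `localhost:5555` rather than start an embedded one.

```shell
#!/usr/bin/env bash
# Sketch only: image tags are placeholders (assumptions), not verified versions.
set -euo pipefail

DCGM_IMAGE="${DCGM_IMAGE:-nvidia/dcgm:<tag>}"
EXPORTER_IMAGE="${EXPORTER_IMAGE:-nvcr.io/nvidia/k8s/dcgm-exporter:<tag>}"

# Print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

start_dcgm_stack() {
  # 1. Standalone DCGM container: runs nv-hostengine, listening on port 5555.
  run docker run -d --rm --gpus all --cap-add SYS_ADMIN \
    -p 5555:5555 "$DCGM_IMAGE"

  # 2. dcgm-exporter connects to that host engine instead of embedding one.
  run docker run -d --rm --gpus all --net host --cap-add SYS_ADMIN \
    "$EXPORTER_IMAGE" \
    -r localhost:5555 -f /etc/dcgm-exporter/dcp-metrics-included.csv
}

start_dcgm_stack
```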

xcode03 commented 11 months ago

In addition, I would like to ask whether configuring this parameter on an A100 will affect DCGM metric collection. @glowkey

# /etc/modprobe.d/ncu.conf
options nvidia NVreg_RestrictProfilingToAdminUsers=0

nikkon-dev commented 11 months ago

@xcode03,

That parameter controls whether collecting DCP metrics (field IDs 1001-1014) on GV100/GA100 GPUs requires root permissions. On architectures before GH100, DCP metrics are collected through the GPU's profiling capabilities. If you set NVreg_RestrictProfilingToAdminUsers=0, nv-hostengine does not need root privileges to collect DCP metrics.
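
To confirm that the setting is active after the driver has been reloaded, one option is to read the driver's parameter dump. This is a minimal sketch, assuming the driver reports the value as `RmProfilingAdminOnly` in `/proc/driver/nvidia/params`; the function name `profiling_admin_only` is hypothetical.

```shell
# Print the driver's RmProfilingAdminOnly value:
#   1 = profiling restricted to admin users, 0 = unrestricted.
# Returns non-zero if the key is not found in the params file.
profiling_admin_only() {
  # Assumed default location of the NVIDIA driver parameter dump.
  local params_file="${1:-/proc/driver/nvidia/params}"
  awk -F': *' '
    $1 == "RmProfilingAdminOnly" { print $2; found = 1 }
    END { if (!found) exit 1 }
  ' "$params_file"
}
```

Calling `profiling_admin_only` with no argument reads the live driver parameters. A modprobe option only takes effect once the nvidia kernel module is reloaded or the machine is rebooted, so a value of 1 right after editing the file may simply mean the old setting is still loaded.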