NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs.
Apache License 2.0

I stopped nvidia-dcgm.service and disabled auto-restart. I still see dcgmi CLI working. Will dcgmi CLI not use nvidia-dcgm.service? #108

Closed. vinayburugu closed this issue 11 months ago

vinayburugu commented 11 months ago

I stopped nvidia-dcgm.service and disabled auto-restart, but the dcgmi CLI still works. Does the dcgmi CLI not use nvidia-dcgm.service? If not, what does it talk to in order to get the health status of the GPU?

nikkon-dev commented 11 months ago

@vinayburugu,

If the standalone nv-hostengine is not running, the dcgmi CLI tool can run an embedded hostengine instead. In that case, the hostengine only has the privileges of the user running the dcgmi command, so functionality that requires root privileges may not work.
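To tell which of the two modes a dcgmi call is likely to end up in, one option is to check whether a standalone hostengine is accepting connections. The following is a minimal sketch, not part of DCGM: it assumes nv-hostengine is listening on localhost on its default TCP port 5555 (a host configured with a different port or a Unix domain socket would need a different check), and the function name `standalone_hostengine_listening` is hypothetical.

```python
import socket

# Assumption: the standalone nv-hostengine uses its default TCP port.
DEFAULT_HOSTENGINE_PORT = 5555


def standalone_hostengine_listening(host="127.0.0.1",
                                    port=DEFAULT_HOSTENGINE_PORT,
                                    timeout=1.0):
    """Return True if something accepts TCP connections at host:port.

    This only shows that a listener exists; it does not prove that the
    listener is nv-hostengine.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    if standalone_hostengine_listening():
        print("A listener is present on the hostengine port.")
    else:
        print("No listener found; dcgmi may fall back to an embedded hostengine.")
```

A successful connection only means that something is bound to the port, so treat the result as a hint rather than a guarantee.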

vinayburugu commented 11 months ago

@nikkon-dev, can I run a new nv-hostengine in embedded mode alongside the default nv-hostengine on the same instance? Also, will the new embedded hostengine have root privileges?

nikkon-dev commented 11 months ago

@vinayburugu,

Running multiple nv-hostengine instances on the same node is strongly discouraged and is not an officially supported configuration. The embedded hostengine does not communicate with, inherit permissions from, or synchronize its work with any other nv-hostengine instance, which may lead to deadlocks in the driver.
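As a guard against accidentally starting a second standalone instance, a script could first look for existing nv-hostengine processes. This is a rough, Linux-only sketch that assumes the process name reported in `/proc/<pid>/comm` is `nv-hostengine`; the helper `find_hostengine_pids` is hypothetical. It cannot detect an embedded hostengine, because that one lives inside another process such as dcgmi.

```python
from pathlib import Path


def find_hostengine_pids(proc_root="/proc", name="nv-hostengine"):
    """Return the sorted PIDs of processes whose comm name equals `name`."""
    pids = []
    for entry in Path(proc_root).iterdir():
        # Only numeric directory names in /proc correspond to processes.
        if not entry.name.isdigit():
            continue
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            # The process exited or is not readable; skip it.
            continue
        if comm == name:
            pids.append(int(entry.name))
    return sorted(pids)


if __name__ == "__main__":
    pids = find_hostengine_pids()
    if len(pids) > 1:
        print(f"Warning: multiple nv-hostengine processes found: {pids}")
    elif pids:
        print(f"One nv-hostengine process found: {pids[0]}")
    else:
        print("No standalone nv-hostengine process found.")
```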

vinayburugu commented 11 months ago

@nikkon-dev, Thank you for the explanation.

Will the embedded hostengine started by the dcgmi CLI be shut down once a given dcgmi command has finished? In other words, will a new embedded hostengine be used every time a dcgmi command is executed?

nikkon-dev commented 11 months ago

@vinayburugu,

The embedded hostengine is loaded as a shared library into the dcgmi process. When that process terminates, the hostengine stops with it.