NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Can DCGM achieve obtaining gpu information of another host? #133

Closed jxh314 closed 5 months ago

jxh314 commented 11 months ago

Hello! Can DCGM obtain GPU information from another host? When I run the command dcgmi discovery --host 10.112.220.8 -l to get information from another host (where the DCGM service is started), it fails with the following message:

Error: unable to establish a connection to the specified host: 10.112.220.8
Error: Unable to connect to host engine. Host engine connection invalid/disconnected.

Is there any other configuration I need to do, or is something wrong with the IP address? Also, the introduction states that DCGM is used for cluster management. Does the open-source code include anything for communication between nodes or GPUs? Thanks a lot.

glowkey commented 11 months ago

The 'dcgmi' client command can connect to an 'nv-hostengine' server running on another host with the '--host' parameter. The 'nv-hostengine' process must be running and listening on the remote host/port for this to work. The code for this is in the open-source DCGM repo.

The error you are seeing is most likely because no nv-hostengine is running on 10.112.220.8.

dbeer commented 11 months ago

Please note that by default, nv-hostengine only binds to 127.0.0.1, so it won't be listening for remote connections. If you want to listen for remote connections, you'll need to use the -b option when starting nv-hostengine to specify the IP address you want it to listen for connections on. You may also specify -b ALL to have it listen on all network interfaces.
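
To tell these cases apart before retrying dcgmi, it can help to check from the client machine whether anything accepts TCP connections on the host engine's port. The following is only a rough sketch, not part of DCGM: the function name hostengine_reachable is made up for illustration, and port 5555 is assumed to be the port nv-hostengine listens on (adjust it if your host engine was started on a different port). A False result is consistent with nv-hostengine not running, being bound only to 127.0.0.1, or being blocked by a firewall; it cannot say which.

```python
import socket

# Assumption: nv-hostengine listens on TCP port 5555 unless configured otherwise.
DCGM_ASSUMED_PORT = 5555


def hostengine_reachable(host: str, port: int = DCGM_ASSUMED_PORT,
                         timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This only shows that something is listening on that address and port.
    It does not verify that the listener is actually nv-hostengine.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


# Example use from the client machine (hypothetical session):
#   hostengine_reachable("10.112.220.8")
# If this returns False, check on the remote host that nv-hostengine is running
# and was started with -b <address> or -b ALL, as described above.
```

If the check returns True but dcgmi still fails, the problem is probably not the bind address, and the nv-hostengine logs on the remote host are the next place to look.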