Closed jxh314 closed 5 months ago
The 'dcgmi' client command can connect to an 'nv-hostengine' server running on another host with the '--host' parameter. The 'nv-hostengine' process must be running and listening on the remote host/port for this to work. The code for this is in the open-source DCGM repo.
The error you are seeing most likely means that no nv-hostengine is running on 10.112.220.8, or that it is not accepting remote connections.
Please note that by default, nv-hostengine only binds to 127.0.0.1, so it won't be listening for remote connections. If you want to listen for remote connections, you'll need to use the -b option when starting nv-hostengine to specify the IP address you want it to listen for connections on. You may also specify -b ALL to have it listen on all network interfaces.
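Before changing any configuration, it can help to check whether anything is accepting TCP connections on the remote host at all. The sketch below is not part of DCGM; it only attempts a plain TCP connection. The function name `hostengine_reachable` is hypothetical, and the port 5555 is an assumption about the default nv-hostengine port, so substitute whatever port your nv-hostengine was actually started with.

```python
import socket


def hostengine_reachable(host: str, port: int = 5555, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    Assumption: 5555 is the port nv-hostengine listens on; pass a
    different value if your setup uses another port.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    host = "10.112.220.8"  # remote host from the question
    if hostengine_reachable(host):
        print(f"{host} accepts TCP connections on the assumed port")
    else:
        print(f"{host} is not accepting connections; "
              "nv-hostengine may not be running or may be bound to 127.0.0.1 only")
```

A failed connection here does not say why it failed. If nv-hostengine is running on the remote host but was started without -b, it is bound to 127.0.0.1 and will refuse remote connections in exactly the same way as if it were not running. A firewall between the two hosts can produce the same result.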
Hello! Can DCGM obtain GPU information from another host? When I run the command
dcgmi discovery --host 10.112.220.8 -l
to obtain information from another host (where the DCGM service is started), it fails with an error prompt. Is there any other configuration to be done, or is something wrong with the IP address? Also, the introduction states that DCGM is used for cluster management. Is there code for communication between nodes or GPUs in the open-source libraries? Thanks a lot.