Closed cyLi-Tiger closed 2 years ago
What is your $distribution value?
ubuntu1804
Could you check if adding the repository manually would work for you?
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
sudo apt-get update
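To double-check the $distribution value before adding the repository, something like the following could be used. This is only a sketch: the helper name dcgm_distribution is made up for illustration, and it assumes the value is simply ID plus VERSION_ID from /etc/os-release with the dots removed, which is consistent with the ubuntu1804 reported above.

```shell
# Hypothetical helper (not part of DCGM): print the repository
# "distribution" string, e.g. ubuntu1804 for Ubuntu 18.04.
# Assumption: the string is ID + VERSION_ID with the dots stripped.
dcgm_distribution() (
    # The function body is a subshell, so sourcing os-release
    # does not leak ID / VERSION_ID into the calling shell.
    os_release="${1:-/etc/os-release}"
    . "$os_release"
    printf '%s%s\n' "$ID" "$VERSION_ID" | tr -d '.'
)

# Example use (commented out so the snippet has no side effects):
#   distribution=$(dcgm_distribution)
#   echo "https://developer.download.nvidia.com/compute/cuda/repos/${distribution}/x86_64/"
```

If the printed URL matches the one used in the commands above, the repository path is right, and after sudo apt-get update the install step (sudo apt-get install -y datacenter-gpu-manager) should be able to find the package.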
While waiting for your reply, I found another way to install DCGM, using the method described in the Triton Model Analyzer's install.md.
Now I can use the dcgmi CLI, but I hit another error when running dcgmi discovery -l to check my GPU. I got:
Error: unable to establish a connection to the specified host: localhost. Error: Unable to connect to host engine. Host engine connection invalid/disconnected.
How can I fix that problem?
@cyLi-Tiger, I hope you have not used the same 2.0.13 version as in the Model Analyzer's install.md. That is an old version; the latest version is 2.4.6. This approach also has a drawback: you will not automatically get newer versions via apt.
As for the error, you have not started the nvidia-dcgm service, so the nv-hostengine process is not running. The dcgmi command is just a CLI for the nv-hostengine and needs its server part to operate.
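On a host where systemd is available, the explanation above roughly translates into the sketch below. The function name start_and_check_dcgm is made up for illustration; the two commands inside it are the service start from the DCGM documentation and the discovery call from this thread.

```shell
# Hypothetical wrapper: start the nvidia-dcgm service (which runs
# nv-hostengine), then confirm that the dcgmi client can reach it.
# Assumes a systemd host; this will not work inside a plain container.
start_and_check_dcgm() {
    # Enable the service and start it immediately.
    sudo systemctl --now enable nvidia-dcgm || return 1
    # If nv-hostengine is up, this lists the GPUs instead of
    # failing with "Unable to connect to host engine".
    dcgmi discovery -l
}
```

Calling start_and_check_dcgm once after installation should be enough; the service then starts on boot.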
@nikkon-dev I noticed the version issue and have installed the latest one.
The command for starting the nvidia-dcgm service that I found in the DCGM documentation requires systemctl, but I am using Docker, which does not seem to support systemctl. Is there another way to start the nvidia-dcgm service, such as service nvidia-dcgm start?
For a dockerized environment, you will have to start nv-hostengine manually.
Here is an example: Dockerfile
There is also a ready-made Docker image for DCGM: nvidia/dcgm
Please remember that there should not be two nv-hostengine instances accessing the same hardware. This means you should not start two docker containers with nv-hostengine running.
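Inside a container, the manual start could look roughly like this. It is a sketch under assumptions: ensure_hostengine and the HOSTENGINE_BIN override are made-up names, and it assumes nv-hostengine daemonizes and returns when started without arguments. Note that pgrep only sees processes in the current container, so it cannot detect an nv-hostengine running in a different container or on the host.

```shell
# Hypothetical helper: start nv-hostengine only if no instance is
# already running in this container.
# HOSTENGINE_BIN is an illustrative override, handy for testing.
HOSTENGINE_BIN="${HOSTENGINE_BIN:-nv-hostengine}"

ensure_hostengine() {
    if pgrep -x nv-hostengine >/dev/null 2>&1; then
        echo "nv-hostengine already running"
    else
        # Assumption: with no arguments the host engine goes to the
        # background and returns control to the shell.
        "$HOSTENGINE_BIN" && echo "nv-hostengine started"
    fi
}

# Example use in an entrypoint script (commented out here):
#   ensure_hostengine && dcgmi discovery -l
```

This only guards against a second instance inside the same container; making sure no other container is running nv-hostengine against the same GPUs is still up to you.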
Thanks for your reply. I have successfully used DCGM to monitor my GPU, following dcgm_monitor in Model Analyzer.
But I still have 3 questions about DCGM. I noticed that there are field identifiers like memory utilization and GPU utilization.
The last question is addressed in #48.
I followed the install instructions here, but got an error saying that the package could not be found.
My commands are as follows:
At step 6 I got the error "E: Unable to locate package datacenter-gpu-manager".
How can I fix this problem?