NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

E: Unable to locate package datacenter-gpu-manager #46

Closed cyLi-Tiger closed 2 years ago

cyLi-Tiger commented 2 years ago

I followed the install instructions here, but got an error saying that the package could not be found.

My commands were as follows:

  1. apt-key del 7fa2af80
  2. distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
  3. wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
  4. dpkg -i cuda-keyring_1.0-1_all.deb
  5. apt-get update
  6. apt-get install -y datacenter-gpu-manager

At step 6 I got the error "E: Unable to locate package datacenter-gpu-manager".

How can I fix this?
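A quick way to sanity-check step 2 is to run the dot-stripping on known inputs. The sketch below assumes the intended sed expression escapes the dot (`s/\.//g`); with an unescaped `.`, sed would delete every character and the download URL in step 3 would be wrong. The helper name `distribution_name` is hypothetical and only used for illustration.

```shell
#!/bin/sh
# Hypothetical helper: join an OS ID and VERSION_ID and strip the dots,
# which is what the repository path under .../cuda/repos/ is built from.
distribution_name() {
    # $1 = ID (e.g. ubuntu), $2 = VERSION_ID (e.g. 18.04)
    printf '%s%s\n' "$1" "$2" | sed -e 's/\.//g'
}

# On a real system the inputs would come from /etc/os-release:
# . /etc/os-release; distribution_name "$ID" "$VERSION_ID"
```

If the printed value does not match a directory that exists under the CUDA repository URL, `apt-get update` will not pick up the repository and the package will not be found.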

nikkon-dev commented 2 years ago

What is your $distribution value?

cyLi-Tiger commented 2 years ago

ubuntu1804

nikkon-dev commented 2 years ago

Could you check if adding the repository manually would work for you?

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
sudo apt-get update

cyLi-Tiger commented 2 years ago

While waiting for your reply, I found another way to install DCGM, using the method described in Triton Model Analyzer's install.md.

Now I can use the dcgmi CLI, but I hit another error when running dcgmi discovery -l to check my GPU: Error: unable to establish a connection to the specified host: localhost. Error: Unable to connect to host engine. Host engine connection invalid/disconnected.

How can I fix that?

nikkon-dev commented 2 years ago

@cyLi-Tiger, I hope you have not used the same 2.0.13 version as in the Model Analyzer's install.md. That's an old version. The latest version is 2.4.6. This approach also has a drawback: you will not automatically get newer versions via apt.

As for the error, you have not started the nvidia-dcgm service, so the nv-hostengine process is not running. The dcgmi command is just a CLI for the nv-hostengine and needs its server part to operate.
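The client/server relationship described here can be checked with a small wrapper. This is only a sketch: the function name `check_hostengine` is hypothetical, and it assumes that dcgmi exits with a non-zero status when it cannot reach the host engine. The systemctl command in the hint is the one given in the DCGM documentation for systemd-based hosts.

```shell
#!/bin/sh
# Hypothetical helper: report whether dcgmi can reach nv-hostengine.
# Assumption: dcgmi returns a non-zero exit status on connection failure.
check_hostengine() {
    if dcgmi discovery -l >/dev/null 2>&1; then
        echo "host engine reachable"
    else
        echo "host engine not reachable; try: sudo systemctl --now enable nvidia-dcgm"
        return 1
    fi
}
```

Running `check_hostengine` before any other dcgmi command makes the missing-service case explicit instead of surfacing as a connection error.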

cyLi-Tiger commented 2 years ago

@nikkon-dev I noticed the version issue and have installed the latest one.

The command for starting the nvidia-dcgm service that I found in the DCGM documentation needs systemctl, but I am using Docker, which does not seem to support systemctl. Is there another way to start the nvidia-dcgm service, such as service nvidia-dcgm start?

nikkon-dev commented 2 years ago

In a Dockerized environment, you will have to start nv-hostengine manually. Here is an example: Dockerfile. There is also a ready-made Docker image for DCGM: nvidia/dcgm.

Please remember that there should not be two nv-hostengine instances accessing the same hardware. This means you should not start two docker containers with nv-hostengine running.
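Inside a container without systemd, the manual start could look roughly like the sketch below. It is not taken from the linked Dockerfile. The function name `start_hostengine_once` is hypothetical, and the sketch assumes that running `nv-hostengine` with no arguments starts it as a background daemon.

```shell
#!/bin/sh
# Hypothetical helper: start nv-hostengine unless one is already running
# in this container.
start_hostengine_once() {
    if pgrep -x nv-hostengine >/dev/null 2>&1; then
        echo "nv-hostengine already running"
        return 0
    fi
    # Assumption: with no arguments, nv-hostengine daemonizes itself.
    nv-hostengine
}
```

The `pgrep` check only sees processes in the container's own PID namespace, so it cannot detect an nv-hostengine running in another container or on the host. Avoiding two instances on the same GPUs still has to be handled at the deployment level.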

cyLi-Tiger commented 2 years ago

Thanks for your reply. I have successfully used DCGM to monitor my GPU, following dcgm_monitor in Model Analyzer.

But I still have 3 questions about DCGM. I noticed that there are field identifiers such as memory utilization and GPU utilization.

  1. How are these metrics calculated?
  2. What if I want to monitor the running status of each core of the GPU? Is there an API for that?
  3. Is there a way to monitor the highest GPU memory use during a time period? I only have a toy plan: collect multiple records during the time period and return the maximum among them.

nikkon-dev commented 2 years ago

The last question is addressed in #48.
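For reference, the "toy plan" from question 3 (sample repeatedly, keep the maximum) can be sketched in a few lines of shell. The helper name `max_of_samples` is hypothetical. It reads one numeric sample per line from standard input, so the source of the samples is up to the caller.

```shell
#!/bin/sh
# Hypothetical helper: print the largest numeric value read from stdin.
# Non-numeric lines (headers, blank lines) are skipped.
# Exits non-zero if no numeric sample was seen.
max_of_samples() {
    awk '
        $1 ~ /^[0-9]+(\.[0-9]+)?$/ {
            if (!seen || $1 + 0 > max) { max = $1 + 0; seen = 1 }
        }
        END { if (seen) print max; else exit 1 }
    '
}

# Example with nvidia-smi as the sample source (used here instead of DCGM
# because it prints bare numbers): poll used memory in MiB once per second
# for 30 seconds and report the peak.
# timeout 30 nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -l 1 | max_of_samples
```

Polling can miss short spikes between samples, so the result is a lower bound on the true peak. On a multi-GPU machine the example takes the maximum across all GPUs, since nvidia-smi prints one line per GPU.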