D34DC3N73R / netdata-glibc

netdata with glibc package for use with nvidia-docker2
GNU General Public License v3.0

Can't run nvidia-smi in container #3

Closed mathieu-b closed 4 years ago

mathieu-b commented 4 years ago

Hello

first of all, thanks for figuring out a way to get NVIDIA GPU monitoring working by just extending the base netdata image :pray:

I followed the instructions as reported on the DockerHub page. I can start the container and then access the web server running at :19999. However, I can't see any section hinting at GPU / nvidia-smi monitoring.

Not seeing any stats, I thought that maybe there was some issue with the execution of nvidia-smi (if netdata uses it internally).

I tried executing nvidia-smi in the container:

docker exec netdata nvidia-smi

but received this error:

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

The only way I found to get nvidia-smi to execute successfully via docker exec was the following:

docker exec netdata bash -c 'LD_PRELOAD=$(find /usr/lib64/ -name "libnvidia-ml.so.*")  nvidia-smi'

based on this StackOverflow answer

Any clues about how this issue could be solved?

Maybe I'll take a peek at netdata's sources to see if I can "patch" the system (supposing that the solution is indeed to use LD_PRELOAD).
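
In case it's useful later, this is roughly the kind of "patch" I had in mind: a small wrapper script placed ahead of the real nvidia-smi in the PATH, setting LD_PRELOAD before calling it. The /usr/bin/nvidia-smi path below is an assumption on my part and untested:

#!/bin/sh
# Hypothetical wrapper: preload the driver's NVML library (libnvidia-ml.so.*)
# so nvidia-smi can find it, then exec the real binary with the same arguments.
export LD_PRELOAD="$(find /usr/lib64/ -name 'libnvidia-ml.so.*' | head -n 1)"
exec /usr/bin/nvidia-smi "$@"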

Best regards.

D34DC3N73R commented 4 years ago

Have you installed nvidia drivers on the host system? If so, how did you accomplish that? (There are a couple of ways, but I'd recommend adding the graphics-drivers PPA.) Can you execute nvidia-smi on the host system? Have you installed the nvidia-container-toolkit? Are you using docker run or docker-compose?
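
For reference, the PPA route usually looks something like this on Ubuntu (the 430 driver branch is only an example; pick whichever current branch the PPA offers for your card):

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt install nvidia-driver-430
# reboot, then confirm the driver is loaded
nvidia-smi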

mathieu-b commented 4 years ago

Hi

Here is some info:

Docker engine version:

$ docker --version
Docker version 18.06.2-ce, build 6d37f41

nvidia-smi on host machine:

$ nvidia-smi
Tue Nov 12 13:10:42 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0 Off |                  N/A |
| 44%   64C    P2   115W / 250W |   3439MiB / 10989MiB |     19%      Default |
+-------------------------------+----------------------+----------------------+

Docker runtime:

$ docker info | grep "Runtime"
Runtimes: nvidia runc
Default Runtime: nvidia
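
(I didn't set this machine up myself, but the nvidia default runtime looks like the standard nvidia-docker2 configuration, i.e. something along these lines in /etc/docker/daemon.json, though I haven't verified the exact contents:)

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}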

nvidia-smi in container:

$ docker container run nvidia/cuda:10.1-devel-ubuntu16.04 nvidia-smi
Tue Nov 12 12:14:08 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0 Off |                  N/A |
| 44%   64C    P2   113W / 250W |   3439MiB / 10989MiB |     22%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

The system was installed and configured by another person; however, here is what I do know:

I see that on the main page of the nvidia-docker GitHub repository, NVIDIA seems to have updated their "main" instructions for a more recent version of the Docker Engine, and it looks like they deprecated the "old" nvidia-docker2 instructions.

Maybe a newer version / updated installation might fix the issue...

Regards

D34DC3N73R commented 4 years ago

It does seem similar to this issue raised on the nvidia-docker package: https://github.com/NVIDIA/nvidia-docker/issues/854

I'd recommend updating docker, the nvidia drivers, and nvidia-docker/nvidia-container-toolkit. If you're using docker run, a separate runtime is not required since Docker v19.03. See the Docker 19.03 + nvidia-container-toolkit example.
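
To give a rough idea, on Docker 19.03+ with nvidia-container-toolkit installed the run command looks something like this (flags abridged, and I'm assuming the image name d34dc3n73r/netdata-glibc here; see the DockerHub page for the full set of netdata options):

docker run -d --name=netdata \
  --gpus all \
  -p 19999:19999 \
  --cap-add SYS_PTRACE \
  --security-opt apparmor=unconfined \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  d34dc3n73r/netdata-glibc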

mathieu-b commented 4 years ago

I see, thanks for the heads-up. I'm not sure how soon I'll be able to test the newer version and instructions, but when I do, I'll try to report back in this thread.

Regards

D34DC3N73R commented 4 years ago

Going to close this issue, but feel free to open another if you run into trouble after updating.