D34DC3N73R / netdata-glibc

netdata with glibc package for use with nvidia-docker2
GNU General Public License v3.0

libnvidia-ml.so #4

Closed oamster closed 1 year ago

oamster commented 2 years ago

Having trouble getting netdata to work with nvidia. I am able to run nvidia-smi on the host machine (openmediavault) as well as in another docker container (plex media server). I was getting the same error in the plex container as in netdata; editing config.toml to use ldconfig = "/sbin/ldconfig.real" fixed the issue for plex, but doesn't help netdata.

Here's my kernel and docker version:

Linux 5.10.0-0.bpo.9-amd64 #1 SMP Debian 5.10.70-1~bpo10+1 (2021-10-10) x86_64 GNU/Linux

Client: Docker Engine - Community
 Version:           20.10.12
 API version:       1.41
 Go version:        go1.16.12
 Git commit:        e91ed57
 Built:             Mon Dec 13 11:45:37 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.12
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.12
  Git commit:       459d0df
  Built:            Mon Dec 13 11:43:46 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.12
  GitCommit:        7b11cfaabd73bb80907dd23182b9347b4245eb5d
 nvidia:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

I'm getting this error when running nvidia-smi in the container:

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system. Please also try adding directory that contains libnvidia-ml.so to your system PATH.

As well as errors like this in the error log:

2022-01-13 21:05:35: go.d ERROR: prometheus[nvidia_gpu_exporter_local] Get "http://127.0.0.1:9445/metrics": dial tcp 127.0.0.1:9445: connect: connection refused

2022-01-13 21:05:35: go.d ERROR: prometheus[nvidia_gpu_exporter_local] check failed

2022-01-13 21:05:35: go.d ERROR: prometheus[nvidia_smi_exporter_local] Get "http://127.0.0.1:9454/metrics": dial tcp 127.0.0.1:9454: connect: connection refused

2022-01-13 21:05:35: go.d ERROR: prometheus[nvidia_smi_exporter_local] check failed

2022-01-13 21:05:35: python.d INFO: plugin[main] : [nvidia_smi] built 1 job(s) configs

2022-01-13 21:05:36: netdata ERROR : PLUGIN[diskspace] : DISKSPACE: Mount point '/usr/bin/nvidia-smi' (disk '_usr_bin_nvidia-smi', filesystem 'ext4', root '/usr/lib/nvidia/current/nvidia-smi') is not a directory. (errno 22, Invalid argument)

2022-01-13 21:05:36: netdata ERROR : PLUGIN[diskspace] : DISKSPACE: Mount point '/usr/bin/nvidia-debugdump' (disk '_usr_bin_nvidia-debugdump', filesystem 'ext4', root '/usr/lib/nvidia/current/nvidia-debugdump') is not a directory.

2022-01-13 21:05:36: netdata ERROR : PLUGIN[diskspace] : DISKSPACE: Mount point '/usr/lib64/libnvidia-ml.so.460.73.01' (disk '_usr_lib64_libnvidia-ml.so.460.73.01', filesystem 'ext4', root '/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.460.73.01') is not a directory. (errno 22, Invalid argument)

2022-01-13 21:05:36: netdata ERROR : PLUGIN[diskspace] : DISKSPACE: Mount point '/usr/lib64/libcuda.so.460.73.01' (disk '_usr_lib64_libcuda.so.460.73.01', filesystem 'ext4', root '/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.460.73.01') is not a directory.

2022-01-13 21:05:36: netdata ERROR : PLUGIN[diskspace] : DISKSPACE: Mount point '/usr/lib64/libnvidia-ptxjitcompiler.so.460.73.01' (disk '_usr_lib64_libnvidia-ptxjitcompiler.so.460.73.01', filesystem 'ext4', root '/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.460.73.01') is not a directory. (errno 22, Invalid argument)

2022-01-13 21:05:36: netdata ERROR : PLUGIN[diskspace] : DISKSPACE: Mount point '/dev/nvidiactl' (disk '_dev_nvidiactl', filesystem 'devtmpfs', root '/nvidiactl') is not a directory.

2022-01-13 21:05:36: netdata ERROR : PLUGIN[diskspace] : DISKSPACE: Mount point '/dev/nvidia-uvm' (disk '_dev_nvidia-uvm', filesystem 'devtmpfs', root '/nvidia-uvm') is not a directory. (errno 22, Invalid argument)

2022-01-13 21:05:36: netdata ERROR : PLUGIN[diskspace] : DISKSPACE: Mount point '/dev/nvidia-uvm-tools' (disk '_dev_nvidia-uvm-tools', filesystem 'devtmpfs', root '/nvidia-uvm-tools') is not a directory.

2022-01-13 21:05:36: netdata ERROR : PLUGIN[diskspace] : DISKSPACE: Mount point '/dev/nvidia0' (disk '_dev_nvidia0', filesystem 'devtmpfs', root '/nvidia0') is not a directory. (errno 22, Invalid argument)

2022-01-13 21:06:06: python.d ERROR: nvidia_smi[nvidia_smi] : xml parse failed: "b"NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.\nPlease also try adding directory that contains libnvidia-ml.so to your system PATH.\n"", error: syntax error: line 1, column 0

2022-01-13 21:06:06: python.d INFO: plugin[main] : nvidia_smi[nvidia_smi] : check failed
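
For reference, a quick way to check whether the NVML library was actually mounted into the container at all, and where (the paths below are assumptions based on the mount errors above):

# run on the host; lists any copy of the NVML library inside the running container
docker exec netdata find /usr/lib64 /usr/lib/x86_64-linux-gnu -name 'libnvidia-ml.so*' 2>/dev/null
# and show what, if anything, ended up in the usual target directory
docker exec netdata ls -l /usr/lib64/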

D34DC3N73R commented 2 years ago

I haven't tested or run openmediavault before, but this sounds kind of similar to issue #3. Does it work if you run:

docker exec netdata bash -c 'LDCONFIG=$(find /usr/lib64/ -name libnvidia-ml.so.*) nvidia-smi'

oamster commented 2 years ago

Here's the output:

~# docker exec netdata bash -c 'LDCONFIG=$(find /usr/lib64/ -name libnvidia-ml.so.*) nvidia-smi'
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.

My libnvidia on the host machine is in: /usr/lib/x86_64-linux-gnu/

Not sure if that's the reason it's not working, but my other containers work fine with it. For now I've resorted to grafana.
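
A quick host-side check to confirm where the driver libraries actually live and what the dynamic linker sees (only the library name is assumed here):

# run on the host, not in the container
ldconfig -p | grep libnvidia-ml
ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*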

D34DC3N73R commented 2 years ago

/usr/lib/x86_64-linux-gnu/ is where libnvidia is on my host system as well (ubuntu 20.04). But inside the container, it should be in /usr/lib64/. What steps did you take to install the nvidia container toolkit and the nvidia drivers?

Edit: I also found this regarding OMV + Nvidia: https://forum.openmediavault.org/index.php?thread/40883-nvidia-working-with-omv-6/

Also see this if you're running OMV 5: https://forum.openmediavault.org/index.php?thread/39413-nvidia-smi-couldn-t-find-libnvidia-ml-so-library-in-your-system-please-make-sure/
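
A generic sanity check is to run a plain CUDA base image with GPU access; if that also fails, the problem is in the host driver / nvidia-docker2 setup rather than in this image (the tag below is just an example, use whatever CUDA base image you have):

# run on the host; should print the same table as nvidia-smi does on the host
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi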

oamster commented 2 years ago

I had actually used this guide to set everything up, both the drivers and the nvidia container toolkit: https://forum.openmediavault.org/index.php?thread/38013-howto-nvidia-hardware-transcoding-on-omv-5-in-a-plex-docker-container/

I removed and reinstalled the drivers, but did not manually remove /usr/lib/x86_64-linux-gnu/ or anything in that directory. Maybe I should give that a try. It's just strange that everything else works with the GPU, just not the official netdata image or yours.

Edit: Maybe it's an issue with /etc/nvidia-container-runtime/config.toml. I've tried each of these in mine:

ldconfig = "@/sbin/ldconfig"

ldconfig = "/sbin/ldconfig"

ldconfig = "/sbin/ldconfig.real"

Edit: But plex and other containers error out when ldconfig is set to anything other than ldconfig.real.

D34DC3N73R commented 2 years ago

My config.toml is the default:

$ cat /etc/nvidia-container-runtime/config.toml
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"

Did reinstalling help at all?

oamster commented 2 years ago

Tried reinstalling, didn't help. Changed my config.toml to ldconfig = "@/sbin/ldconfig" and I'm getting this error when deploying the container:

OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: open failed: /sbin/ldconfig.real: no such file or directory: unknown

No error when using ldconfig = "/sbin/ldconfig.real" but still get the python.d error.
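
It might be worth checking which ldconfig binaries actually exist on the host, since the leading @ makes nvidia-container-cli resolve the path on the host side (Debian usually only ships /sbin/ldconfig, while Ubuntu also has /sbin/ldconfig.real):

# run on the host: see which ldconfig binaries exist
ls -l /sbin/ldconfig /sbin/ldconfig.real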

I've resorted to using prometheus, the nvidia-smi exporter, and grafana, which works. But I still cannot get it to work with netdata.
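
Since the go.d errors at the top already show netdata trying to scrape exporters on ports 9445/9454, another option is to keep the exporter running and point netdata's go.d prometheus collector at it. A minimal sketch, assuming the exporter really is listening on 9454 and that the netdata container can reach that address (e.g. host networking); note this replaces the whole file, so merge instead if you have other jobs:

# inside the netdata container (or in the mounted config directory)
cat > /etc/netdata/go.d/prometheus.conf <<'EOF'
jobs:
  - name: nvidia_smi_exporter_local
    url: http://127.0.0.1:9454/metrics
EOF
# then restart netdata so go.d picks up the change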

cryptoDevTrader commented 1 year ago

Any update/progress on this? I'm having the same exact issue on OMV 6.

cryptoDevTrader commented 1 year ago

I got it to work. I followed this guide on OMV 6 to install the nvidia drivers and nvidia-docker2:

https://forum.openmediavault.org/index.php?thread/31206-how-to-setup-nvidia-in-plex-docker-for-hardware-transcoding/

It indicates that ldconfig should be set to /sbin/ldconfig.real in /etc/nvidia-container-runtime/config.toml. However, leaving it set to @/sbin/ldconfig (the default after I installed) works for both the Plex container and netdata.

cryptoDevTrader commented 1 year ago

Note that I also downgraded the nvidia packages as per this post. Using up-to-date nvidia packages causes the plex container to not work with the configuration noted above, though the netdata-glibc container does work.

https://forums.developer.nvidia.com/t/issue-with-setting-up-triton-on-jetson-nano/248485/2
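
For anyone else who needs to pin to older packages, a hypothetical sketch of the apt side (the package names are the usual ones; the version strings are placeholders, pick whatever the linked post recommends):

# list the versions apt can still see
apt list -a nvidia-container-toolkit nvidia-docker2
# install a specific older version and hold it so apt upgrade doesn't bump it again
sudo apt-get install nvidia-container-toolkit=<older-version> nvidia-docker2=<older-version>
sudo apt-mark hold nvidia-container-toolkit nvidia-docker2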

D34DC3N73R commented 1 year ago

@cryptoDevTrader you may also want to give the dev image & instructions a try. We'll be moving to that with the next netdata release.

image: d34dc3n73r/netdata-glibc:dev
instructions: https://github.com/D34DC3N73R/netdata-glibc/tree/dev

When the official release happens you'll have to change the image to :stable or :latest depending on your preference.

cryptoDevTrader commented 1 year ago

> @cryptoDevTrader you may also want to give the dev image & instructions a try. We'll be moving to that with the next netdata release.
> image: d34dc3n73r/netdata-glibc:dev
> instructions: https://github.com/D34DC3N73R/netdata-glibc/tree/dev
>
> When the official release happens you'll have to change the image to :stable or :latest depending on your preference.

This was hugely helpful!

I am running both netdata-glibc and plex via docker-compose. netdata-glibc was already working properly with the previous config, using the NVIDIA_VISIBLE_DEVICES env var and the nvidia runtime. Plex, however, was not working with that same configuration and the latest nvidia packages (older versions worked fine). Upgrading the nvidia packages to the latest versions and using the deploy method described in the dev branch worked for both deployments.
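
For reference, the compose deploy method amounts to requesting the GPU as a device reservation instead of relying on the nvidia runtime plus NVIDIA_VISIBLE_DEVICES. As a rough CLI sketch of the same idea (the --gpus flag is the docker-run counterpart of a compose GPU reservation; the tag is an assumption and your usual netdata mounts/options still apply — follow the dev branch README for the actual compose service):

# rough docker-run equivalent of a compose GPU device reservation
docker run -d --name=netdata --gpus all d34dc3n73r/netdata-glibc:dev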

D34DC3N73R commented 1 year ago

Closing this, but feel free to reopen if it can be reproduced with the newest updates.