NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Error occurring in release 1.14.4: load library failed: libnvidia-ml.so.1: cannot open shared object file #305

Closed bawee closed 8 months ago

bawee commented 8 months ago

Hello, I'm getting a "load library failed" error, as in the previous issue (unsure whether it is related, hence the new issue), when running a nextflow pipeline with docker that uses the nvidia-container-toolkit. The error seems to be present only in the new version of nvidia-container-toolkit (1.14.4); it does not occur on an identical computer running version 1.14.3, which I had set up only a few days prior.

Command error:
  docker: Error response from daemon: failed to create task for container: failed to create a shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
  nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
  time="2024-01-24T14:05:51Z" level=error msg="error waiting for container: "

nvidia-container-toolkit was installed using apt on Ubuntu 22.04.3, following the instructions from https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

I tried installing the old version instead (sudo apt install nvidia-container-toolkit=1.14.3-1) but it was unsuccessful due to an unavailable dependency.

Thanks in advance!

Originally posted by @bawee in https://github.com/NVIDIA/nvidia-container-toolkit/issues/302#issuecomment-1908427326

elezar commented 8 months ago

I will be able to check what the source of this could be on Monday. For now, you should be able to downgrade by specifying the versions of all packages:

sudo apt-get install nvidia-container-toolkit=1.14.3-1 \
        nvidia-container-toolkit-base=1.14.3-1 \
        libnvidia-container-tools=1.14.3-1 \
        libnvidia-container1=1.14.3-1
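
A small sketch of the same downgrade, parameterised on the version, for anyone repeating it. The helper name `pin_args` is made up for illustration; the four package names are the ones listed above. The `apt-mark hold` step is an optional addition, not part of the instructions above, and only keeps a later upgrade from moving the packages forward again.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "pkg=version" for each NVIDIA Container Toolkit package.
pin_args() {
    local version="$1" pkg
    for pkg in nvidia-container-toolkit nvidia-container-toolkit-base \
               libnvidia-container-tools libnvidia-container1; do
        printf '%s=%s\n' "$pkg" "$version"
    done
}

# Show what would be installed; uncomment the commands below to actually downgrade.
pin_args 1.14.3-1
# sudo apt-get install $(pin_args 1.14.3-1)
# sudo apt-mark hold nvidia-container-toolkit nvidia-container-toolkit-base \
#     libnvidia-container-tools libnvidia-container1
```

Pinning all four packages together matters because they depend on each other's exact versions, which is presumably why installing only nvidia-container-toolkit=1.14.3-1 failed with an unavailable dependency.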

elezar commented 8 months ago

@bawee could you provide more information on your setup? How are you running containers? How is the NVIDIA Container Toolkit installed and configured to be used with Docker?

bawee commented 8 months ago

Hi @elezar, Docker was installed as follows:

sudo apt update
sudo apt install -y docker.io
sudo usermod -aG docker ${USER}
sudo systemctl restart docker

The NVIDIA toolkit was installed as follows with instructions from https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

Then configured using:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
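
A quick way to confirm that the configure step took effect is sketched below. It assumes the default config location /etc/docker/daemon.json, which is where `nvidia-ctk runtime configure --runtime=docker` writes the runtime entry; the helper name `has_nvidia_runtime` is hypothetical, and the check is a plain text match rather than a real JSON parse.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given Docker daemon config mentions an "nvidia" runtime.
has_nvidia_runtime() {
    local config="${1:-/etc/docker/daemon.json}"
    [ -r "$config" ] && grep -q '"nvidia"' "$config"
}

if has_nvidia_runtime; then
    echo "nvidia runtime is present in /etc/docker/daemon.json"
else
    echo "nvidia runtime not found; re-run: sudo nvidia-ctk runtime configure --runtime=docker"
fi
```

Note that this only checks the Docker side of the setup. It says nothing about whether the NVIDIA driver itself is installed on the host.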

The containers are run using a nextflow pipeline documented here: https://labs.epi2me.io/workflows/wf-basecalling/

Rolling back to the previous version of nvidia-container-toolkit using your instructions above did not help. The error still appeared even with v1.14.3-1. My identical machine running v1.14.3, which I set up last week, is still working fine.

I also tried completely removing docker and reinstalling using sudo apt-get autoremove -y --purge docker.io

I hope that helps. Please let me know if I need to provide more info. Thank you!

elezar commented 8 months ago

Note that docker.io is listed as a conflicting package here: https://docs.docker.com/engine/install/ubuntu/

Would you be able to install docker-ce instead?

bawee commented 8 months ago

Hi Evan,

Thank you for pointing that out. I had not seen that. I followed the instructions on https://docs.docker.com/engine/install/ubuntu/ and replaced docker.io with docker-ce.

The error message is still the same, unfortunately:

Command error:
  docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
  nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
  time="2024-01-29T18:00:47Z" level=error msg="error waiting for container: context canceled"

Here is some more information:

$ dpkg -l | grep -i docker
ii  docker-buildx-plugin                       0.12.1-1~ubuntu.22.04~jammy             amd64        Docker Buildx cli plugin.
ii  docker-ce                                  5:25.0.1-1~ubuntu.22.04~jammy           amd64        Docker: the open-source application container engine
ii  docker-ce-cli                              5:25.0.1-1~ubuntu.22.04~jammy           amd64        Docker CLI: the open-source application container engine
ii  docker-ce-rootless-extras                  5:25.0.1-1~ubuntu.22.04~jammy           amd64        Rootless support for Docker.
ii  docker-compose-plugin                      2.24.2-1~ubuntu.22.04~jammy             amd64        Docker Compose (V2) plugin for the Docker CLI.
$ nvidia-container-cli info
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory

bawee commented 8 months ago

Hi @elezar, it turns out the solution was to run sudo ubuntu-drivers install which installed nvidia-driver-535. My fresh install of 22.04 had the open source X.org driver by default (please see screenshot below).

Screenshot from 2024-01-29 21-53-56

Thanks for helping to troubleshoot and apologies, I did not see the requirement for the driver in the documentation https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
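
For anyone wanting to check which GPU kernel driver a host is actually using before installing anything, a rough sketch follows. It assumes the open source driver mentioned above is nouveau and that the proprietary driver loads a kernel module named nvidia; the helper name `gpu_kernel_driver` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `lsmod` output on stdin and name the GPU kernel driver in use.
# The first line of lsmod output is a header, so it is skipped. "nvidia" wins if both appear.
gpu_kernel_driver() {
    awk 'NR > 1 && $1 == "nvidia"  { found = "nvidia" }
         NR > 1 && $1 == "nouveau" && found == "" { found = "nouveau" }
         END { print (found == "" ? "none" : found) }'
}

if command -v lsmod >/dev/null 2>&1; then
    echo "GPU kernel driver: $(lsmod | gpu_kernel_driver)"
fi
```

If this prints nouveau or none, the container toolkit has no NVIDIA driver to work with, and something like `sudo ubuntu-drivers install` is needed first.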

elezar commented 8 months ago

Ah. Thanks. Yes, we should definitely include the output of nvidia-smi -L on the host in our issue template. The NVIDIA driver is required for the container toolkit to function.
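
A minimal host-side check along those lines might look like the sketch below. `nvidia-smi -L` and `ldconfig -p` are standard commands; the helper name `has_nvml` is hypothetical, and the messages are only illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ldconfig -p` output on stdin; succeed if libnvidia-ml.so.1 is listed.
has_nvml() {
    grep -q 'libnvidia-ml\.so\.1'
}

# Report the GPUs the driver can see, if nvidia-smi is installed at all.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || echo "nvidia-smi is installed but could not talk to the driver"
else
    echo "nvidia-smi not found: the NVIDIA driver is probably not installed"
fi

# Report whether the dynamic linker can find the library named in the error message.
if command -v ldconfig >/dev/null 2>&1 && ldconfig -p | has_nvml; then
    echo "libnvidia-ml.so.1 is visible to the dynamic linker"
else
    echo "libnvidia-ml.so.1 is not in the linker cache"
fi
```

libnvidia-ml.so.1 ships with the driver, not with the container toolkit, so a missing library on the host points at the driver installation.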

Can we close this issue then?

bawee commented 8 months ago

Yes, thank you very much.

mr-ryan-james commented 3 months ago

I had this same issue, btw, and fixed it with the same solution (downloading the drivers based off the docs here https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#driver-installation )

The reason I had this problem is that I thought I had already installed the drivers: the CUDA download instructions linked below strongly imply that they are installed as the last step, and I went with the legacy non-open "flavor".

https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network

But the "download" instructions do not seem to have anything about "sudo apt-get install cuda-drivers-535" from the "cuda installation guide". I am a bit new to CUDA and eager to just get machine learnin' on my newly rented GPU server so I can't say I fully understand the difference between cuda-drivers-535 from the "installation guide" instructions and the "sudo apt-get install -y cuda-drivers" from the CUDA download instructions.

@bawee thanks for posting this question, saved me a lot of time!

amirian commented 3 months ago

Hi @elezar, I am trying to run an application on Ubuntu 24.04 that needs the NVIDIA container toolkit, but it stops with the same error. I have no NVIDIA hardware installed, and I wish there were a solution such as GPU simulation.

amirian commented 3 months ago

No comments @elezar? How can I run the application without gpu?
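
No GPU simulation mode is mentioned anywhere in this thread. One possible workaround, sketched below, is to request GPUs only when the host actually has a working driver. This assumes the application can fall back to CPU, which is an assumption about the application and not something confirmed here. The helper name `gpu_flags`, the `NVIDIA_SMI` override, and the image name are all hypothetical; `--gpus all` is the standard Docker flag.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "--gpus all" only when the driver reports a GPU.
# NVIDIA_SMI can be overridden, which is mainly useful for testing.
gpu_flags() {
    if "${NVIDIA_SMI:-nvidia-smi}" -L >/dev/null 2>&1; then
        printf '%s\n' '--gpus all'
    fi
}

# Example: "ubuntu:22.04" stands in for whatever image the application uses.
echo "docker run --rm $(gpu_flags) ubuntu:22.04 true"
```

On a machine without an NVIDIA driver this prints a plain docker run command with no GPU request. If Docker's default runtime has been set to nvidia, the hook may still run regardless of this flag, so that setting would need checking as well.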

zxdreamer commented 3 months ago

Thanks, I ran into the same problem and solved it the same way (by running sudo ubuntu-drivers install).

Saurav3108 commented 1 month ago

After reinstalling the CUDA drivers, CUDA toolkit and container toolkit, I was still getting the error. The issue was resolved by simply reinstalling docker-ce using sudo apt-get install --reinstall docker-ce