I will be able to check what the source of this could be on Monday. For now, you should be able to downgrade by specifying the versions of all packages:
sudo apt-get install nvidia-container-toolkit=1.14.3-1 \
nvidia-container-toolkit-base=1.14.3-1 \
libnvidia-container-tools=1.14.3-1 \
libnvidia-container1=1.14.3-1
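To confirm the downgrade took effect, the installed versions can be checked with:
dpkg -l | grep nvidia-container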
@bawee could you provide more information on your setup? How are you running containers? How is the NVIDIA Container Toolkit installed and configured to be used with Docker?
Hi @elezar, Docker was installed as follows:
sudo apt update
sudo apt install -y docker.io
sudo usermod -aG docker ${USER}
sudo systemctl restart docker
The NVIDIA Container Toolkit was installed as follows, per the instructions at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
Then configured using:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
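As a sanity check: after this step, /etc/docker/daemon.json should contain an nvidia runtime entry, and the sample workload from the NVIDIA install guide can be used to verify GPU access:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi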
The containers are run using a nextflow pipeline documented here: https://labs.epi2me.io/workflows/wf-basecalling/
Rolling back to the previous version of nvidia-container-toolkit using your instructions above did not help. The error still appeared even with v1.14.3-1. My identical machine running v1.14.3, which I set up last week, is still working fine.
I also tried completely removing Docker and reinstalling it, using sudo apt-get autoremove -y --purge docker.io.
I hope that is helpful. Please let me know if I need to provide more info. Thank you!
Note that docker.io is listed as a conflicting package here: https://docs.docker.com/engine/install/ubuntu/. Would you be able to install docker-ce instead?
Hi Evan,
Thank you for pointing that out. I had not seen that. I followed the instructions on https://docs.docker.com/engine/install/ubuntu/ and replaced docker.io with docker-ce.
The error message is still the same, unfortunately:
Command error:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
time="2024-01-29T18:00:47Z" level=error msg="error waiting for container: context canceled"
Here is some more information:
$ dpkg -l | grep -i docker
ii docker-buildx-plugin 0.12.1-1~ubuntu.22.04~jammy amd64 Docker Buildx cli plugin.
ii docker-ce 5:25.0.1-1~ubuntu.22.04~jammy amd64 Docker: the open-source application container engine
ii docker-ce-cli 5:25.0.1-1~ubuntu.22.04~jammy amd64 Docker CLI: the open-source application container engine
ii docker-ce-rootless-extras 5:25.0.1-1~ubuntu.22.04~jammy amd64 Rootless support for Docker.
ii docker-compose-plugin 2.24.2-1~ubuntu.22.04~jammy amd64 Docker Compose (V2) plugin for the Docker CLI.
$ nvidia-container-cli info
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory
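As a diagnostic sketch (not from the original report): that message means the NVIDIA driver library cannot be found by the dynamic linker, which can be confirmed with:
ldconfig -p | grep libnvidia-ml
An empty result indicates the driver libraries are missing from the linker cache, i.e. the driver itself is not installed.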
Hi @elezar, it turns out the solution was to run sudo ubuntu-drivers install, which installed nvidia-driver-535. My fresh install of 22.04 had the open-source X.org driver by default (please see screenshot below).
Thanks for helping to troubleshoot, and apologies: I did not see the requirement for the driver in the documentation at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
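For anyone else who hits this: a quick way to check whether the proprietary driver is actually installed and loaded (standard Ubuntu/NVIDIA tooling, not specific to this issue) is:
ubuntu-drivers devices
nvidia-smi
If nvidia-smi fails on the host, the NVIDIA kernel driver is not loaded and the Container Toolkit cannot work.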
Ah, thanks. Yes, we should definitely include the output of nvidia-smi -L on the host in our issue template. The NVIDIA driver is required for the Container Toolkit to function.
Can we close this issue then?
Yes, thank you very much.
I had this same issue, btw, and fixed it with the same solution (installing the drivers based on the docs here: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#driver-installation).
The reason I had this problem is that I thought I had already installed the drivers: the CUDA download instructions very much imply they are installed as the last step, and I went with the legacy non-open "flavor".
But the download instructions do not seem to mention the sudo apt-get install cuda-drivers-535 step from the CUDA installation guide. I am a bit new to CUDA and eager to just get machine learnin' on my newly rented GPU server, so I can't say I fully understand the difference between cuda-drivers-535 from the installation guide instructions and sudo apt-get install -y cuda-drivers from the CUDA download instructions.
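As far as I can tell (not authoritative, just from inspecting the packages), cuda-drivers is a metapackage that pulls in the latest versioned cuda-drivers-XXX, so both commands should end up installing the same driver. The dependency can be inspected with:
apt-cache depends cuda-drivers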
@bawee thanks for posting this question, saved me a lot of time!
Hi @elezar, I am trying to run an application on Ubuntu 24.04 that needs the NVIDIA container runtime, but I am stuck on this same issue. I have no NVIDIA hardware installed, and I am hoping there is a solution such as GPU simulation.
No comments, @elezar? How can I run the application without a GPU?
Thanks, I ran into the same issue and solved it the same way (sudo ubuntu-drivers install).
After reinstalling the CUDA drivers, the CUDA toolkit, and the Container Toolkit, I was still getting the error. The issue was resolved by just reinstalling docker-ce using sudo apt-get install --reinstall docker-ce.
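To verify the fix, one can check that Docker still has the NVIDIA runtime registered after the reinstall (standard Docker tooling):
docker info | grep -i runtimes
nvidia should appear among the listed runtimes as long as the nvidia-ctk runtime configure step is still in effect.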
Hello, I'm getting a load library failed error similar to a previous issue (unsure whether related, hence the new issue) when running a Nextflow pipeline with Docker that uses the NVIDIA Container Toolkit. The error seems to be present only in the new version of the toolkit (1.14.4) and does not occur on an identical computer running version 1.14.3, which I had set up only a few days prior.
Command error:
docker: Error response from daemon: failed to create task for container: failed to create a shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
time="2024-01-24T14:05:51Z" level=error msg="error waiting for container: "
The NVIDIA Container Toolkit was installed using apt on Ubuntu 22.04.3, following the instructions from https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
I tried installing the old version instead (sudo apt install nvidia-container-toolkit=1.14.3-1) but it was unsuccessful due to an unavailable dependency. Thanks in advance!
Originally posted by @bawee in https://github.com/NVIDIA/nvidia-container-toolkit/issues/302#issuecomment-1908427326
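For reference, when pinning to an older toolkit release, all four related packages generally need to be pinned to the same version (as in the downgrade command near the top of this thread); the versions available from the configured repository can be listed with:
apt-cache madison nvidia-container-toolkit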