Closed escorciav closed 6 years ago
I'm tackling nvidia-docker2. Things that didn't work for me
+1 Fedora 27 user similarly stuck and looking for instructions
@escorciav Thanks for volunteering. By the way, pastebin is blocked on our corporate network, can you copy the error? Or provide an attachment.
@flx42, Error in Fedora 27 due to using repo from Centos 7:
$ dnf install nvidia-docker
Failed to synchronize cache for repo 'libnvidia-container', disabling.
Failed to synchronize cache for repo 'nvidia-container-runtime', disabling.
Failed to synchronize cache for repo 'nvidia-docker', disabling.
Last metadata expiration check: 1:08:53 ago on Wed 18 Apr 2018 03:45:34 PM +03.
No match for argument: nvidia-docker
Error: Unable to find a match
The .repo file is:
$ cat /etc/yum.repos.d/nvidia-docker.repo
[libnvidia-container]
name=libnvidia-container
baseurl=https://nvidia.github.io/libnvidia-container/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
[nvidia-container-runtime]
name=nvidia-container-runtime
baseurl=https://nvidia.github.io/nvidia-container-runtime/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/nvidia-container-runtime/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
[nvidia-docker]
name=nvidia-docker
baseurl=https://nvidia.github.io/nvidia-docker/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/nvidia-docker/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
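A quick way to see which distribution path each repo section points at (here all three are centos7) is to print the section name next to its baseurl. This is only a minimal sketch; repo_baseurls is a helper name made up for illustration, not part of dnf.

```shell
# Hypothetical helper: print "section baseurl" pairs from a yum/dnf .repo file.
repo_baseurls() {
  awk '
    # Remember the current [section] name without the brackets.
    /^\[.*\]$/ { section = substr($0, 2, length($0) - 2) }
    # Strip the key and print the URL next to its section.
    /^baseurl=/ { sub(/^baseurl=/, ""); print section, $0 }
  ' "$1"
}
```

For example: repo_baseurls /etc/yum.repos.d/nvidia-docker.repo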
Hi @flx42,
I did a monkey typing approach to build nvidia-docker and nvidia-container-runtime via make. Apparently, everything ran without problems, and I ended up with the following images (output is below).
# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
nvidia-docker2 18.03.0.ce-fedora27 679a30fc3930 About a minute ago 473MB
nvidia/runtime/fedora 27-docker1.12.6 1aa46723e854 16 minutes ago 1.73GB
nvidia/runtime/fedora 27-docker1.13.1 2b1a29593f49 16 minutes ago 1.73GB
nvidia/runtime/fedora 27-docker17.03.2 f933689823b9 16 minutes ago 1.73GB
nvidia/runtime/fedora 27-docker17.06.2 0d3b5e338b44 17 minutes ago 1.74GB
nvidia/runtime/fedora 27-docker17.09.0 4d034d0a9dcb 17 minutes ago 1.74GB
nvidia/runtime/fedora 27-docker17.09.1 50f3191ebdc0 18 minutes ago 1.74GB
nvidia/runtime/fedora 27-docker17.12.0 7e1330b2307a 18 minutes ago 1.73GB
nvidia/runtime/fedora 27-docker17.12.1 95c7427f9a50 19 minutes ago 1.73GB
nvidia/runtime/fedora 27-docker18.03.0 fdb954f2ee40 19 minutes ago 1.73GB
nvidia/hook/fedora 27 aeeaba24d9af 35 minutes ago 837MB
nvidia/base/fedora 27 514c0326c663 38 minutes ago 835MB
fedora 27 9110ae7f579f 6 weeks ago 235MB
# docker --version
Docker version 18.03.0-ce, build 0520e24
Update (April 20 after using solution below)
Apparently, after you build with make, a new folder called dist with the rpm appears 😆.
I guess those .rpm files may work as well.
Did you also try running rpm -i directly on the packages we provide for centos 7?
Where are those files? In the case of nvidia-docker, you only provided an rpm file for nvidia-docker 1.0.
Look at what's suggested here: https://github.com/NVIDIA/nvidia-docker/issues/635#issuecomment-365160098
update: Oct 24 2018
Please follow the strategy suggested here for Fedora 26, maybe it also works in newer versions.
Original message
Apparently, it works. Thanks!
The alternative that worked in my case was:
Clone the repos as follows (executed as root)
LOCALDIR=/var/lib/nvidia-docker-repo
mkdir -p $LOCALDIR && cd $LOCALDIR
git clone -b gh-pages https://github.com/NVIDIA/libnvidia-container.git
git clone -b gh-pages https://github.com/NVIDIA/nvidia-container-runtime.git
git clone -b gh-pages https://github.com/NVIDIA/nvidia-docker.git
Install rpm files manually. Note: do NOT copy-paste if your docker version is not 18.03.0.ce. Edit the last two lines accordingly.
rpm --import $LOCALDIR/nvidia-docker/gpgkey
rpm -i libnvidia-container/centos7/x86_64/libnvidia-container1-1.0.0-0.1.beta.1.x86_64.rpm
rpm -i libnvidia-container/centos7/x86_64/libnvidia-container-tools-1.0.0-0.1.beta.1.x86_64.rpm
rpm -i nvidia-container-runtime/centos7/x86_64/nvidia-container-runtime-hook-1.3.0-1.x86_64.rpm
rpm -i nvidia-container-runtime/centos7/x86_64/nvidia-container-runtime-2.0.0-1.docker18.03.0.x86_64.rpm
rpm -i nvidia-docker/centos7/x86_64/nvidia-docker2-2.0.3-1.docker18.03.0.ce.noarch.rpm
Notes:
git pull
.sudo pkill -SIGHUP dockerd
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
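Since the last two rpm names depend on the docker version, something like the following could pick them out instead of editing the lines by hand. It is a rough sketch: docker_version_tag and matching_rpms are names made up here, and the glob patterns simply assume the file naming seen in the gh-pages clones above.

```shell
# Hypothetical helper: extract "18.03.0" from output such as
# "Docker version 18.03.0-ce, build 0520e24".
docker_version_tag() {
  printf '%s\n' "$1" | sed -E 's/^Docker version ([0-9]+\.[0-9]+\.[0-9]+).*/\1/'
}

# Hypothetical helper: list the nvidia-container-runtime and nvidia-docker2
# rpm files whose names carry that docker tag.
# $1 = tag (e.g. 18.03.0), $2 = directory holding the three cloned repos.
matching_rpms() {
  local tag=$1 repo_root=$2
  # "[0-9]*" after the dash skips the nvidia-container-runtime-hook package.
  ls "$repo_root"/nvidia-container-runtime/centos7/x86_64/nvidia-container-runtime-[0-9]*docker"$tag".x86_64.rpm \
     "$repo_root"/nvidia-docker/centos7/x86_64/nvidia-docker2-*docker"$tag".ce.noarch.rpm 2>/dev/null
}
```

For example: matching_rpms "$(docker_version_tag "$(docker --version)")" "$LOCALDIR". Check the printed files before passing them to rpm -i.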
@escorciav Thanks a lot! However I encountered a weird error when doing "sudo docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi"
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=9.0 --pid=8702 /var/lib/docker/overlay2/f33c9f212b70e1069c28213f71d6a593c6a9e01eb2f4da9cfab15b0692578c6e/merged]\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 1\\n\\"\"": unknown.
I think I've been in seccomp mode
cat /boot/config-$(uname -r) | grep -i seccomp
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y
CONFIG_SECCOMP=y
Sorry, I was attending to an important issue.
I forgot to mention the version of docker that I used. Also, note that I installed (rpm -i []) the packages that match my docker version.
Other than that, I don't know how to help you.
@andys0975 try to update packages - I had the same issue.
(Updated @escorciav manual)
Clone the repos as follows (executed as root)
LOCALDIR=/var/lib/nvidia-docker-repo
mkdir -p $LOCALDIR && cd $LOCALDIR
git clone -b gh-pages https://github.com/NVIDIA/libnvidia-container.git
git clone -b gh-pages https://github.com/NVIDIA/nvidia-container-runtime.git
git clone -b gh-pages https://github.com/NVIDIA/nvidia-docker.git
Install rpm files manually. Note: do NOT copy-paste if your docker version is not 18.03.1.ce. Check ALL(!) packages listed below, especially if you encounter the problem mentioned by @andys0975
rpm -i libnvidia-container/centos7/x86_64/libnvidia-container1-1.0.0-0.1.rc.2.x86_64.rpm
rpm -i libnvidia-container/centos7/x86_64/libnvidia-container-tools-1.0.0-0.1.rc.2.x86_64.rpm
rpm -i nvidia-container-runtime/centos7/x86_64/nvidia-container-runtime-hook-1.4.0-1.x86_64.rpm
rpm -i nvidia-container-runtime/centos7/x86_64/nvidia-container-runtime-2.0.0-1.docker18.03.1.x86_64.rpm
rpm -i nvidia-docker/centos7/x86_64/nvidia-docker2-2.0.3-1.docker18.03.1.ce.noarch.rpm
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
I didn't install the nvidia-docker2 package because I use device mapper with direct-lvm, so I modified /etc/docker/daemon.json manually. It works quite well on 4.17.12-200.fc28.x86_64. I confirmed that PyTorch from the NVIDIA cloud registry works, so I believe the installation and configuration work for the rest.
docker-ce-18.06.0.ce-3.el7.x86_64
nvidia-container-runtime-hook-1.4.0-1.x86_64
libnvidia-container-tools-1.0.0-0.1.rc.2.x86_64
nvidia-container-runtime-2.0.0-1.docker18.06.0.x86_64
libnvidia-container1-1.0.0-0.1.rc.2.x86_64
/etc/docker/daemon.json
{
"storage-driver": "devicemapper",
"storage-opts": [
"dm.thinpooldev=/dev/mapper/docker-thinpool",
"dm.basesize=100G",
"dm.use_deferred_removal=true",
"dm.use_deferred_deletion=true"
],
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
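When editing daemon.json by hand like this, a quick check that the nvidia runtime entry is really there before restarting docker may save a round trip. A minimal sketch follows; has_nvidia_runtime is a made-up helper name, and it is a plain text match, not a JSON parser.

```shell
# Hypothetical helper: succeed if the file has an "nvidia" key and a "path"
# entry ending in nvidia-container-runtime. Text heuristic only; it does not
# validate the JSON itself.
has_nvidia_runtime() {
  local cfg=$1
  grep -q '"nvidia"[[:space:]]*:' "$cfg" &&
    grep -q '"path"[[:space:]]*:[[:space:]]*"[^"]*nvidia-container-runtime"' "$cfg"
}
```

For example: has_nvidia_runtime /etc/docker/daemon.json && echo "nvidia runtime configured"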
I did the procedure from https://github.com/NVIDIA/nvidia-docker/issues/553#issuecomment-381075335 and it succeeded. Thx @escorciav
For people searching, procedure https://github.com/NVIDIA/nvidia-docker/issues/553#issuecomment-381075335 works even in Fedora 34 with just a few more steps.
First make sure that you have installed both the nvidia drivers and cuda on your host system (install them from RPM Fusion).
After executing the commands in the linked comment you have to edit the /etc/nvidia-container-runtime/config.toml config.
Make sure to have this line: no-cgroups = true (by default it should be commented out and set to false)
Restart docker with systemctl.
Now you should be able to run your gpu containers in privileged mode (--privileged flag).
Leaving out the privileged mode will probably lead to "Unknown error" or logs complaining that there are missing libraries, and a container that does not work.
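The config.toml edit described above could be scripted roughly as follows. This is a sketch, not an official tool: enable_no_cgroups is a made-up name, and it only rewrites an existing no-cgroups line (commented or not); it does not add one if the file has none.

```shell
# Hypothetical helper: turn "#no-cgroups = false" (or any existing no-cgroups
# line) into "no-cgroups = true". The file is rewritten in place through a
# temporary copy so its ownership and permissions are kept.
enable_no_cgroups() {
  local cfg=$1 tmp
  tmp=$(mktemp) || return 1
  sed -E 's/^[[:space:]]*#?[[:space:]]*no-cgroups[[:space:]]*=.*/no-cgroups = true/' "$cfg" > "$tmp" &&
    cat "$tmp" > "$cfg"
  local status=$?
  rm -f "$tmp"
  return "$status"
}
```

Run as root, for example: enable_no_cgroups /etc/nvidia-container-runtime/config.toml && systemctl restart docker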
Here's what I just did based on @rickycorte 's instructions and https://github.com/NVIDIA/nvidia-docker/issues/553#issuecomment-381075335 to get nvidia-docker working with Fedora 34:
sudo dnf remove docker \
docker-client \
docker-client-latest \
docker-common \
docker-latest \
docker-latest-logrotate \
docker-logrotate \
docker-selinux \
docker-engine-selinux \
docker-engine
Use the centos8 repo instead of centos7:
curl -s -L https://nvidia.github.io/nvidia-docker/centos8/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo dnf install nvidia-docker2
Edit /etc/nvidia-container-runtime/config.toml: no-cgroups = true
sudo systemctl start docker
docker run --privileged --runtime=nvidia --rm nvidia/cuda:11.3.0-devel-ubuntu18.04 nvidia-smi
Tue Jun 1 05:17:46 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.31 Driver Version: 465.31 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
For Fedora Workstation 36:
sudo dnf remove moby-engine
What I did for Fedora Workstation 36:
Uninstalled and reinstalled Nvidia Driver through Gnome Software and it worked. https://www.reddit.com/r/Fedora/comments/unfbel/comment/i89qnwp/
then
sudo dnf install xorg-x11-drv-nvidia-cuda
Since instructions are spread all over the place, here's all the commands I ran on Fedora 36:
# Uninstall old docker engine
sudo dnf remove moby-engine
# Get latest docker engine
# https://docs.docker.com/engine/install/fedora/#install-using-the-repository
sudo dnf config-manager --add-repo https://download.docker.com/linux/fedora/docker-ce.repo
sudo dnf install docker-ce docker-ce-cli containerd.io docker-compose-plugin
# Get nvidia container toolkit, using the centos8 repo
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
curl -s -L https://nvidia.github.io/libnvidia-container/centos8/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install nvidia-docker2
# Restart docker daemon and verify that it is working
sudo systemctl restart docker.service
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
Don't copy-paste commands into your terminal blindly, especially not with sudo involved! Double-check that all URLs point to the correct servers, or even better, copy them from the official instructions instead of trusting strangers on GitHub.
@JohanAR just a note: you should be able to use moby-engine on Fedora as long as you:
install the nvidia-container-toolkit package and not nvidia-docker2;
update the /etc/docker/daemon.json file to include the nvidia runtime and then restart the docker service:
{
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
}
}
Note that when using docker run --gpus all even this is required, but it is recommended that the runtime be specified explicitly:
docker run --rm --gpus all --runtime nvidia nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
@elezar using GPU in docker suddenly stopped working yesterday after updating packages (included both some stuff from official docker repos and nvidia-firmwares from fedora repos, but I don't know exactly what caused it). So I thought I'd try uninstalling from docker-ce repo and try using moby-engine instead.
Now I'm getting Failed to initialize NVML: Insufficient Permissions though, due to some SELinux stuff. Tried reinstalling container-selinux but that didn't help either.
Seems like it's only when I try to run nvidia-smi in the nvidia/cuda container.. I could run my stable diffusion webui just fine, but I don't know if that has anything to do with that image being created yesterday before these problems started..
@JohanAR please create a new ticket against https://github.com/NVIDIA/nvidia-container-toolkit with details of your setup (including installed versions of the *nvidia-container* packages) and the behaviour that you are seeing.
Had the same issue after an update, pretty sure it came from the official Fedora repo, but I wouldn't know which package.
As a workaround I set the default runtime to nvidia in /etc/docker/daemon.json and commented out the "runtime" argument in my docker-compose files. That seems to do it for now, but I would like to get back to explicitly declaring the runtime per service.
Oh, nope, never mind, setting the default runtime to nvidia did not work. It seemed to after systemctl restart docker, but after a reboot all I get now is an error for any container I try to start. Even a simple one:
[root@mediaserv yaml-test]# docker run hello-world
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: nvidia-container-runtime did not terminate successfully: exit status 1: unknown.
ERRO[0000] error waiting for container: context canceled
[root@mediaserv yaml-test]#
From what I recall I'd used essentially the same setup that @JohanAR described 5 days ago (above), however, I didn't have moby installed beforehand. Been operating that way since moving from Fedora 34 to 35 about 6 months ago-ish.
Aha! Found these in my dnf history from 4 days ago:
Upgrade libnvidia-container-devel-1.11.0-1.x86_64 @libnvidia-container
Upgraded libnvidia-container-devel-1.10.0-1.x86_64 @@System
Upgrade libnvidia-container-static-1.11.0-1.x86_64 @libnvidia-container
Upgraded libnvidia-container-static-1.10.0-1.x86_64 @@System
Upgrade libnvidia-container-tools-1.11.0-1.x86_64 @libnvidia-container
Upgraded libnvidia-container-tools-1.10.0-1.x86_64 @@System
Upgrade libnvidia-container1-1.11.0-1.x86_64 @libnvidia-container
Upgraded libnvidia-container1-1.10.0-1.x86_64 @@System
Upgrade libnvidia-container1-debuginfo-1.11.0-1.x86_64 @libnvidia-container
Upgraded libnvidia-container1-debuginfo-1.10.0-1.x86_64 @@System
Upgrade nvidia-container-toolkit-1.11.0-1.x86_64 @libnvidia-container
Upgraded nvidia-container-toolkit-1.10.0-1.x86_64 @@System
I removed the 1.11 version and nvidia-docker2, installed the 1.10 version, reinstalled nvidia-docker2. And it works as it did before now.
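An alternative to removing and reinstalling might be to downgrade the packages to a pinned version-release. The sketch below only builds the package specs; pinned_toolkit_packages is a made-up name, the three package names are picked from the dnf history above as an assumption, and I have not verified this against the 1.11.0 regression. If the -devel, -static or -debuginfo packages are installed too, they would need the same version.

```shell
# Hypothetical helper: print package specs pinned to one version-release,
# e.g. "1.10.0-1". The package list is an assumption; adjust it to what
# "rpm -qa | grep nvidia-container" shows on your system.
pinned_toolkit_packages() {
  local vr=$1 pkg
  for pkg in libnvidia-container1 libnvidia-container-tools nvidia-container-toolkit; do
    printf '%s-%s\n' "$pkg" "$vr"
  done
}
```

For example: sudo dnf downgrade $(pinned_toolkit_packages 1.10.0-1)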
@PriamX it seems as if there may be a regression in our 1.11.0 packages, although we didn't see this behaviour in our testing.
Would you be able to reproduce the failures with debug logging enabled (uncomment the #debug = lines in /etc/nvidia-container-runtime/config.toml) and provide the /var/log/nvidia-container-runtime.log file? (ideally as an issue under https://github.com/NVIDIA/nvidia-container-toolkit)
Update: I see that you have already created https://github.com/NVIDIA/nvidia-container-toolkit/issues/34. Let's continue the discussion there.
@elezar I did open issue #34 not long after I posted here. Saw you posted there. I'll move over to that conversation. Thanks!
@elezar @JohanAR Does this work on Fedora 37?
@airtonix no idea since I'm still using Fedora 36. However everything started working again after a couple of months, though I don't know exactly which package version that fixed it.
@airtonix for recent rpm-based distributions the first step is to install the centos8 packages. Then, our stack has changed quite a bit since the original post, and we no longer recommend that users install nvidia-docker2. Instead, our docs recommend (or should if they have not yet been updated) installing the nvidia-container-toolkit package and using the nvidia-ctk runtime configure command to apply the necessary configuration changes to the container engine such as docker.
Running
sudo nvidia-ctk runtime configure --runtime docker --config /etc/docker/daemon.json
Will update the config to include the nvidia runtime.
Restarting the docker daemon is still required to update the config.
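Put together, the steps above could look roughly like this. It is a sketch under assumptions: the centos8 repo file is already in /etc/yum.repos.d, and run and setup_nvidia_runtime are helper names made up here. With DRY_RUN=1 the commands are only printed, so they can be reviewed before anything runs with sudo.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around the steps described in this thread.
# Assumes the NVIDIA repo file is already installed.
setup_nvidia_runtime() {
  # Install the toolkit package (not nvidia-docker2).
  run sudo dnf install -y nvidia-container-toolkit
  # Add the nvidia runtime to the docker daemon config.
  run sudo nvidia-ctk runtime configure --runtime docker --config /etc/docker/daemon.json
  # The daemon must be restarted to pick up the new config.
  run sudo systemctl restart docker.service
  # Smoke test with the runtime specified explicitly.
  run docker run --rm --gpus all --runtime nvidia nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
}
```

For example: DRY_RUN=1 setup_nvidia_runtime to review the commands, then setup_nvidia_runtime to run them.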
Note there should be no technical reason for the stack to not work on newer fedora distributions.
What I did for Fedora Workstation 38:
sudo dnf install xorg-x11-drv-nvidia-cuda
Test:
docker run --privileged --runtime=nvidia --rm nvidia/cuda:12.1.1-devel-ubuntu22.04 nvidia-smi
==========
== CUDA ==
==========
CUDA Version 12.1.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
Mon May 1 13:58:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03 Driver Version: 530.41.03 CUDA Version: 12.1 |
Given that there is no direct support for Fedora, can we put a precise guide here? The idea is to make nvidia-docker work in Fedora. I volunteer to try out previous approaches on my system, Fedora 27. I installed it very recently and it's almost brand new.