NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

Fedora installation procedure #706

Status: closed by escorciav 6 years ago

escorciav commented 6 years ago

Given that there is no direct support for Fedora, can we put a precise guide here? The idea is to make nvidia-docker work on Fedora. I volunteer to try out previous approaches on my system, Fedora 27. I installed it very recently, so it's almost brand new.

escorciav commented 6 years ago

I'm tackling nvidia-docker2. Things that didn't work for me

rbavery commented 6 years ago

+1 Fedora 27 user similarly stuck and looking for instructions

flx42 commented 6 years ago

@escorciav Thanks for volunteering. By the way, pastebin is blocked on our corporate network, can you copy the error? Or provide an attachment.

escorciav commented 6 years ago

@flx42, error on Fedora 27 due to using the repo from CentOS 7:

$ dnf install nvidia-docker
Failed to synchronize cache for repo 'libnvidia-container', disabling.
Failed to synchronize cache for repo 'nvidia-container-runtime', disabling.
Failed to synchronize cache for repo 'nvidia-docker', disabling.
Last metadata expiration check: 1:08:53 ago on Wed 18 Apr 2018 03:45:34 PM +03.
No match for argument: nvidia-docker
Error: Unable to find a match

The .repo file is:

$ cat /etc/yum.repos.d/nvidia-docker.repo
[libnvidia-container]
name=libnvidia-container
baseurl=https://nvidia.github.io/libnvidia-container/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

[nvidia-container-runtime]
name=nvidia-container-runtime
baseurl=https://nvidia.github.io/nvidia-container-runtime/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/nvidia-container-runtime/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

[nvidia-docker]
name=nvidia-docker
baseurl=https://nvidia.github.io/nvidia-docker/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/nvidia-docker/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

escorciav commented 6 years ago

Hi @flx42, I took a monkey-typing approach to build nvidia-docker and nvidia-container-runtime via make. Apparently, everything ran without problems, and I ended up with the following images (output is below).

# docker images            
REPOSITORY              TAG                   IMAGE ID            CREATED              SIZE
nvidia-docker2          18.03.0.ce-fedora27   679a30fc3930        About a minute ago   473MB
nvidia/runtime/fedora   27-docker1.12.6       1aa46723e854        16 minutes ago       1.73GB
nvidia/runtime/fedora   27-docker1.13.1       2b1a29593f49        16 minutes ago       1.73GB
nvidia/runtime/fedora   27-docker17.03.2      f933689823b9        16 minutes ago       1.73GB
nvidia/runtime/fedora   27-docker17.06.2      0d3b5e338b44        17 minutes ago       1.74GB
nvidia/runtime/fedora   27-docker17.09.0      4d034d0a9dcb        17 minutes ago       1.74GB
nvidia/runtime/fedora   27-docker17.09.1      50f3191ebdc0        18 minutes ago       1.74GB
nvidia/runtime/fedora   27-docker17.12.0      7e1330b2307a        18 minutes ago       1.73GB
nvidia/runtime/fedora   27-docker17.12.1      95c7427f9a50        19 minutes ago       1.73GB
nvidia/runtime/fedora   27-docker18.03.0      fdb954f2ee40        19 minutes ago       1.73GB
nvidia/hook/fedora      27                    aeeaba24d9af        35 minutes ago       837MB
nvidia/base/fedora      27                    514c0326c663        38 minutes ago       835MB
fedora                  27                    9110ae7f579f        6 weeks ago          235MB
# docker --version 
Docker version 18.03.0-ce, build 0520e24

Update (April 20, after using the solution below): apparently, after you build with make, a new folder called dist containing the .rpm files appears 😆. I guess those .rpm files may work as well.

flx42 commented 6 years ago

Did you also try running rpm -i directly on the packages we provide for CentOS 7?

escorciav commented 6 years ago

Where are those files? In the case of nvidia-docker, you only provide an rpm file for nvidia-docker 1.0.

flx42 commented 6 years ago

Look at what's suggested here: https://github.com/NVIDIA/nvidia-docker/issues/635#issuecomment-365160098

escorciav commented 6 years ago

update: Oct 24 2018

Please follow the strategy suggested here for Fedora 26, maybe it also works in newer versions.

Original message

Apparently, it works. Thanks!

The alternative that worked in my case was:

  1. Clone the repos as follows (executed as root)

    LOCALDIR=/var/lib/nvidia-docker-repo
    mkdir -p $LOCALDIR && cd $LOCALDIR
    git clone -b gh-pages https://github.com/NVIDIA/libnvidia-container.git
    git clone -b gh-pages https://github.com/NVIDIA/nvidia-container-runtime.git
    git clone -b gh-pages https://github.com/NVIDIA/nvidia-docker.git
  2. Install the rpm files manually. Note: do NOT copy-paste if your docker version is not 18.03.0.ce; edit the last two lines accordingly.

    rpm --import $LOCALDIR/nvidia-docker/gpgkey
    rpm -i libnvidia-container/centos7/x86_64/libnvidia-container1-1.0.0-0.1.beta.1.x86_64.rpm
    rpm -i libnvidia-container/centos7/x86_64/libnvidia-container-tools-1.0.0-0.1.beta.1.x86_64.rpm
    rpm -i nvidia-container-runtime/centos7/x86_64/nvidia-container-runtime-hook-1.3.0-1.x86_64.rpm
    rpm -i nvidia-container-runtime/centos7/x86_64/nvidia-container-runtime-2.0.0-1.docker18.03.0.x86_64.rpm
    rpm -i nvidia-docker/centos7/x86_64/nvidia-docker2-2.0.3-1.docker18.03.0.ce.noarch.rpm

    Notes:

    • According to the issue mentioned by @flx42, you can update it by doing git pull.
    • I tried to set up the yum repo but kept receiving an error when loading the repo. I guess I am not registering the .repo file properly.
    • Tested by doing:
      sudo pkill -SIGHUP dockerd
      docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
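The version matching that step 2 warns about can be sketched in Python. This is only an illustration, assuming the file naming seen in the commands above (a `dockerX.Y.Z` tag embedded in some rpm file names); the helper name `rpms_for_docker` is made up for this example and is not part of any NVIDIA tool.

```python
import re


def rpms_for_docker(filenames, docker_version):
    """Keep rpm files that are untagged or tagged for the given docker version.

    `docker_version` is the version printed by `docker --version`,
    e.g. "18.03.0-ce". File names are assumed to follow the pattern used in
    this thread, e.g. "...-2.0.0-1.docker18.03.0.x86_64.rpm".
    """
    # Drop the "-ce" suffix so "18.03.0-ce" compares equal to "18.03.0".
    wanted = docker_version.split("-")[0]
    selected = []
    for name in filenames:
        tag = re.search(r"\.docker(\d+\.\d+\.\d+)", name)
        # Files without a docker tag (libnvidia-container, the hook) always apply.
        if tag is None or tag.group(1) == wanted:
            selected.append(name)
    return selected
```

Feeding it the directory listing of the cloned gh-pages repos and the local docker version would give the list of files to pass to rpm -i.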

andys0975 commented 6 years ago

@escorciav Thanks a lot! However I encountered a weird error when doing "sudo docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi"

docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=9.0 --pid=8702 /var/lib/docker/overlay2/f33c9f212b70e1069c28213f71d6a593c6a9e01eb2f4da9cfab15b0692578c6e/merged]\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 1\\n\\"\"": unknown.

I think I've been in seccomp mode:

$ cat /boot/config-$(uname -r) | grep -i seccomp
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y
CONFIG_SECCOMP=y

escorciav commented 6 years ago

Sorry, I was attending to an important issue.

I forgot to mention the version of docker that I used. Also, note that I installed (rpm -i []) the packages that match my docker version.

Other than that, I don't know how to help you.

pawelmarkowski commented 6 years ago

@andys0975 try to update packages - I had the same issue.

(Updated @escorciav manual)

Clone the repos as follows (executed as root)

LOCALDIR=/var/lib/nvidia-docker-repo
mkdir -p $LOCALDIR && cd $LOCALDIR
git clone -b gh-pages https://github.com/NVIDIA/libnvidia-container.git
git clone -b gh-pages https://github.com/NVIDIA/nvidia-container-runtime.git
git clone -b gh-pages https://github.com/NVIDIA/nvidia-docker.git

Install rpm files manually. Note: do NOT copy-paste if your docker version is not 18.03.1.ce. Check ALL(!) packages listed below, especially if you encounter the problem mentioned by @andys0975.

rpm -i libnvidia-container/centos7/x86_64/libnvidia-container1-1.0.0-0.1.rc.2.x86_64.rpm
rpm -i libnvidia-container/centos7/x86_64/libnvidia-container-tools-1.0.0-0.1.rc.2.x86_64.rpm
rpm -i nvidia-container-runtime/centos7/x86_64/nvidia-container-runtime-hook-1.4.0-1.x86_64.rpm
rpm -i nvidia-container-runtime/centos7/x86_64/nvidia-container-runtime-2.0.0-1.docker18.03.1.x86_64.rpm
rpm -i nvidia-docker/centos7/x86_64/nvidia-docker2-2.0.3-1.docker18.03.1.ce.noarch.rpm

docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

rickyzhang82 commented 6 years ago

I didn't install the nvidia-docker2 package because I use device mapper with direct-lvm.

So I modified /etc/docker/daemon.json manually. It works quite well on kernel 4.17.12-200.fc28.x86_64. I confirmed that PyTorch from the NVIDIA cloud registry works. I believe the installation and configuration work for the rest as well.

docker-ce-18.06.0.ce-3.el7.x86_64
nvidia-container-runtime-hook-1.4.0-1.x86_64
libnvidia-container-tools-1.0.0-0.1.rc.2.x86_64
nvidia-container-runtime-2.0.0-1.docker18.06.0.x86_64
libnvidia-container1-1.0.0-0.1.rc.2.x86_64

/etc/docker/daemon.json

{
    "storage-driver": "devicemapper",
    "storage-opts": [
    "dm.thinpooldev=/dev/mapper/docker-thinpool",
    "dm.basesize=100G",
    "dm.use_deferred_removal=true",
    "dm.use_deferred_deletion=true"
    ],    
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
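For anyone who would rather script the manual edit than do it by hand, here is a minimal Python sketch of the merge. It assumes the runtime binary lives at the path shown in the config above; the function name `add_nvidia_runtime` is hypothetical, and the sketch only transforms text, it does not touch /etc/docker/daemon.json itself.

```python
import json


def add_nvidia_runtime(daemon_json_text, path="/usr/bin/nvidia-container-runtime"):
    """Return daemon.json text with an "nvidia" runtime entry added.

    Existing settings (storage driver, other runtimes, ...) are kept as they are.
    An empty input is treated as an empty configuration.
    """
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    runtimes = config.setdefault("runtimes", {})
    runtimes["nvidia"] = {"path": path, "runtimeArgs": []}
    return json.dumps(config, indent=4)
```

The returned text would still have to be written back as root, and docker restarted, before the runtime becomes available.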

escorciav commented 5 years ago

Quick update: regarding the procedure @OleRoel mentioned here, I made a silly mistake when following it. It works like a charm on a Fedora 26 machine. I highly recommend trying this approach first, as it's much easier.

jamesdbrock commented 5 years ago

I did the procedure from https://github.com/NVIDIA/nvidia-docker/issues/553#issuecomment-381075335 and it succeeded. Thx @escorciav

rickycorte commented 3 years ago

For people searching, procedure https://github.com/NVIDIA/nvidia-docker/issues/553#issuecomment-381075335 works even in Fedora 34 with just a few more steps.

First make sure that you have installed both the nvidia drivers and cuda on your host system (install them from RPM Fusion).

After executing the commands in the linked comment, you have to edit the /etc/nvidia-container-runtime/config.toml config file. Make sure it has this line: no-cgroups = true (by default it should be commented out and set to false). Restart docker with systemctl. Now you should be able to run your GPU containers in privileged mode (--privileged flag).

Leaving out privileged mode will probably lead to an "Unknown error" or to logs complaining about missing libraries, and a container that does not work.
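The config.toml change described above can also be scripted. A rough Python sketch follows, assuming the key appears in the file either commented out or uncommented, as described; `enable_no_cgroups` is a made-up helper name and it works on the file's text only.

```python
import re


def enable_no_cgroups(config_text):
    """Return config.toml text with `no-cgroups = true` set.

    Handles the commented default (`#no-cgroups = false`) as well as an
    already uncommented line. If the key is missing entirely, the text is
    returned unchanged, because the key belongs inside a specific TOML table.
    """
    pattern = re.compile(r"^[ \t]*#?[ \t]*no-cgroups[ \t]*=.*$", re.MULTILINE)
    return pattern.sub("no-cgroups = true", config_text)
```

After writing the result back to /etc/nvidia-container-runtime/config.toml as root, docker still needs to be restarted.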

jamesdbrock commented 3 years ago

Here's what I just did based on @rickycorte 's instructions and https://github.com/NVIDIA/nvidia-docker/issues/553#issuecomment-381075335 to get nvidia-docker working with Fedora 34:

sudo dnf remove docker \
                  docker-client \
                  docker-client-latest \
                  docker-common \
                  docker-latest \
                  docker-latest-logrotate \
                  docker-logrotate \
                  docker-selinux \
                  docker-engine-selinux \
                  docker-engine

Use centos8 repo instead of centos7

curl -s -L https://nvidia.github.io/nvidia-docker/centos8/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo

sudo dnf install nvidia-docker2

Edit /etc/nvidia-container-runtime/config.toml and set no-cgroups = true

sudo systemctl start docker

docker run --privileged --runtime=nvidia --rm nvidia/cuda:11.3.0-devel-ubuntu18.04 nvidia-smi
Tue Jun  1 05:17:46 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.31       Driver Version: 465.31       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+

raffaem commented 2 years ago

For Fedora Workstation 36:

  1. Run sudo dnf remove moby-engine
  2. Install Docker Engine following these instructions
  3. Follow jamesdbrock's instructions

jamesdbrock commented 2 years ago

What I did for Fedora Workstation 36:

Uninstalled and reinstalled Nvidia Driver through Gnome Software and it worked. https://www.reddit.com/r/Fedora/comments/unfbel/comment/i89qnwp/

then

sudo dnf install xorg-x11-drv-nvidia-cuda

JohanAR commented 2 years ago

Since instructions are spread all over the place, here's all the commands I ran on Fedora 36:

# Uninstall old docker engine
sudo dnf remove moby-engine

# Get latest docker engine
# https://docs.docker.com/engine/install/fedora/#install-using-the-repository
sudo dnf config-manager --add-repo https://download.docker.com/linux/fedora/docker-ce.repo
sudo dnf install docker-ce docker-ce-cli containerd.io docker-compose-plugin

# Get nvidia container toolkit, using the centos8 repo
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
curl -s -L https://nvidia.github.io/libnvidia-container/centos8/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install nvidia-docker2

# Restart docker daemon and verify that it is working
sudo systemctl restart docker.service
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

Don't copy-paste commands into your terminal blindly, especially not with sudo involved! Double-check that all URLs point to the correct servers, or even better, copy them from the official instructions instead of trusting strangers on GitHub.
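As one way to act on that warning, the host of each URL can be compared against a short allowlist before the URL is fed to curl or dnf. This is a minimal Python sketch; the two hosts are simply the ones used in the commands above, and `TRUSTED_HOSTS` and `is_trusted_url` are names invented for this example.

```python
from urllib.parse import urlparse

# Hosts taken from the commands above; extend the set for your own setup.
TRUSTED_HOSTS = {"download.docker.com", "nvidia.github.io"}


def is_trusted_url(url, trusted_hosts=TRUSTED_HOSTS):
    """Return True only for https URLs whose host is exactly in the allowlist."""
    parts = urlparse(url)
    return parts.scheme == "https" and parts.hostname in trusted_hosts
```

The exact-match comparison is deliberate: a look-alike host that merely starts with a trusted name is rejected.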

elezar commented 2 years ago

@JohanAR just a note: You should be able to use moby-engine on Fedora as long as you:

  1. Install the nvidia-container-toolkit package and not nvidia-docker2
  2. Configure your /etc/docker/daemon.json file to include the nvidia runtime and then restart the docker service:
    {
        "runtimes": {
            "nvidia": {
                "path": "nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }

    Note that when using docker run --gpus all even this is not strictly required, but it is recommended that the runtime be specified explicitly:

    docker run --rm --gpus all --runtime nvidia nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

JohanAR commented 2 years ago

@elezar using GPU in docker suddenly stopped working yesterday after updating packages (included both some stuff from official docker repos and nvidia-firmwares from fedora repos, but I don't know exactly what caused it). So I thought I'd try uninstalling from docker-ce repo and try using moby-engine instead.

Now I'm getting Failed to initialize NVML: Insufficient Permissions though, due to some SELinux stuff. Tried reinstalling container-selinux but that didn't help either

Seems like it's only when I try to run nvidia-smi in the nvidia/cuda container.. I could run my stable diffusion webui just fine, but I don't know if that has anything to do with that image being created yesterday before these problems started..

elezar commented 2 years ago

@JohanAR please create a new ticket against https://github.com/NVIDIA/nvidia-container-toolkit with details of your setup (including installed versions of the *nvidia-container* packages) and the behaviour that you are seeing.

PriamX commented 2 years ago

@JohanAR wrote: "@elezar using GPU in docker suddenly stopped working yesterday after updating packages (included both some stuff from official docker repos and nvidia-firmwares from fedora repos, but I don't know exactly what caused it). So I thought I'd try uninstalling from docker-ce repo and try using moby-engine instead."

Had the same issue after an update, pretty sure it came from the official Fedora repo, but I wouldn't know which package.

As a work around I set the default runtime to nvidia in /etc/docker/daemon.json and commented out the "runtime" argument in my docker-compose files. That seems to do it for now, but I would like to get back to explicitly declaring the runtime per service.

PriamX commented 2 years ago

Oh, nope, nevermind, setting the default runtime to nvidia did not work. It seemed to after systemctl restart docker, but after a reboot all I get now is an error for any container I try to start. Even a simple one:

[root@mediaserv yaml-test]# docker run hello-world
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: nvidia-container-runtime did not terminate successfully: exit status 1: unknown.
ERRO[0000] error waiting for container: context canceled
[root@mediaserv yaml-test]#

From what I recall I'd used essentially the same setup that @JohanAR described 5 days ago (above), however, I didn't have moby installed beforehand. Been operating that way since moving from Fedora 34 to 35 about 6 months ago-ish.

PriamX commented 2 years ago

Aha! Found these in my dnf history from 4 days ago:

    Upgrade  libnvidia-container-devel-1.11.0-1.x86_64            @libnvidia-container
    Upgraded libnvidia-container-devel-1.10.0-1.x86_64            @@System
    Upgrade  libnvidia-container-static-1.11.0-1.x86_64           @libnvidia-container
    Upgraded libnvidia-container-static-1.10.0-1.x86_64           @@System
    Upgrade  libnvidia-container-tools-1.11.0-1.x86_64            @libnvidia-container
    Upgraded libnvidia-container-tools-1.10.0-1.x86_64            @@System
    Upgrade  libnvidia-container1-1.11.0-1.x86_64                 @libnvidia-container
    Upgraded libnvidia-container1-1.10.0-1.x86_64                 @@System
    Upgrade  libnvidia-container1-debuginfo-1.11.0-1.x86_64       @libnvidia-container
    Upgraded libnvidia-container1-debuginfo-1.10.0-1.x86_64       @@System
    Upgrade  nvidia-container-toolkit-1.11.0-1.x86_64             @libnvidia-container
    Upgraded nvidia-container-toolkit-1.10.0-1.x86_64             @@System

I removed the 1.11 version and nvidia-docker2, installed the 1.10 version, reinstalled nvidia-docker2. And it works as it did before now.
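To spot this kind of version jump quickly, the history lines can be parsed into old and new versions per package. A rough Python sketch, assuming lines in exactly the format pasted above (action, then name-version-release.arch, then repo); `parse_upgrades` is a hypothetical helper, not a dnf feature.

```python
def parse_upgrades(history_text):
    """Map package name -> (old_version, new_version) from the listing above.

    "Upgrade" lines carry the new version, "Upgraded" lines the old one.
    Lines that do not fit the expected shape are skipped.
    """
    old, new = {}, {}
    for line in history_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] not in ("Upgrade", "Upgraded"):
            continue
        # "name-version-release.arch": strip the arch, then split from the right,
        # because the package name itself may contain dashes.
        nvr = fields[1].rsplit(".", 1)[0]
        parts = nvr.rsplit("-", 2)
        if len(parts) != 3:
            continue
        name, version, _release = parts
        (new if fields[0] == "Upgrade" else old)[name] = version
    return {name: (old[name], new[name]) for name in new if name in old}
```

Any package whose two versions differ in the result is a candidate for the downgrade described above.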

elezar commented 2 years ago

@PriamX it seems as if there may be a regression in our 1.11.0 packages -- although we didn't see this behaviour in our testing.

Would you be able to reproduce the failures with debug logging enabled (uncomment the #debug = lines in /etc/nvidia-container-runtime/config.toml) and provide the /var/log/nvidia-container-runtime.log file? (ideally as an issue under https://github.com/NVIDIA/nvidia-container-toolkit)

Update: I see that you have already created https://github.com/NVIDIA/nvidia-container-toolkit/issues/34 so let's continue the discussion there.

PriamX commented 2 years ago

@elezar I did open issue #34 not long after I posted here. Saw you posted there. I'll move over to that conversation. Thanks!

airtonix commented 1 year ago

@elezar @JohanAR Do the Fedora 36 instructions above also work on Fedora 37?

JohanAR commented 1 year ago

@airtonix no idea, since I'm still using Fedora 36. However, everything started working again after a couple of months, though I don't know exactly which package version fixed it.

elezar commented 1 year ago

@airtonix for recent rpm-based distributions the first step is to install the centos8 packages. Then, our stack has changed quite a bit since the original post, and we no longer recommend that users install nvidia-docker2. Instead, our docs recommend (or should if they have not yet been updated) installing the nvidia-container-toolkit package and using the nvidia-ctk runtime configure command to apply the necessary configuration changes to the container engine such as docker.

Running

sudo nvidia-ctk runtime configure --runtime docker --config /etc/docker/daemon.json

Will update the config to include the nvidia runtime.

Restarting the docker daemon is still required to update the config.

Note there should be no technical reason for the stack to not work on newer fedora distributions.
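After running the configure command and before restarting docker, it can be worth confirming that the runtime entry really ended up in the config. A small Python sketch of such a check, assuming only that the runtime is registered under the name "nvidia" in the "runtimes" object; `has_nvidia_runtime` is a name made up for this example.

```python
import json


def has_nvidia_runtime(daemon_json_text):
    """Return True if the docker daemon config text registers a runtime named "nvidia"."""
    try:
        config = json.loads(daemon_json_text)
    except json.JSONDecodeError:
        # An unparsable file cannot be said to contain the runtime.
        return False
    if not isinstance(config, dict):
        return False
    runtimes = config.get("runtimes")
    return isinstance(runtimes, dict) and "nvidia" in runtimes
```

It takes the file contents as a string, so it could be called with the text read from /etc/docker/daemon.json.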

jamesdbrock commented 1 year ago

What I did for Fedora Workstation 38:

  1. Uninstalled and reinstalled Nvidia Graphics Driver through Gnome Software. https://www.reddit.com/r/Fedora/comments/unfbel/comment/i89qnwp/
  2. sudo dnf install xorg-x11-drv-nvidia-cuda

Test:

docker run --privileged --runtime=nvidia --rm nvidia/cuda:12.1.1-devel-ubuntu22.04 nvidia-smi
==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Mon May  1 13:58:41 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |