The Podman team would like to work with you guys to get this to work well in both rootful and rootless containers if possible. But we need someone to work with.
@mheon @baude FYI
@sjug FYI
Hello!
@rhatdan do you mind filling the following issue template: https://github.com/NVIDIA/nvidia-docker/blob/master/.github/ISSUE_TEMPLATE.md
Thanks!
I can work with the podman team.
@hholst80 FYI
@nvjmayo Thanks for the suggestions. Some good news and some less good news.
This works rootless: podman run --rm --hooks-dir /usr/share/containers/oci/hooks.d nvcr.io/nvidia/cuda nvidia-smi
The same command continues to fail with the image docker.io/nvidia/cuda.
In fact, rootless works with or without /usr/share/containers/oci/hooks.d/01-nvhook.json installed when using the image nvcr.io/nvidia/cuda.
Running as root continues to fail when no-cgroups = true for either container, returning: Failed to initialize NVML: Unknown Error
Strange, I would not expect podman to run a hook that did not have a JSON file describing the hook.
@eaepstein I'm still struggling to reproduce the issue you see. Using docker.io/nvidia/cuda also works for me with the hooks dir.
$ podman run --rm --hooks-dir /usr/share/containers/oci/hooks.d/ docker.io/nvidia/cuda nvidia-smi
Tue Oct 22 21:35:44 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 710 Off | 00000000:65:00.0 N/A | N/A |
| 50% 38C P0 N/A / N/A | 0MiB / 2001MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
without the hook I would expect to see a failure roughly like:
Error: time="2019-10-22T14:35:14-07:00" level=error msg="container_linux.go:346: starting container process caused \"exec: \\\"nvidia-smi\\\": executable file not found in $PATH\""
container_linux.go:346: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": OCI runtime command not found error
This is because the libraries and tools get installed by the hook in order to match the host drivers (an unfortunate limitation of tightly coupled driver+library releases).
I think this is a configuration issue and not an issue with the container image (docker.io/nvidia/cuda vs nvcr.io/nvidia/cuda).
Reviewing my earlier posts, I recommend changing my 01-nvhook.json and removing the NVIDIA_REQUIRE_CUDA=cuda>=10.1 environment variable from it. My assumption was that everyone has the latest CUDA install, which was kind of a silly assumption on my part. The CUDA version doesn't have to be specified, and you can leave this environment variable out of your setup. It was an artifact of my earlier experiments.
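For reference, a minimal sketch of what such a hook JSON can look like in podman's OCI hooks 1.0.0 format; the path and args here are assumptions based on the nvidia-container-toolkit package, so adjust them to whatever hook binary your install actually provides:
{
  "version": "1.0.0",
  "hook": {
    "path": "/usr/bin/nvidia-container-toolkit",
    "args": ["nvidia-container-toolkit", "prestart"],
    "env": []
  },
  "when": {
    "always": true,
    "commands": [".*"]
  },
  "stages": ["prestart"]
}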
@nvjmayo we started from scratch with a new machine (CentOS Linux release 7.7.1908) and both docker.io and nvcr.io images are working for us now too. And --hooks-dir must now be specified for both to work. Thanks for the help!
@rhatdan @nvjmayo Turns out that getting rootless podman working with nvidia on centos 7 is a bit more complicated, at least for us.
Here is our scenario on a brand new CentOS 7.7 machine:
1. Run nvidia-smi with rootless podman. Result: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 0 caused \\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: cuda error: unknown error\\n\\"\"
2. Run podman with user=root. Result: nvidia-smi works
3. Run podman rootless. Result: nvidia-smi works!
4. Reboot the machine, run podman rootless. Result: fails again with the same error as in step 1
Conclusion: running the nvidia container with podman as root changes the environment so that rootless works. The environment is cleared on reboot.
One other comment: podman as root and rootless podman cannot run with the same /etc/nvidia-container-runtime/config.toml; no-cgroups must be false for root and true for rootless.
If the nvidia hook is doing any privileged operations like modifying /dev and adding device nodes, then this will not work with rootless. (In rootless mode, all processes run with the user's UID. Probably when you run rootful, the hook does the privileged operations, so the next time you run rootless, those activities do not need to be done.)
I would suggest that for rootless systems, the /dev and nvidia ops be done via a systemd unit file, so the system is preconfigured and the rootless jobs will work fine.
After running nvidia/cuda with rootful podman, the following exist:
crw-rw-rw-. 1 root root 195, 254 Oct 25 09:11 nvidia-modeset
crw-rw-rw-. 1 root root 195, 255 Oct 25 09:11 nvidiactl
crw-rw-rw-. 1 root root 195, 0 Oct 25 09:11 nvidia0
crw-rw-rw-. 1 root root 241, 1 Oct 25 09:11 nvidia-uvm-tools
crw-rw-rw-. 1 root root 241, 0 Oct 25 09:11 nvidia-uvm
None of these devices exist after boot. Running nvidia-smi rootless (no podman) creates:
crw-rw-rw-. 1 root root 195, 0 Oct 25 13:40 nvidia0
crw-rw-rw-. 1 root root 195, 255 Oct 25 13:40 nvidiactl
I created the other three entries using "sudo mknod -m 666 etc..." but that is not enough to run rootless. Something else is needed in the environment.
Running nvidia/cuda with rootful podman at boot would work, but it's not pretty.
Thanks for the suggestion
This behavior is documented in our installation guide: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-verifications
From a userns you can't mknod or use nvidia-modprobe. But if this binary is present and if it can be called in a context where setuid works, it's an option.
There is already nvidia-persistenced as a systemd unit file, but it won't load the nvidia_uvm kernel modules nor create the device files, IIRC.
Another option is to use udev rules, which is what Ubuntu is doing:
$ cat /lib/udev/rules.d/71-nvidia.rules
[...]
# Load and unload nvidia-uvm module
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/sbin/modprobe nvidia-uvm"
ACTION=="remove", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/sbin/modprobe -r nvidia-uvm"
# This will create the device nvidia device nodes
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-smi"
# Create the device node for the nvidia-uvm module
ACTION=="add", DEVPATH=="/module/nvidia_uvm", SUBSYSTEM=="module", RUN+="/sbin/create-uvm-dev-node"
Udev rules makes sense to me.
@flx42 sudo'ing the setup script in "4.5. Device Node Verification" is the only thing needed to get rootless nvidia/cuda containers running for us. It created the following devices:
crw-rw-rw-. 1 root root 195, 0 Oct 27 20:38 nvidia0
crw-rw-rw-. 1 root root 195, 255 Oct 27 20:38 nvidiactl
crw-rw-rw-. 1 root root 241, 0 Oct 27 20:38 nvidia-uvm
The udev file only created the first two and was not sufficient by itself. We'll go with a unit file for the setup script.
Many thanks for your help.
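For anyone else going the unit-file route, a rough sketch of what that could look like, assuming the section 4.5 device-node script has been saved as /usr/local/sbin/nvidia-device-nodes.sh (the script path and unit name are placeholders, and the script below is a condensed single-GPU variant of the one in the guide):
#!/bin/bash
# /usr/local/sbin/nvidia-device-nodes.sh -- condensed single-GPU sketch of the 4.5 script
/sbin/modprobe nvidia || exit 1
[ -e /dev/nvidia0 ]   || mknod -m 666 /dev/nvidia0 c 195 0
[ -e /dev/nvidiactl ] || mknod -m 666 /dev/nvidiactl c 195 255
/sbin/modprobe nvidia-uvm || exit 1
# nvidia-uvm gets a dynamic major number; read it from /proc/devices
D=$(grep nvidia-uvm /proc/devices | awk '{print $1}')
[ -e /dev/nvidia-uvm ] || mknod -m 666 /dev/nvidia-uvm c "$D" 0

# /etc/systemd/system/nvidia-device-nodes.service (hypothetical name)
[Unit]
Description=Create NVIDIA device nodes for rootless containers
After=systemd-modules-load.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/sbin/nvidia-device-nodes.sh

[Install]
WantedBy=multi-user.target
Enabling it with systemctl enable --now nvidia-device-nodes.service should make the nodes exist before any rootless container starts.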
Thanks guys, with insight from this issue and others, I was able to get podman working with my Quadro in EL7 using sudo podman run --privileged --rm --hooks-dir /usr/share/containers/oci/hooks.d docker.io/nvidia/cudagl:10.1-runtime-centos7 nvidia-smi
after installing the 'nvidia-container-toolkit' package.
Once the dust settles on how to get GPU support in rootless podman in EL7, a step-by-step guide would make for a great blog post and/or entry into the podman and/or nvidia documentation.
Hello @nvjmayo and @rhatdan. I'm wondering if there is an update on this issue or this one on how to access NVIDIA GPUs from containers run rootless with podman.
On RHEL8.1, with default /etc/nvidia-container-runtime/config.toml, and running containers with root, GPU access works as expected. Rootless does not work by default, it fails with cgroup related errors (as expected).
After modifying the config.toml file -- setting no-cgroups = true and changing the debug log file -- rootless works. However, these changes make GPU access fail in containers run as root, with error "Failed to initialize NVML: Unknown Error."
Please let me know if there is any recent documentation on how to do this beyond these two issues.
Steps to get it working on RHEL 8.1:
1. Make sure nvidia-smi works on the host.
2. Install nvidia-container-toolkit from the repos at:
baseurl=https://nvidia.github.io/libnvidia-container/centos7/$basearch
baseurl=https://nvidia.github.io/nvidia-container-runtime/centos7/$basearch
3. Modify /etc/nvidia-container-runtime/config.toml and change these values:
[nvidia-container-cli]
#no-cgroups = false
no-cgroups = true
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
debug = "~/.local/nvidia-container-runtime.log"
4. Run it rootless:
podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ nvidia/cuda:10.2-devel-ubi8 /usr/bin/nvidia-smi
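If it helps, those two baseurls can go into a single repo file before installing the toolkit. This is only a sketch; the gpgkey URLs are assumptions modeled on NVIDIA's published .repo files, so verify them against the official libnvidia-container.repo:
# /etc/yum.repos.d/nvidia-container.repo (sketch)
[libnvidia-container]
name=libnvidia-container
baseurl=https://nvidia.github.io/libnvidia-container/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey

[nvidia-container-runtime]
name=nvidia-container-runtime
baseurl=https://nvidia.github.io/nvidia-container-runtime/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/nvidia-container-runtime/gpgkey
Then install with: sudo yum install -y nvidia-container-toolkit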
/cc @dagrayvid
Thanks @jamescassell.
I repeated those steps on RHEL8.1, and nvidia-smi works as expected when running rootless. However, once those changes are made, I am unable to run nvidia-smi in a container run as root. Is this behaviour expected, or is there some change in CLI flags needed when running as root? Running as root did work before making these changes.
Is there a way to configure a system so that we can utilize GPUs with podman as root and non-root user?
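One workaround that may be worth trying (a sketch, not something verified in this thread): with no-cgroups = true the hook no longer whitelists the NVIDIA devices in the container's device cgroup, which is likely why rootful containers then fail with the NVML error. Passing the device nodes explicitly (or using --privileged) when running as root should compensate, while leaving the rootless setup untouched:
$ sudo podman run --rm \
    --security-opt=label=disable \
    --hooks-dir=/usr/share/containers/oci/hooks.d/ \
    --device /dev/nvidia0 --device /dev/nvidiactl \
    --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
    nvidia/cuda:10.2-devel-ubi8 /usr/bin/nvidia-smi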
I can't run rootless podman with the GPU, can someone help me?
docker run --runtime=nvidia --privileged nvidia/cuda nvidia-smi
works fine but
podman run --runtime=nvidia --privileged nvidia/cuda nvidia-smi
crashes, same for
sudo podman run --runtime=nvidia --privileged nvidia/cuda nvidia-smi
Output:
$ podman run --runtime=nvidia --privileged nvidia/cuda nvidia-smi
2020/04/03 13:34:52 ERROR: /usr/bin/nvidia-container-runtime: find runc path: exec: "runc": executable file not found in $PATH
Error: `/usr/bin/nvidia-container-runtime start e3ccb660bf27ce0858ee56476e58b53cd3dc900e8de80f08d10f3f844c0e9f9a` failed: exit status 1
But, runc exists:
$ whereis runc
runc: /usr/bin/runc
$ whereis docker-runc
docker-runc:
$ podman --version
podman version 1.8.2
$ cat ~/.config/containers/libpod.conf
# libpod.conf is the default configuration file for all tools using libpod to
# manage containers
# Default transport method for pulling and pushing for images
image_default_transport = "docker://"
# Paths to look for the conmon container manager binary.
# If the paths are empty or no valid path was found, then the `$PATH`
# environment variable will be used as the fallback.
conmon_path = [
"/usr/libexec/podman/conmon",
"/usr/local/libexec/podman/conmon",
"/usr/local/lib/podman/conmon",
"/usr/bin/conmon",
"/usr/sbin/conmon",
"/usr/local/bin/conmon",
"/usr/local/sbin/conmon",
"/run/current-system/sw/bin/conmon",
]
# Environment variables to pass into conmon
conmon_env_vars = [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
]
# CGroup Manager - valid values are "systemd" and "cgroupfs"
#cgroup_manager = "systemd"
# Container init binary
#init_path = "/usr/libexec/podman/catatonit"
# Directory for persistent libpod files (database, etc)
# By default, this will be configured relative to where containers/storage
# stores containers
# Uncomment to change location from this default
#static_dir = "/var/lib/containers/storage/libpod"
# Directory for temporary files. Must be tmpfs (wiped after reboot)
#tmp_dir = "/var/run/libpod"
tmp_dir = "/run/user/1000/libpod/tmp"
# Maximum size of log files (in bytes)
# -1 is unlimited
max_log_size = -1
# Whether to use chroot instead of pivot_root in the runtime
no_pivot_root = false
# Directory containing CNI plugin configuration files
cni_config_dir = "/etc/cni/net.d/"
# Directories where the CNI plugin binaries may be located
cni_plugin_dir = [
"/usr/libexec/cni",
"/usr/lib/cni",
"/usr/local/lib/cni",
"/opt/cni/bin"
]
# Default CNI network for libpod.
# If multiple CNI network configs are present, libpod will use the network with
# the name given here for containers unless explicitly overridden.
# The default here is set to the name we set in the
# 87-podman-bridge.conflist included in the repository.
# Not setting this, or setting it to the empty string, will use normal CNI
# precedence rules for selecting between multiple networks.
cni_default_network = "podman"
# Default libpod namespace
# If libpod is joined to a namespace, it will see only containers and pods
# that were created in the same namespace, and will create new containers and
# pods in that namespace.
# The default namespace is "", which corresponds to no namespace. When no
# namespace is set, all containers and pods are visible.
#namespace = ""
# Default infra (pause) image name for pod infra containers
infra_image = "k8s.gcr.io/pause:3.1"
# Default command to run the infra container
infra_command = "/pause"
# Determines whether libpod will reserve ports on the host when they are
# forwarded to containers. When enabled, when ports are forwarded to containers,
# they are held open by conmon as long as the container is running, ensuring that
# they cannot be reused by other programs on the host. However, this can cause
# significant memory usage if a container has many ports forwarded to it.
# Disabling this can save memory.
#enable_port_reservation = true
# Default libpod support for container labeling
# label=true
# The locking mechanism to use
lock_type = "shm"
# Number of locks available for containers and pods.
# If this is changed, a lock renumber must be performed (e.g. with the
# 'podman system renumber' command).
num_locks = 2048
# Directory for libpod named volumes.
# By default, this will be configured relative to where containers/storage
# stores containers.
# Uncomment to change location from this default.
#volume_path = "/var/lib/containers/storage/volumes"
# Selects which logging mechanism to use for Podman events. Valid values
# are `journald` or `file`.
# events_logger = "journald"
# Specify the keys sequence used to detach a container.
# Format is a single character [a-Z] or a comma separated sequence of
# `ctrl-<value>`, where `<value>` is one of:
# `a-z`, `@`, `^`, `[`, `\`, `]`, `^` or `_`
#
# detach_keys = "ctrl-p,ctrl-q"
# Default OCI runtime
runtime = "runc"
# List of the OCI runtimes that support --format=json. When json is supported
# libpod will use it for reporting nicer errors.
runtime_supports_json = ["crun", "runc"]
# List of all the OCI runtimes that support --cgroup-manager=disable to disable
# creation of CGroups for containers.
runtime_supports_nocgroups = ["crun"]
# Paths to look for a valid OCI runtime (runc, runv, etc)
# If the paths are empty or no valid path was found, then the `$PATH`
# environment variable will be used as the fallback.
[runtimes]
runc = [
"/usr/bin/runc",
"/usr/sbin/runc",
"/usr/local/bin/runc",
"/usr/local/sbin/runc",
"/sbin/runc",
"/bin/runc",
"/usr/lib/cri-o-runc/sbin/runc",
"/run/current-system/sw/bin/runc",
]
crun = [
"/usr/bin/crun",
"/usr/sbin/crun",
"/usr/local/bin/crun",
"/usr/local/sbin/crun",
"/sbin/crun",
"/bin/crun",
"/run/current-system/sw/bin/crun",
]
nvidia = ["/usr/bin/nvidia-container-runtime"]
# Kata Containers is an OCI runtime, where containers are run inside lightweight
# Virtual Machines (VMs). Kata provides additional isolation towards the host,
# minimizing the host attack surface and mitigating the consequences of
# containers breakout.
# Please notes that Kata does not support rootless podman yet, but we can leave
# the paths below blank to let them be discovered by the $PATH environment
# variable.
# Kata Containers with the default configured VMM
kata-runtime = [
"/usr/bin/kata-runtime",
]
# Kata Containers with the QEMU VMM
kata-qemu = [
"/usr/bin/kata-qemu",
]
# Kata Containers with the Firecracker VMM
kata-fc = [
"/usr/bin/kata-fc",
]
# The [runtimes] table MUST be the last thing in this file.
# (Unless another table is added)
# TOML does not provide a way to end a table other than a further table being
# defined, so every key hereafter will be part of [runtimes] and not the main
# config.
$ cat /etc/nvidia-container-runtime/config.toml
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
debug = "/tmp/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
no-cgroups = true
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
debug = "/tmp/nvidia-container-runtime.log
$ cat /tmp/nvidia-container-runtime.log
2020/04/03 13:23:02 Running /usr/bin/nvidia-container-runtime
2020/04/03 13:23:02 Using bundle file: /home/andrews/.local/share/containers/storage/vfs-containers/614cb26f8f4719e3aba56be2e1a6dc29cd91ae760d9fe3bf83d6d1b24becc638/userdata/config.json
2020/04/03 13:23:02 prestart hook path: /usr/bin/nvidia-container-runtime-hook
2020/04/03 13:23:02 Prestart hook added, executing runc
2020/04/03 13:23:02 Looking for "docker-runc" binary
2020/04/03 13:23:02 "docker-runc" binary not found
2020/04/03 13:23:02 Looking for "runc" binary
2020/04/03 13:23:02 Runc path: /usr/bin/runc
2020/04/03 13:23:09 Running /usr/bin/nvidia-container-runtime
2020/04/03 13:23:09 Command is not "create", executing runc doing nothing
2020/04/03 13:23:09 Looking for "docker-runc" binary
2020/04/03 13:23:09 "docker-runc" binary not found
2020/04/03 13:23:09 Looking for "runc" binary
2020/04/03 13:23:09 ERROR: find runc path: exec: "runc": executable file not found in $PATH
2020/04/03 13:31:06 Running nvidia-container-runtime
2020/04/03 13:31:06 Command is not "create", executing runc doing nothing
2020/04/03 13:31:06 Looking for "docker-runc" binary
2020/04/03 13:31:06 "docker-runc" binary not found
2020/04/03 13:31:06 Looking for "runc" binary
2020/04/03 13:31:06 Runc path: /usr/bin/runc
$ nvidia-container-runtime --version
runc version 1.0.0-rc8
commit: 425e105d5a03fabd737a126ad93d62a9eeede87f
spec: 1.0.1-dev
NVRM version: 440.64.00
CUDA version: 10.2
Device Index: 0
Device Minor: 0
Model: GeForce RTX 2070
Brand: GeForce
GPU UUID: GPU-22dfd02e-a668-a6a6-a90a-39d6efe475ee
Bus Location: 00000000:01:00.0
Architecture: 7.5
$ docker version
Client:
Version: 18.09.7
API version: 1.39
Go version: go1.10.8
Git commit: 2d0083d
Built: Thu Jun 27 17:56:23 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.8
API version: 1.40 (minimum version 1.12)
Go version: go1.12.17
Git commit: afacb8b7f0
Built: Wed Mar 11 01:24:19 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.2.6
GitCommit: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc:
Version: 1.0.0-rc8
GitCommit: 425e105d5a03fabd737a126ad93d62a9eeede87f
docker-init:
Version: 0.18.0
GitCommit: fec3683
See particularly step 4. https://github.com/NVIDIA/nvidia-container-runtime/issues/85#issuecomment-604931556
This looks like the nvidia plugin is searching for a hard-coded path to runc?
[updated] Hi @jamescassell, unfortunately it does not work for me (same error using sudo).
$ podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ --runtime=nvidia nvidia/cuda nvidia-smi
2020/04/03 17:33:06 ERROR: /usr/bin/nvidia-container-runtime: find runc path: exec: "runc": executable file not found in $PATH
2020/04/03 17:33:06 ERROR: /usr/bin/nvidia-container-runtime: find runc path: exec: "runc": executable file not found in $PATH
Error: `/usr/bin/nvidia-container-runtime start 060398d97299ee033e8ebd698a11c128bd80ce641dd389976ca43a34b26abab3` failed: exit status 1
Hi @jamescassell , unfortunately do not work for me.
$ podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ nvidia/cuda nvidia-smi Error: container_linux.go:345: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": OCI runtime command not found error
Did you make the other changes described? I'd hit the same error until making the config changes.
@jamescassell yes, see https://github.com/NVIDIA/nvidia-container-runtime/issues/85#issuecomment-608469598
Not sure if it's relevant but looks like you're missing a quote: debug = "/tmp/nvidia-container-runtime.log
@jamescassell
$ sudo nano /etc/nvidia-container-runtime/config.toml
I think this is a podman issue. Podman is not passing $PATH down to conmon when it executes it.
https://github.com/containers/libpod/pull/5712
I am not sure if conmon then passes the PATH environment down to the OCI runtime either.
@rhatdan yes , I will check this PR https://github.com/containers/libpod/pull/5712 Thanks
I had a major issue with this error message popping up when trying to change my container user id while adding the hook that was made to fix the rootless problem.
Error: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: cuda error: unknown error\\\\n\\\"\"": OCI runtime error
But I've since learned that this particular behavior is quite quirky: where I thought I had pinpointed it, it now seems to work if there is a call to the container using sudo (that container wouldn't work, but the subsequent command did). Eagerly awaiting an update where the root (no pun intended) of this nvidia container problem gets addressed.
Hi @rhatdan, answering your previous question https://github.com/containers/libpod/pull/5712#issuecomment-608516075: I was able to install the new version of podman and it works fine with my GPU. However, I am getting this strange behavior at the end of the execution, please see:
andrews@deeplearning:~/Projects$ podman run -it --rm --runtime=nvidia --privileged nvidia/cuda:10.0-cudnn7-runtime nvidia-smi
Mon May 18 21:30:17 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82 Driver Version: 440.82 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2070 Off | 00000000:01:00.0 On | N/A |
| 37% 30C P8 9W / 175W | 166MiB / 7979MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
2020/05/18 23:30:18 ERROR: /usr/bin/nvidia-container-runtime: find runc path: exec: "runc": executable file not found in $PATH
ERRO[0003] Error removing container 672a332467da4e91d8ac2fdc8f3c2973a808321341c2d80caa8d0ecad4f0db65: error removing container 672a332467da4e91d8ac2fdc8f3c2973a808321341c2d80caa8d0ecad4f0db65 from runtime: `/usr/bin/nvidia-container-runtime delete --force 672a332467da4e91d8ac2fdc8f3c2973a808321341c2d80caa8d0ecad4f0db65` failed: exit status 1
andrews@deeplearning:~$ podman --version
podman version 1.9.2
andrews@deeplearning:~$ cat /tmp/nvidia-container-runtime.log
2020/05/18 23:47:47 Running /usr/bin/nvidia-container-runtime
2020/05/18 23:47:47 Using bundle file: /home/andrews/.local/share/containers/storage/vfs-containers/3add1cc2bcb9cecde045877d9a0e4d3ed9f64d304cd5cb07fd0e072bf163a170/userdata/config.json
2020/05/18 23:47:47 prestart hook path: /usr/bin/nvidia-container-runtime-hook
2020/05/18 23:47:47 Prestart hook added, executing runc
2020/05/18 23:47:47 Looking for "docker-runc" binary
2020/05/18 23:47:47 Runc path: /usr/bin/docker-runc
2020/05/18 23:47:48 Running /usr/bin/nvidia-container-runtime
2020/05/18 23:47:48 Command is not "create", executing runc doing nothing
2020/05/18 23:47:48 Looking for "docker-runc" binary
2020/05/18 23:47:48 Runc path: /usr/bin/docker-runc
2020/05/18 23:47:48 Running /usr/bin/nvidia-container-runtime
2020/05/18 23:47:48 Command is not "create", executing runc doing nothing
2020/05/18 23:47:48 Looking for "docker-runc" binary
2020/05/18 23:47:48 "docker-runc" binary not found
2020/05/18 23:47:48 Looking for "runc" binary
2020/05/18 23:47:48 ERROR: find runc path: exec: "runc": executable file not found in $PATH
andrews@deeplearning:~$ nvidia-container-runtime --version
runc version 1.0.0-rc10
commit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
spec: 1.0.1-dev
andrews@deeplearning:~$ whereis runc
runc: /usr/bin/runc
andrews@deeplearning:~$ whereis docker-runc
docker-runc: /usr/bin/docker-runc
do you know what it can be?
The error you are getting looks like the $PATH was not being passed into your OCI runtime?
Yes, it's strange...
- Modify /etc/nvidia-container-runtime/config.toml and change these values: ...
- Run it rootless as podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ nvidia/cuda:10.2-devel-ubi8 /usr/bin/nvidia-smi
This did the trick for me, thanks. I'm pondering the user/process isolation ramifications of these changes on a multi-user system. Hopefully, RH/NVDA can get this as elegant as Docker's --gpus=all without significantly degrading the security benefits of rootless podman over docker...
If you leave SELinux enabled, what AVCs are you seeing?
Amazing work! I was able to run GPU-enabled containers on Fedora 32 using the centos8 repos, only modifying /etc/nvidia-container-runtime/config.toml to set no-cgroups = true. I was wondering, what are the implications of not using the hooks-dir?
Thanks
Update: checking a tensorflow image, it works flawlessly with rootless podman version 1.9.3.
For anyone who is looking to have rootless "nvidia-docker" be more or less seamless with podman I would suggest the following changes:
$ cat ~/.config/containers/libpod.conf
hooks_dir = ["/usr/share/containers/oci/hooks.d", "/etc/containers/oci/hooks.d"]
label = false
$ grep no-cgroups /etc/nvidia-container-runtime/config.toml
no-cgroups = true
After the above changes on Fedora 32 I can run nvidia-smi using just:
$ podman run -it --rm nvidia/cuda:10.2-base nvidia-smi
Fri Jun 26 22:49:50 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN RTX Off | 00000000:08:00.0 On | N/A |
| 41% 35C P8 5W / 280W | 599MiB / 24186MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
The only annoyance is needing to edit /etc/nvidia-container-runtime/config.toml whenever there is a package update for nvidia-container-toolkit, which fortunately doesn't happen too often. If there were some way to make changes to config.toml persistent across updates, or a user config file (without using some hack like chattr +i), then this process would be really smooth.
Maybe in the future a more targeted approach for disabling SELinux will come along that is more secure than just disabling labeling completely for lazy people like myself. I only run a few GPU-based containers here and there so I'm personally not too concerned.
@zeroepoch You can add an SELinux policy, see here: https://github.com/mjlbach/podman_ml_containers/blob/master/selinux.sh
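Roughly, that script boils down to installing a custom policy module and then running with the type it defines. A sketch, assuming you have built or downloaded the nvidia-container.pp module from the DGX SELinux repository it references:
$ sudo semodule -i nvidia-container.pp
$ sudo restorecon -Rv /dev/nvidia*
$ podman run --rm --security-opt label=type:nvidia_container_t \
    --hooks-dir=/usr/share/containers/oci/hooks.d/ \
    nvidia/cuda:10.2-base nvidia-smi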
The instructions here worked for me on Fedora 32, however the problem reappears if I specify --userns keep-id:
Error: error executing hook `/usr/bin/nvidia-container-toolkit` (exit code: 1): OCI runtime error
Is that expected behaviour?
The instructions here worked for me on Fedora 32, however the problem reappears if I specify --userns keep-id: Error: error executing hook `/usr/bin/nvidia-container-toolkit` (exit code: 1): OCI runtime error
Is that expected behaviour?
Make sure you have modified the file at /etc/nvidia-container-runtime/config.toml.
Every time the nvidia-container packages are updated, it will reset to the default values and you should change:
#no-cgroups=false
no-cgroups = true
@Davidnet Even after the above modification, I am able to reproduce @invexed's error if I try to run the cuda-11 containers. Note the latest tag currently points to cuda 11.
$ podman run --rm --security-opt=label=disable nvidia/cuda:11.0-base-rc /usr/bin/nvidia-smi
Error: error executing hook `/usr/bin/nvidia-container-toolkit` (exit code: 1): OCI runtime erro
But not when trying to run a cuda 10.2 container or lower
$ podman run --rm --security-opt=label=disable nvidia/cuda:10.2-base /usr/bin/nvidia-smi
Sun Jul 12 15:57:40 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 00000000:01:00.0 On | N/A |
| 0% 60C P0 37W / 230W | 399MiB / 8116MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Make sure you have modified the file at: /etc/nvidia-container-runtime/config.toml
Thanks for the reply. I have indeed modified this file. The container runs with podman run --rm --security-opt label=disable -u 0:0 container, but podman run --rm --security-opt label=disable --userns keep-id -u $(id -u):$(id -g) container results in the above error.
EDIT: I have CUDA 10.2 installed:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 960M Off | 00000000:01:00.0 Off | N/A |
| N/A 33C P8 N/A / N/A | 42MiB / 2004MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1565 G /usr/libexec/Xorg 20MiB |
| 0 2013 G /usr/libexec/Xorg 20MiB |
+-----------------------------------------------------------------------------+
EDIT: I have CUDA 10.2 installed:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 960M Off | 00000000:01:00.0 Off | N/A |
| N/A 33C P8 N/A / N/A | 42MiB / 2004MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1565 G /usr/libexec/Xorg 20MiB |
| 0 2013 G /usr/libexec/Xorg 20MiB |
+-----------------------------------------------------------------------------+
You need a 450 driver to run CUDA 11.0 containers. The host CUDA version (or even none at all) doesn't matter, but the driver version does when running a CUDA container. nvidia-docker makes this error more obvious compared to podman. After updating your driver you should be able to run the container.
You need a 450 driver to run CUDA 11.0 containers. The host CUDA version (or even none at all) doesn't matter, but the driver version does when running a CUDA container. nvidia-docker makes this error more obvious compared to podman. After updating your driver you should be able to run the container.
Apologies for the confusion, but I'm actually trying to run a CUDA 10.0.130 container. Updating the driver may fix @mjlbach's problem though.
To be more precise, I'm installing CUDA via https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux within an image based on archlinux.
podman run --rm --security-opt label=disable -u $(id -u):$(id -g) --userns keep-id container triggers Error: error executing hook `/usr/bin/nvidia-container-toolkit` (exit code: 1): OCI runtime error, but podman run --rm --security-opt label=disable -u 0:0 container does not. The problem seems to be related to the specification of --userns keep-id.
You can add an SELinux policy, see here: https://github.com/mjlbach/podman_ml_containers/blob/master/selinux.sh
Interesting, per the link in that script to the DGX project, looks like nVidia has already solved SELinux woes on EL7 with nvidia-container. There are plenty of warnings in that project about how it has only been tested on DGX running EL7; would be great if nVidia made this policy available for general use with EL7/EL8 and bundled it inside the nvidia-container-runtime package(s).
That should allow us to use rootless podman with GPU acceleration without --security-opt label=disable, but I don't know the security implications of said policy...
UPDATE: Requested that the DGX selinux update be made part of this package in Issue NVIDIA/nvidia-docker#121
Hi folks, I've hit the same wall as another person: https://github.com/NVIDIA/nvidia-container-toolkit/issues/182. Any idea why that would happen?
@zeroepoch You can add an SELinux policy, see here: https://github.com/mjlbach/podman_ml_containers/blob/master/selinux.sh
I finally got around to trying this SELinux module and it worked. I still need to add --security-opt label=type:nvidia_container_t, but that should be more secure than disabling labels. What prompted this attempt to try again was that libpod.conf was deprecated and I was converting my settings to ~/.config/containers/containers.conf. I don't need anything in there anymore with this additional option. Now I just need to figure out how to make it the default, since I pretty much only run nvidia GPU containers.
For anyone who still wants to disable labels to make the CLI simpler, here are the contents of the containers.conf mentioned above:
[containers]
label = false
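For anyone making the same libpod.conf to containers.conf move with the hooks-dir setting as well, a sketch of the equivalent file (in containers.conf, hooks_dir lives under the [engine] table; only add it if you need a non-default hooks directory):
# ~/.config/containers/containers.conf
[containers]
label = false

[engine]
hooks_dir = ["/usr/share/containers/oci/hooks.d", "/etc/containers/oci/hooks.d"]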
Issue or feature description: rootless and rootful podman does not work with the nvidia plugin.
Steps to reproduce the issue: Install the nvidia plugin, configure it to run with podman, execute the podman command, and check if the devices are configured correctly.
Information to attach (optional if deemed irrelevant):
- Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
- Kernel version from uname -a: Fedora 30 and later
- Any relevant kernel output lines from dmesg
- Driver information from nvidia-smi -a
- Docker version from docker version
- NVIDIA packages version from dpkg -l 'nvidia' or rpm -qa 'nvidia'
- NVIDIA container library version from nvidia-container-cli -V
- NVIDIA container library logs (see troubleshooting)
- Docker command, image and tag used
I am reporting this based on other users complaining. This is what they said.
We discovered that the Ubuntu 18.04 machine needed a configuration change to get rootless working with nvidia: "no-cgroups = true" was set in /etc/nvidia-container-runtime/config.toml. Unfortunately this config change did not work on CentOS 7, but it did change the rootless error to: nvidia-container-cli: initialization error: cuda error: unknown error
This config change breaks podman running from root, with the error: Failed to initialize NVML: Unknown Error
Interestingly, root on ubuntu gets the same error even though rootless works.