NVIDIA / nvidia-container-runtime

NVIDIA container runtime

"Command is not "create", executing runc doing nothing" #83

Closed: Mercy811 closed this issue 1 year ago

Mercy811 commented 5 years ago

I ran an nmtwizard/opennmt-tf container, but it only ran for a few seconds and then stopped automatically.

Attached files:

/var/log/nvidia-container-runtime.log

2019/10/14 16:29:58 Running /usr/bin/nvidia-container-runtime
2019/10/14 16:29:58 Using bundle file: /run/containerd/io.containerd.runtime.v1.linux/moby/47e9f2194316d67a60bfb9ec9eebbdcdf723c66b41f34945125eaed9c08e9ddb/config.json
2019/10/14 16:29:58 prestart hook path: /usr/bin/nvidia-container-runtime-hook
2019/10/14 16:29:58 existing nvidia prestart hook in OCI spec file
2019/10/14 16:29:58 Prestart hook added, executing runc
2019/10/14 16:29:58 Looking for "docker-runc" binary
2019/10/14 16:29:58 "docker-runc" binary not found
2019/10/14 16:29:58 Looking for "runc" binary
2019/10/14 16:29:58 Runc path: /usr/bin/runc
2019/10/14 16:29:59 Running /usr/bin/nvidia-container-runtime
2019/10/14 16:29:59 Command is not "create", executing runc doing nothing
2019/10/14 16:29:59 Looking for "docker-runc" binary
2019/10/14 16:29:59 "docker-runc" binary not found
2019/10/14 16:29:59 Looking for "runc" binary
2019/10/14 16:29:59 Runc path: /usr/bin/runc
2019/10/14 16:29:59 Running /usr/bin/nvidia-container-runtime
2019/10/14 16:29:59 Command is not "create", executing runc doing nothing
2019/10/14 16:29:59 Looking for "docker-runc" binary
2019/10/14 16:29:59 "docker-runc" binary not found
2019/10/14 16:29:59 Looking for "runc" binary
2019/10/14 16:29:59 Runc path: /usr/bin/runc
2019/10/14 16:29:59 Running /usr/bin/nvidia-container-runtime
2019/10/14 16:29:59 Command is not "create", executing runc doing nothing
2019/10/14 16:29:59 Looking for "docker-runc" binary
2019/10/14 16:29:59 "docker-runc" binary not found
2019/10/14 16:29:59 Looking for "runc" binary
2019/10/14 16:29:59 Runc path: /usr/bin/runc
2019/10/14 16:29:59 Running /usr/bin/nvidia-container-runtime
2019/10/14 16:29:59 Command is not "create", executing runc doing nothing
2019/10/14 16:29:59 Looking for "docker-runc" binary
2019/10/14 16:29:59 "docker-runc" binary not found
2019/10/14 16:29:59 Looking for "runc" binary
2019/10/14 16:29:59 Runc path: /usr/bin/runc
2019/10/14 16:30:01 Running /usr/bin/nvidia-container-runtime
2019/10/14 16:30:01 Command is not "create", executing runc doing nothing
2019/10/14 16:30:01 Looking for "docker-runc" binary
2019/10/14 16:30:01 "docker-runc" binary not found
2019/10/14 16:30:01 Looking for "runc" binary
2019/10/14 16:30:01 Runc path: /usr/bin/runc
2019/10/14 16:30:01 Running /usr/bin/nvidia-container-runtime
2019/10/14 16:30:01 Command is not "create", executing runc doing nothing
2019/10/14 16:30:01 Looking for "docker-runc" binary
2019/10/14 16:30:01 "docker-runc" binary not found
2019/10/14 16:30:01 Looking for "runc" binary
2019/10/14 16:30:01 Runc path: /usr/bin/runc
2019/10/14 16:30:01 Running /usr/bin/nvidia-container-runtime
2019/10/14 16:30:01 Command is not "create", executing runc doing nothing
2019/10/14 16:30:01 Looking for "docker-runc" binary
2019/10/14 16:30:01 "docker-runc" binary not found
2019/10/14 16:30:01 Looking for "runc" binary
2019/10/14 16:30:01 Runc path: /usr/bin/runc

/etc/docker/daemon.json

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
andrewssobral commented 4 years ago

Same for me.

docker run --runtime=nvidia --privileged nvidia/cuda nvidia-smi works fine, but podman run --runtime=nvidia --privileged nvidia/cuda nvidia-smi fails:

$ podman run --runtime=nvidia --privileged nvidia/cuda nvidia-smi
2020/04/03 13:34:52 ERROR: /usr/bin/nvidia-container-runtime: find runc path: exec: "runc": executable file not found in $PATH
Error: `/usr/bin/nvidia-container-runtime start e3ccb660bf27ce0858ee56476e58b53cd3dc900e8de80f08d10f3f844c0e9f9a` failed: exit status 1
$ podman --version
podman version 1.8.2
$ cat ~/.config/containers/libpod.conf
# libpod.conf is the default configuration file for all tools using libpod to
# manage containers

# Default transport method for pulling and pushing for images
image_default_transport = "docker://"

# Paths to look for the conmon container manager binary.
# If the paths are empty or no valid path was found, then the `$PATH`
# environment variable will be used as the fallback.
conmon_path = [
            "/usr/libexec/podman/conmon",
            "/usr/local/libexec/podman/conmon",
            "/usr/local/lib/podman/conmon",
            "/usr/bin/conmon",
            "/usr/sbin/conmon",
            "/usr/local/bin/conmon",
            "/usr/local/sbin/conmon",
            "/run/current-system/sw/bin/conmon",
]

# Environment variables to pass into conmon
conmon_env_vars = [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
]

# CGroup Manager - valid values are "systemd" and "cgroupfs"
#cgroup_manager = "systemd"

# Container init binary
#init_path = "/usr/libexec/podman/catatonit"

# Directory for persistent libpod files (database, etc)
# By default, this will be configured relative to where containers/storage
# stores containers
# Uncomment to change location from this default
#static_dir = "/var/lib/containers/storage/libpod"

# Directory for temporary files. Must be tmpfs (wiped after reboot)
#tmp_dir = "/var/run/libpod"
tmp_dir = "/run/user/1000/libpod/tmp"

# Maximum size of log files (in bytes)
# -1 is unlimited
max_log_size = -1

# Whether to use chroot instead of pivot_root in the runtime
no_pivot_root = false

# Directory containing CNI plugin configuration files
cni_config_dir = "/etc/cni/net.d/"

# Directories where the CNI plugin binaries may be located
cni_plugin_dir = [
               "/usr/libexec/cni",
               "/usr/lib/cni",
               "/usr/local/lib/cni",
               "/opt/cni/bin"
]

# Default CNI network for libpod.
# If multiple CNI network configs are present, libpod will use the network with
# the name given here for containers unless explicitly overridden.
# The default here is set to the name we set in the
# 87-podman-bridge.conflist included in the repository.
# Not setting this, or setting it to the empty string, will use normal CNI
# precedence rules for selecting between multiple networks.
cni_default_network = "podman"

# Default libpod namespace
# If libpod is joined to a namespace, it will see only containers and pods
# that were created in the same namespace, and will create new containers and
# pods in that namespace.
# The default namespace is "", which corresponds to no namespace. When no
# namespace is set, all containers and pods are visible.
#namespace = ""

# Default infra (pause) image name for pod infra containers
infra_image = "k8s.gcr.io/pause:3.1"

# Default command to run the infra container
infra_command = "/pause"

# Determines whether libpod will reserve ports on the host when they are
# forwarded to containers. When enabled, when ports are forwarded to containers,
# they are held open by conmon as long as the container is running, ensuring that
# they cannot be reused by other programs on the host. However, this can cause
# significant memory usage if a container has many ports forwarded to it.
# Disabling this can save memory.
#enable_port_reservation = true

# Default libpod support for container labeling
# label=true

# The locking mechanism to use
lock_type = "shm"

# Number of locks available for containers and pods.
# If this is changed, a lock renumber must be performed (e.g. with the
# 'podman system renumber' command).
num_locks = 2048

# Directory for libpod named volumes.
# By default, this will be configured relative to where containers/storage
# stores containers.
# Uncomment to change location from this default.
#volume_path = "/var/lib/containers/storage/volumes"

# Selects which logging mechanism to use for Podman events.  Valid values
# are `journald` or `file`.
# events_logger = "journald"

# Specify the keys sequence used to detach a container.
# Format is a single character [a-Z] or a comma separated sequence of
# `ctrl-<value>`, where `<value>` is one of:
# `a-z`, `@`, `^`, `[`, `\`, `]`, `^` or `_`
#
# detach_keys = "ctrl-p,ctrl-q"

# Default OCI runtime
runtime = "runc"

# List of the OCI runtimes that support --format=json.  When json is supported
# libpod will use it for reporting nicer errors.
runtime_supports_json = ["crun", "runc"]

# List of all the OCI runtimes that support --cgroup-manager=disable to disable
# creation of CGroups for containers.
runtime_supports_nocgroups = ["crun"]

# Paths to look for a valid OCI runtime (runc, runv, etc)
# If the paths are empty or no valid path was found, then the `$PATH`
# environment variable will be used as the fallback.
[runtimes]
runc = [
            "/usr/bin/runc",
            "/usr/sbin/runc",
            "/usr/local/bin/runc",
            "/usr/local/sbin/runc",
            "/sbin/runc",
            "/bin/runc",
            "/usr/lib/cri-o-runc/sbin/runc",
            "/run/current-system/sw/bin/runc",
]

crun = [
                "/usr/bin/crun",
                "/usr/sbin/crun",
                "/usr/local/bin/crun",
                "/usr/local/sbin/crun",
                "/sbin/crun",
                "/bin/crun",
                "/run/current-system/sw/bin/crun",
]

nvidia = ["/usr/bin/nvidia-container-runtime"]

# Kata Containers is an OCI runtime, where containers are run inside lightweight
# Virtual Machines (VMs). Kata provides additional isolation towards the host,
# minimizing the host attack surface and mitigating the consequences of
# containers breakout.
# Please notes that Kata does not support rootless podman yet, but we can leave
# the paths below blank to let them be discovered by the $PATH environment
# variable.

# Kata Containers with the default configured VMM
kata-runtime = [
    "/usr/bin/kata-runtime",
]

# Kata Containers with the QEMU VMM
kata-qemu = [
    "/usr/bin/kata-qemu",
]

# Kata Containers with the Firecracker VMM
kata-fc = [
    "/usr/bin/kata-fc",
]

# The [runtimes] table MUST be the last thing in this file.
# (Unless another table is added)
# TOML does not provide a way to end a table other than a further table being
# defined, so every key hereafter will be part of [runtimes] and not the main
# config.
$ cat /etc/nvidia-container-runtime/config.toml
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
no-cgroups = true
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
debug = "/tmp/nvidia-container-runtime.log
$ cat /tmp/nvidia-container-runtime.log
2020/04/03 13:23:02 Running /usr/bin/nvidia-container-runtime
2020/04/03 13:23:02 Using bundle file: /home/andrews/.local/share/containers/storage/vfs-containers/614cb26f8f4719e3aba56be2e1a6dc29cd91ae760d9fe3bf83d6d1b24becc638/userdata/config.json
2020/04/03 13:23:02 prestart hook path: /usr/bin/nvidia-container-runtime-hook
2020/04/03 13:23:02 Prestart hook added, executing runc
2020/04/03 13:23:02 Looking for "docker-runc" binary
2020/04/03 13:23:02 "docker-runc" binary not found
2020/04/03 13:23:02 Looking for "runc" binary
2020/04/03 13:23:02 Runc path: /usr/bin/runc
2020/04/03 13:23:09 Running /usr/bin/nvidia-container-runtime
2020/04/03 13:23:09 Command is not "create", executing runc doing nothing
2020/04/03 13:23:09 Looking for "docker-runc" binary
2020/04/03 13:23:09 "docker-runc" binary not found
2020/04/03 13:23:09 Looking for "runc" binary
2020/04/03 13:23:09 ERROR: find runc path: exec: "runc": executable file not found in $PATH
2020/04/03 13:31:06 Running nvidia-container-runtime
2020/04/03 13:31:06 Command is not "create", executing runc doing nothing
2020/04/03 13:31:06 Looking for "docker-runc" binary
2020/04/03 13:31:06 "docker-runc" binary not found
2020/04/03 13:31:06 Looking for "runc" binary
2020/04/03 13:31:06 Runc path: /usr/bin/runc
$ nvidia-container-runtime --version
runc version 1.0.0-rc8
commit: 425e105d5a03fabd737a126ad93d62a9eeede87f
spec: 1.0.1-dev
NVRM version:   440.64.00
CUDA version:   10.2

Device Index:   0
Device Minor:   0
Model:          GeForce RTX 2070
Brand:          GeForce
GPU UUID:       GPU-22dfd02e-a668-a6a6-a90a-39d6efe475ee
Bus Location:   00000000:01:00.0
Architecture:   7.5
$ whereis runc
runc: /usr/bin/runc
$ whereis docker-runc
docker-runc:
$ docker version
Client:
 Version:           18.09.7
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        2d0083d
 Built:             Thu Jun 27 17:56:23 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.8
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       afacb8b7f0
  Built:            Wed Mar 11 01:24:19 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.6
  GitCommit:        894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc:
  Version:          1.0.0-rc8
  GitCommit:        425e105d5a03fabd737a126ad93d62a9eeede87f
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
PriamX commented 2 years ago

Getting these errors too. Every 5 minutes, the following shows up in /var/log/nvidia-container-runtime.log:

[root@mediaserv log]# tail -8 nvidia-container-runtime.log
2022/03/11 04:55:54 Running nvidia-container-runtime
2022/03/11 04:55:54 Command is not "create", executing runc doing nothing
2022/03/11 04:55:54 Looking for "docker-runc" binary
2022/03/11 04:55:54 Runc path: /usr/bin/docker-runc
2022/03/11 04:55:59 Running nvidia-container-runtime
2022/03/11 04:55:59 Command is not "create", executing runc doing nothing
2022/03/11 04:55:59 Looking for "docker-runc" binary
2022/03/11 04:55:59 Runc path: /usr/bin/docker-runc
[root@mediaserv log]#

However, the containers using the nvidia runtime seem to be working with the nvidia hardware just fine.

As a workaround, to keep it from filling my partition, I set up a logrotate rule to drop the log file daily:

[root@mediaserv ~]# cat /etc/logrotate.d/nvidia
/var/log/nvidia-container-runtime.log
{
    daily
    rotate 0
    nocreate
    notifempty
    missingok
}
[root@mediaserv ~]#
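
If you want to sanity-check a rule like this without waiting for the daily run, logrotate's debug mode does a dry run and only prints what it would do (example invocation; adjust the path to wherever your rule lives):

[root@mediaserv ~]# logrotate -d /etc/logrotate.d/nvidia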
elezar commented 2 years ago

These are not errors; they indicate that the nvidia-container-runtime is searching the path for the specified runc executables. @andrewssobral in the podman case you should ensure that one of these (i.e. runc) is on the path. We are looking at making this more configurable through a config option in the near future.
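
For example (the paths below are only illustrative), something like this confirms which runc a shell can discover, and a symlink into a directory on the runtime's PATH is one possible fix if nothing turns up:

$ command -v runc                                # the runc a plain shell would find, if any
$ sudo ln -s /usr/bin/runc /usr/local/bin/runc   # illustrative fix only; adjust to your layout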

@Mercy811 with regards to the Command is not "create" message that you're seeing in the logs: all runc (sub-)commands are forwarded by docker (or podman) to the nvidia-container-runtime, but only the create subcommand requires that the OCI spec be modified. We will look at whether we can improve the logging around this to make it less verbose.
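
In rough shell terms, the control flow looks something like the sketch below. This is only an illustration of the behaviour described above, not the actual implementation, and add_nvidia_prestart_hook is a hypothetical placeholder for the spec-modification step:

#!/bin/sh
# Illustrative sketch of how the shim treats the subcommands it receives.
case " $* " in
    *" create "*)
        # Only "create" needs the bundle's config.json patched with the
        # NVIDIA prestart hook before handing off to the real runtime.
        add_nvidia_prestart_hook "$@"   # hypothetical placeholder
        ;;
    *)
        # Everything else (start, delete, state, ...) is forwarded untouched,
        # which is what the 'Command is not "create", executing runc doing
        # nothing' log line refers to.
        ;;
esac
exec runc "$@"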

PriamX commented 2 years ago

@elezar Thank you for the explanation. I had assumed that "command is not create" was some kind of error. It's good to know it's working as designed. The verbose logging is a minor hassle (and no hassle at all now that logrotate is set up), but I do appreciate that you're looking at improving the logging.

elezar commented 2 years ago

We have just released v1.10.0-rc.2 of the NVIDIA Container Toolkit to our experimental repositories. This offers some enhancements relevant to this issue.

@Mercy811 could you check whether this addresses your original issue?

elezar commented 2 years ago

We have released v1.10.0, which improves the logging for the NVIDIA Container Runtime. Please test and close this issue if applicable.

elezar commented 1 year ago

For additional notes on using the NVIDIA Container Toolkit with Podman, please see https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-podman