VladoPortos opened this issue 1 year ago
Your containerd config is not setting nvidia as the default runtime. The only reason ctr works is that it goes through a different path (i.e. not the CRI plugin like Kubernetes does) and does not require nvidia to be set as the default runtime (it keys off the fact that you passed --gpus to know what to do with the nvidia tooling).
@klueska Ah, OK. I have edited the containerd config to use nvidia as the default runtime, which moved me forward a bit, but the pod still fails:
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
disable_snapshot_annotations = true
default_runtime_name = "nvidia"
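For default_runtime_name = "nvidia" to actually take effect, the nvidia runtime also has to be registered under the CRI runtimes table. A minimal sketch of that registration (the BinaryName path is an assumption for a typical Jetson install, not taken from this thread):

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    # Assumption: adjust to wherever nvidia-container-runtime is installed on the node
    BinaryName = "/usr/bin/nvidia-container-runtime"

Note that k3s regenerates this config.toml on every start, so persistent edits normally go into a config.toml.tmpl next to it.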
Name: nvidia-query
Namespace: default
Priority: 0
Service Account: default
Node: cube04/10.0.0.63
Start Time: Fri, 03 Feb 2023 11:52:26 +0100
Labels: <none>
Annotations: <none>
Status: Running
IP: 10.42.1.13
IPs:
IP: 10.42.1.13
Containers:
nvidia-query:
Container ID: containerd://a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07
Image: xift/jetson_devicequery:r32.5.0
Image ID: docker.io/xift/jetson_devicequery@sha256:8a4db3a25008e9ae2ce265b70389b53110b7625eaef101794af05433024c47ee
Port: <none>
Host Port: <none>
Command:
./deviceQuery
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: StartError
Message: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout:
src: /etc/vulkan/icd.d/nvidia_icd.json, src_lnk: /usr/lib/aarch64-linux-gnu/tegra/nvidia_icd.json, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/etc/vulkan/icd.d/nvidia_icd.json, dst_lnk: /usr/lib/aarch64-linux-gnu/tegra/nvidia_icd.json
src: /usr/lib/aarch64-linux-gnu/libcuda.so, src_lnk: tegra/libcuda.so, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libcuda.so, dst_lnk: tegra/libcuda.so
src: /usr/lib/aarch64-linux-gnu/libdrm_nvdc.so, src_lnk: tegra/libdrm.so.2, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libdrm_nvdc.so, dst_lnk: tegra/libdrm.so.2
src: /usr/lib/aarch64-linux-gnu/libv4l2.so.0.0.999999, src_lnk: tegra/libnvv4l2.so, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libv4l2.so.0.0.999999, dst_lnk: tegra/libnvv4l2.so
src: /usr/lib/aarch64-linux-gnu/libv4lconvert.so.0.0.999999, src_lnk: tegra/libnvv4lconvert.so, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libv4lconvert.so.0.0.999999, dst_lnk: tegra/libnvv4lconvert.so
src: /usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvargus.so, src_lnk: ../../../tegra/libv4l2_nvargus.so, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvargus.so, dst_lnk: ../../../tegra/libv4l2_nvargus.so
src: /usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvvidconv.so, src_lnk: ../../../tegra/libv4l2_nvvidconv.so, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvvidconv.so, dst_lnk: ../../../tegra/libv4l2_nvvidconv.so
src: /usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvvideocodec.so, src_lnk: ../../../tegra/libv4l2_nvvideocodec.so, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvvideocodec.so, dst_lnk: ../../../tegra/libv4l2_nvvideocodec.so
src: /usr/lib/aarch64-linux-gnu/libvulkan.so.1.2.141, src_lnk: tegra/libvulkan.so.1.2.141, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libvulkan.so.1.2.141, dst_lnk: tegra/libvulkan.so.1.2.141
src: /usr/lib/aarch64-linux-gnu/tegra/libcuda.so, src_lnk: libcuda.so.1.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/tegra/libcuda.so, dst_lnk: libcuda.so.1.1
And the DaemonSet fails with:
src: /usr/lib/aarch64-linux-gnu/libcudnn_static.a, src_lnk: /etc/alternatives/libcudnn_stlib, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libcudnn_static.a, dst_lnk: /etc/alternatives/libcudnn_stlib
src: /usr/lib/aarch64-linux-gnu/libnvinfer.so.8, src_lnk: libnvinfer.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvinfer.so.8, dst_lnk: libnvinfer.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.8, src_lnk: libnvinfer_plugin.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.8, dst_lnk: libnvinfer_plugin.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvparsers.so.8, src_lnk: libnvparsers.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvparsers.so.8, dst_lnk: libnvparsers.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvonnxparser.so.8, src_lnk: libnvonnxparser.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvonnxparser.so.8, dst_lnk: libnvonnxparser.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvinfer.so, src_lnk: libnvinfer.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvinfer.so, dst_lnk: libnvinfer.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so, src_lnk: libnvinfer_plugin.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so, dst_lnk: libnvinfer_plugin.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvparsers.so, src_lnk: libnvparsers.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvparsers.so, dst_lnk: libnvparsers.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvonnxparser.so, src_lnk: libnvonnxparser.so.8, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvonnxparser.so, dst_lnk: libnvonnxparser.so.8
, stderr: nvidia-container-cli: mount error: open failed: /sys/fs/cgroup/devices/system.slice/k3s-agent.service/kubepods-besteffort-pod541c5001_1e8f_4e6a_9976_ffd80e364373.slice/devices.allow: no such file or directory: unknown
Warning BackOff 8s (x8 over 107s) kubelet Back-off restarting failed container
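This devices.allow error usually means nvidia-container-cli is looking for a device-cgroup path that does not exist on the host, either because the node is on cgroup v2 (where there is no devices controller directory at all) or because the computed path does not match where the kubelet actually placed the container; newer toolkit releases handle both cases better. Whether that was the exact cause here is an assumption, but a quick check of the node's cgroup mode looks like this:

# "cgroup2fs" means cgroup v2; "tmpfs" means a classic cgroup v1 hierarchy
stat -fc %T /sys/fs/cgroup/
# On cgroup v1 the devices controller from the error message should be present
ls -d /sys/fs/cgroup/devices/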
@elezar is this most recent error fixed by the new toolkit?
@VladoPortos while we wait for Evan to confirm, can you try installing the latest RC of the nvidia-container-toolkit (I believe it’s 1.12-rc.5) to see if this resolves your issue.
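For anyone hitting the same thing, this is roughly what enabling the experimental apt repo and installing the RC looks like; the list file name and the k3s agent restart are assumptions based on the standard install instructions at the time, so check NVIDIA's current docs before copying:

# Uncomment the experimental entries in the existing repo list (file name may differ per setup)
sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit nvidia-container-runtime nvidia-docker2
# Restart the k3s agent so its containerd picks up the updated runtime
sudo systemctl restart k3s-agent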
Holy cow! It worked! Thanks soooo much.
I can confirm, updating the repo to experimental and installing:
nvidia-container-toolkit (1.12.0~rc.5-1)
nvidia-container-runtime (3.11.0-1)
nvidia-docker2 (2.11.0-1)
Now the container in k3s works and returns:
root@cube01:~# kubectl logs nvidia-query
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X1"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Capability Major/Minor version number: 5.3
Total amount of global memory: 3963 MBytes (4155203584 bytes)
( 1) Multiprocessors, (128) CUDA Cores/MP: 128 CUDA Cores
GPU Max Clock rate: 922 MHz (0.92 GHz)
Memory Clock rate: 13 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS
Same for the nvidia plugin:
root@cube01:~# kubectl logs nvidia-device-plugin-daemonset-d7zj6 -n kube-system
2023/02/03 11:21:41 Starting FS watcher.
2023/02/03 11:21:41 Starting OS watcher.
2023/02/03 11:21:41 Starting Plugins.
2023/02/03 11:21:41 Loading configuration.
2023/02/03 11:21:41 Updating config with default resource matching patterns.
2023/02/03 11:21:41
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": "envvar",
"deviceIDStrategy": "uuid"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
2023/02/03 11:21:41 Retreiving plugins.
2023/02/03 11:21:41 Detected non-NVML platform: could not load NVML: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
2023/02/03 11:21:41 Detected Tegra platform: /etc/nv_tegra_release found
2023/02/03 11:21:41 Starting GRPC server for 'nvidia.com/gpu'
2023/02/03 11:21:41 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2023/02/03 11:21:41 Registered device plugin for 'nvidia.com/gpu' with Kubelet
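Once the plugin registers like this, the GPU should also show up as an allocatable resource on the node. A quick way to confirm (node name taken from the pod description earlier; output formatting varies by Kubernetes version):

kubectl describe node cube04 | grep -i 'nvidia.com/gpu'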
Great to hear. We should actually be pushing the GA release of 1.12 later today, so you don’t have to run off the RC for long.
Any timeline on when the 1.12 release will happen? I don't see it when I do an apt update.
It was released last Friday.
Note that, looking at the initial logs you provided, you may have been using v1.7.0 of the NVIDIA Container Toolkit. This is quite an old version, and we greatly improved our support for Tegra-based systems with the v1.10.0 release. It should also be noted that in order to use the GPU Device Plugin on Tegra-based systems (specifically targeting the integrated GPUs), at least v1.11.0 of the NVIDIA Container Toolkit is required.
There are no Tegra-specific changes in the v1.12.0 release, so the v1.11.0 release should be sufficient in this case.
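If it is unclear which toolkit version a node is actually running, the installed components can be checked directly on the node (a simple sketch; package names match the ones listed earlier in this thread):

nvidia-container-cli --version
dpkg -l | grep -E 'nvidia-container|libnvidia-container'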
It appears that I have v1.7.0 of the NVIDIA Container Toolkit, and when I do an "apt upgrade" I'm not seeing any newer versions. How does one get one of the newer versions of the NVIDIA Container Toolkit that will allow this to work with the Jetson Nano?
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
1. Issue or feature description
Guys, I'm losing my mind. I have a k3s cluster running 3x RPi CM4 and one Jetson Nano.
The NVIDIA runtime was detected fine when I installed k3s and was added to
/var/lib/rancher/k3s/agent/etc/containerd/config.toml
And I can run and detect the GPU in docker and containerd just fine:
docker run --rm --runtime nvidia xift/jetson_devicequery:r32.5.0

or

ctr i pull docker.io/xift/jetson_devicequery:r32.5.0
ctr run --rm --gpus 0 --tty docker.io/xift/jetson_devicequery:r32.5.0 deviceQuery
Returns:
When I then install nvidia-device-plugin with:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml
The DaemonSets will not detect GPU on the fourth node.
I feel I'm close, but for the life of me I can't get this to work :(
I have tried to deploy the same image that worked locally, forcing it onto the Jetson node, but it fails:
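(The failure is the same StartError / hook error shown in the kubectl describe output near the top of this thread.) For context, here is a rough sketch of the kind of test pod manifest involved, reconstructed from that describe output; the nodeName pin to cube04 and the nvidia.com/gpu request are assumptions rather than the exact manifest used:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-query
spec:
  nodeName: cube04            # assumption: pin the pod to the Jetson node
  containers:
  - name: nvidia-query
    image: xift/jetson_devicequery:r32.5.0
    command: ["./deviceQuery"]
    resources:
      limits:
        nvidia.com/gpu: 1     # assumption: requires the device plugin to be advertising GPUs
EOF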
It almost feels like there are two containerd instances: one that works when I use ctr on that node, and a separate one for k3s, or something... I can't explain why using the same containerd engine produces two different results.
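For what it's worth, k3s ships its own embedded containerd with its own socket and config (/var/lib/rancher/k3s/agent/etc/containerd/config.toml), so a standalone ctr on the node can easily be talking to a different containerd than the one the kubelet uses. A way to compare the two directly (the socket paths are the usual defaults and may differ on a given install):

# System containerd, if one is installed alongside k3s
ctr --address /run/containerd/containerd.sock version
# k3s's embedded containerd; "k3s ctr" is a bundled shortcut for the same thing
ctr --address /run/k3s/containerd/containerd.sock version
k3s ctr version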
3. Information to attach (optional if deemed irrelevant)