NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes

Plugin does not detect Tegra device (Jetson Nano) #377

Open VladoPortos opened 1 year ago

VladoPortos commented 1 year ago

1. Issue or feature description

Guys, I'm losing my mind. I have a k3s cluster running on 3x Raspberry Pi CM4 and one Jetson Nano.

The nvidia runtime environment was detected when I installed k3s, and it was added to /var/lib/rancher/k3s/agent/etc/containerd/config.toml:

version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/k3s/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = false
  enable_unprivileged_icmp = false
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/var/lib/rancher/k3s/data/9088e57b1ba3c37820aaba60202af921dbc01b77ec0ad1e08be86b5c7bc9b8c1/bin"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

And I can detect the GPU from both Docker and containerd just fine:

docker run --rm --runtime nvidia xift/jetson_devicequery:r32.5.0

or:

ctr i pull docker.io/xift/jetson_devicequery:r32.5.0
ctr run --rm --gpus 0 --tty docker.io/xift/jetson_devicequery:r32.5.0 deviceQuery

Returns:

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.2 / 10.2
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3963 MBytes (4155203584 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS

When I then install the nvidia-device-plugin with:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml

the DaemonSet does not detect the GPU on the fourth node (the Jetson):

2023/02/03 10:16:37 Starting FS watcher.
2023/02/03 10:16:37 Starting OS watcher.
2023/02/03 10:16:37 Starting Plugins.
2023/02/03 10:16:37 Loading configuration.
2023/02/03 10:16:37 Updating config with default resource matching patterns.
2023/02/03 10:16:37 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "uuid"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
2023/02/03 10:16:37 Retreiving plugins.
2023/02/03 10:16:37 Detected non-NVML platform: could not load NVML: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
2023/02/03 10:16:37 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2023/02/03 10:16:37 Incompatible platform detected
2023/02/03 10:16:37 If this is a GPU node, did you configure the NVIDIA Container Toolkit?
2023/02/03 10:16:37 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2023/02/03 10:16:37 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2023/02/03 10:16:37 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
2023/02/03 10:16:37 No devices found. Waiting indefinitely.
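Those two "Detected non-..." lines are the plugin's platform probes: it first tries to load NVML (libnvidia-ml.so.1, which doesn't exist on a Jetson Nano) and then falls back to looking for Tegra marker files on the host. A quick sanity check to run on the Jetson itself (a sketch; exactly which marker file is probed varies, per the two plugin logs in this thread):

ldconfig -p | grep libnvidia-ml    # NVML library: expected to be absent on a Nano
ls -l /sys/devices/soc0/family     # Tegra marker probed in the failing log above
cat /etc/nv_tegra_release          # L4T marker reported in the passing log below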

I feel I'm close, but for the life of me I can't get this to work :(

I have tried to deploy the same image that worked locally, pinned to the Jetson node, but it fails:

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-query
spec:
  restartPolicy: OnFailure
  nodeSelector:
    node-type: jetson
  containers:
  - name: nvidia-query
    image: xift/jetson_devicequery:r32.5.0
    command: [ "./deviceQuery" ]
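(For that nodeSelector to match, the Jetson node must carry the node-type=jetson label; presumably something along these lines was applied beforehand — the node name is an assumption based on the host names in this thread:)

kubectl label node cube04 node-type=jetson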
root@cube01:~# kubectl logs nvidia-query
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

It almost feels like there are two containerds: one that works when I use ctr on that node, and a separate one for k3s, or something... I can't explain why using the same containerd engine produces two different results.
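As it happens, two containerd instances can genuinely coexist here: k3s runs its own embedded containerd (note the -a /run/k3s/containerd/containerd.sock flag in the k3s-agent unit status further down), while a standalone ctr defaults to /run/containerd/containerd.sock. To repeat the ctr test against the engine k3s actually uses, something like this sketch works; though, as the next comment explains, even this path bypasses the CRI plugin, so it still won't reproduce the Kubernetes behavior:

ctr --address /run/k3s/containerd/containerd.sock --namespace k8s.io i pull docker.io/xift/jetson_devicequery:r32.5.0
ctr --address /run/k3s/containerd/containerd.sock --namespace k8s.io run --rm --gpus 0 --tty docker.io/xift/jetson_devicequery:r32.5.0 deviceQuery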

3. Information to attach (optional if deemed irrelevant)

vladoportos@cube04:/sys/devices/gpu.0$ dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                                       Version                            Architecture                       Description
+++-==========================================================-==================================-==================================-=========================================================================================================================
un  libgldispatch0-nvidia                                      <none>                             <none>                             (no description available)
ii  libnvidia-container-tools                                  1.7.0-1                            arm64                              NVIDIA container runtime library (command-line tools)
ii  libnvidia-container0:arm64                                 0.10.0+jetpack                     arm64                              NVIDIA container runtime library
ii  libnvidia-container1:arm64                                 1.7.0-1                            arm64                              NVIDIA container runtime library
un  nvidia-304                                                 <none>                             <none>                             (no description available)
un  nvidia-340                                                 <none>                             <none>                             (no description available)
un  nvidia-384                                                 <none>                             <none>                             (no description available)
un  nvidia-common                                              <none>                             <none>                             (no description available)
ii  nvidia-container-csv-cuda                                  10.2.460-1                         arm64                              Jetpack CUDA CSV file
ii  nvidia-container-csv-cudnn                                 8.2.1.32-1+cuda10.2                arm64                              Jetpack CUDNN CSV file
ii  nvidia-container-csv-tensorrt                              8.2                                arm64                              Jetpack TensorRT CSV file
ii  nvidia-container-runtime                                   3.7.0-1                            all                                NVIDIA container runtime
un  nvidia-container-runtime-hook                              <none>                             <none>                             (no description available)
ii  nvidia-container-toolkit                                   1.7.0-1                            arm64                              NVIDIA container runtime hook
un  nvidia-docker                                              <none>                             <none>                             (no description available)
ii  nvidia-docker2                                             2.8.0-1                            all                                nvidia-docker CLI wrapper
ii  nvidia-l4t-3d-core                                         32.7.3-20221122092935              arm64                              NVIDIA GL EGL Package
ii  nvidia-l4t-apt-source                                      32.7.3-20221122092935              arm64                              NVIDIA L4T apt source list debian package
ii  nvidia-l4t-bootloader                                      32.7.3-20221122092935              arm64                              NVIDIA Bootloader Package
ii  nvidia-l4t-camera                                          32.7.3-20221122092935              arm64                              NVIDIA Camera Package
un  nvidia-l4t-ccp-t210ref                                     <none>                             <none>                             (no description available)
ii  nvidia-l4t-configs                                         32.7.3-20221122092935              arm64                              NVIDIA configs debian package
ii  nvidia-l4t-core                                            32.7.3-20221122092935              arm64                              NVIDIA Core Package
ii  nvidia-l4t-cuda                                            32.7.3-20221122092935              arm64                              NVIDIA CUDA Package
ii  nvidia-l4t-firmware                                        32.7.3-20221122092935              arm64                              NVIDIA Firmware Package
ii  nvidia-l4t-gputools                                        32.7.3-20221122092935              arm64                              NVIDIA dgpu helper Package
ii  nvidia-l4t-graphics-demos                                  32.7.3-20221122092935              arm64                              NVIDIA graphics demo applications
ii  nvidia-l4t-gstreamer                                       32.7.3-20221122092935              arm64                              NVIDIA GST Application files
ii  nvidia-l4t-init                                            32.7.3-20221122092935              arm64                              NVIDIA Init debian package
ii  nvidia-l4t-initrd                                          32.7.3-20221122092935              arm64                              NVIDIA initrd debian package
ii  nvidia-l4t-jetson-io                                       32.7.3-20221122092935              arm64                              NVIDIA Jetson.IO debian package
ii  nvidia-l4t-jetson-multimedia-api                           32.7.3-20221122092935              arm64                              NVIDIA Jetson Multimedia API is a collection of lower-level APIs that support flexible application development.
ii  nvidia-l4t-kernel                                          4.9.299-tegra-32.7.3-2022112209293 arm64                              NVIDIA Kernel Package
ii  nvidia-l4t-kernel-dtbs                                     4.9.299-tegra-32.7.3-2022112209293 arm64                              NVIDIA Kernel DTB Package
ii  nvidia-l4t-kernel-headers                                  4.9.299-tegra-32.7.3-2022112209293 arm64                              NVIDIA Linux Tegra Kernel Headers Package
ii  nvidia-l4t-libvulkan                                       32.7.3-20221122092935              arm64                              NVIDIA Vulkan Loader Package
ii  nvidia-l4t-multimedia                                      32.7.3-20221122092935              arm64                              NVIDIA Multimedia Package
ii  nvidia-l4t-multimedia-utils                                32.7.3-20221122092935              arm64                              NVIDIA Multimedia Package
ii  nvidia-l4t-oem-config                                      32.7.3-20221122092935              arm64                              NVIDIA OEM-Config Package
ii  nvidia-l4t-tools                                           32.7.3-20221122092935              arm64                              NVIDIA Public Test Tools Package
ii  nvidia-l4t-wayland                                         32.7.3-20221122092935              arm64                              NVIDIA Wayland Package
ii  nvidia-l4t-weston                                          32.7.3-20221122092935              arm64                              NVIDIA Weston Package
ii  nvidia-l4t-x11                                             32.7.3-20221122092935              arm64                              NVIDIA X11 Package
ii  nvidia-l4t-xusb-firmware                                   32.7.3-20221122092935              arm64                              NVIDIA USB Firmware Package
un  nvidia-libopencl1-dev                                      <none>                             <none>                             (no description available)
un  nvidia-prime                                               <none>                             <none>                             (no description available)
vladoportos@cube04:/sys/devices/gpu.0$ nvidia-container-cli -V
cli-version: 1.7.0
lib-version: 0.10.0+jetpack
build date: 2021-11-30T19:53+00:00
build revision: f37bb387ad05f6e501069d99e4135a97289faf1f
build compiler: aarch64-linux-gnu-gcc-7 7.5.0
build platform: aarch64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
root@cube04:~# systemctl status k3s-agent
● k3s-agent.service - Lightweight Kubernetes
   Loaded: loaded (/etc/systemd/system/k3s-agent.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2023-02-03 10:08:05 CET; 1h 16min ago
     Docs: https://k3s.io
  Process: 5445 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
  Process: 5440 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
  Process: 5418 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
 Main PID: 5446 (k3s-agent)
    Tasks: 116
   CGroup: /system.slice/k3s-agent.service
           ├─ 5446 /usr/local/bin/k3s agent
           ├─ 5478 containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/k3s/agent/containerd
           ├─ 7283 /containerd/var-k3s-rancher/k3s/data/9088e57b1ba3c37820aaba60202af921dbc01b77ec0ad1e08be86b5c7bc9b8c1/bin/containerd-shim-runc-v2 -namespace k8s.io -id dedf850196ac01cba261dd25152e1ec1081487e0027c9cd7335280b9046cb754 -address /run/k3s/co
           ├─ 7427 /containerd/var-k3s-rancher/k3s/data/9088e57b1ba3c37820aaba60202af921dbc01b77ec0ad1e08be86b5c7bc9b8c1/bin/containerd-shim-runc-v2 -namespace k8s.io -id 8ccc0ae4a3d12bd447314bcdddc78524412c67d12d4d21dabc4b43fc6c4e5557 -address /run/k3s/co
           ├─ 8325 /containerd/var-k3s-rancher/k3s/data/9088e57b1ba3c37820aaba60202af921dbc01b77ec0ad1e08be86b5c7bc9b8c1/bin/containerd-shim-runc-v2 -namespace k8s.io -id f4e8054e372b87f4799fad81b0aa15c187dc2abc1e64161c66950679048d3219 -address /run/k3s/co
           ├─ 9034 /containerd/var-k3s-rancher/k3s/data/9088e57b1ba3c37820aaba60202af921dbc01b77ec0ad1e08be86b5c7bc9b8c1/bin/containerd-shim-runc-v2 -namespace k8s.io -id caa99803264ae2123a9dcd0b0600dc314dfbe052ae21ffad53d0b91826379d3c -address /run/k3s/co
           ├─ 9223 /containerd/var-k3s-rancher/k3s/data/9088e57b1ba3c37820aaba60202af921dbc01b77ec0ad1e08be86b5c7bc9b8c1/bin/containerd-shim-runc-v2 -namespace k8s.io -id 207f091f8498b63b7f7e14d6533216c75728eb57ecae1d1dc358e7b8bcf9ad76 -address /run/k3s/co
           └─16746 /containerd/var-k3s-rancher/k3s/data/9088e57b1ba3c37820aaba60202af921dbc01b77ec0ad1e08be86b5c7bc9b8c1/bin/containerd-shim-runc-v2 -namespace k8s.io -id ebecb9818d2a4356fe7d60e601f45b6794477c7eb27c9455440691ce5a9d64ad -address /run/k3s/co

Feb 03 11:16:36 cube04 k3s[5446]: I0203 11:16:36.065894    5446 memory_manager.go:345] "RemoveStaleState removing state" podUID="fd3b03ff-05a8-4347-bd4e-e3b950a96921" containerName="nvidia-query"
Feb 03 11:16:36 cube04 k3s[5446]: I0203 11:16:36.065973    5446 memory_manager.go:345] "RemoveStaleState removing state" podUID="fd3b03ff-05a8-4347-bd4e-e3b950a96921" containerName="nvidia-query"
Feb 03 11:16:36 cube04 k3s[5446]: I0203 11:16:36.066050    5446 memory_manager.go:345] "RemoveStaleState removing state" podUID="fd3b03ff-05a8-4347-bd4e-e3b950a96921" containerName="nvidia-query"
Feb 03 11:16:36 cube04 k3s[5446]: I0203 11:16:36.066124    5446 memory_manager.go:345] "RemoveStaleState removing state" podUID="fd3b03ff-05a8-4347-bd4e-e3b950a96921" containerName="nvidia-query"
Feb 03 11:16:36 cube04 k3s[5446]: I0203 11:16:36.239639    5446 reconciler.go:357] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-kkbwv\" (UniqueName: \"kubernetes.io/projected/ac4fe6a7-039b-44db-b02f-b992f3463d0d-ku
Feb 03 11:16:36 cube04 k3s[5446]: I0203 11:16:36.241210    5446 reconciler.go:357] "operationExecutor.VerifyControllerAttachedVolume started for volume \"device-plugin\" (UniqueName: \"kubernetes.io/host-path/ac4fe6a7-039b-44db-b02f-b992f3463d0d-device-plu
Feb 03 11:18:05 cube04 k3s[5446]: W0203 11:18:05.787683    5446 sysinfo.go:203] Nodes topology is not available, providing CPU topology
Feb 03 11:18:05 cube04 k3s[5446]: W0203 11:18:05.790051    5446 machine.go:65] Cannot read vendor id correctly, set empty.
Feb 03 11:23:05 cube04 k3s[5446]: W0203 11:23:05.786897    5446 sysinfo.go:203] Nodes topology is not available, providing CPU topology
Feb 03 11:23:05 cube04 k3s[5446]: W0203 11:23:05.788848    5446 machine.go:65] Cannot read vendor id correctly, set empty.
klueska commented 1 year ago

Your containerd config is not setting nvidia as the default runtime. The only reason ctr works is that it goes through a different path (i.e. not the CRI plugin like Kubernetes does) and does not require nvidia to be set as the default runtime (it keys off the fact that you passed --gpus to know what to do with the NVIDIA tooling).
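One k3s-specific wrinkle when making that change: k3s regenerates /var/lib/rancher/k3s/agent/etc/containerd/config.toml on every restart, so persistent edits belong in a config.toml.tmpl file next to it. A minimal sketch:

# k3s rewrites config.toml at startup but honors config.toml.tmpl if present.
cp /var/lib/rancher/k3s/agent/etc/containerd/config.toml \
   /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
# In config.toml.tmpl, under [plugins."io.containerd.grpc.v1.cri".containerd], add:
#   default_runtime_name = "nvidia"
systemctl restart k3s-agent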

VladoPortos commented 1 year ago

@klueska Ah, OK, I edited the containerd config to use nvidia as the default. That moved me forward a bit, but it still fails:

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true
  default_runtime_name = "nvidia"
kubectl describe pod nvidia-query now shows the container crash-looping:

Name:             nvidia-query
Namespace:        default
Priority:         0
Service Account:  default
Node:             cube04/10.0.0.63
Start Time:       Fri, 03 Feb 2023 11:52:26 +0100
Labels:           <none>
Annotations:      <none>
Status:           Running
IP:               10.42.1.13
IPs:
  IP:  10.42.1.13
Containers:
  nvidia-query:
    Container ID:  containerd://a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07
    Image:         xift/jetson_devicequery:r32.5.0
    Image ID:      docker.io/xift/jetson_devicequery@sha256:8a4db3a25008e9ae2ce265b70389b53110b7625eaef101794af05433024c47ee
    Port:          <none>
    Host Port:     <none>
    Command:
      ./deviceQuery
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    StartError
      Message:   failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout:
src: /etc/vulkan/icd.d/nvidia_icd.json, src_lnk: /usr/lib/aarch64-linux-gnu/tegra/nvidia_icd.json, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/etc/vulkan/icd.d/nvidia_icd.json, dst_lnk: /usr/lib/aarch64-linux-gnu/tegra/nvidia_icd.json
src: /usr/lib/aarch64-linux-gnu/libcuda.so, src_lnk: tegra/libcuda.so, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libcuda.so, dst_lnk: tegra/libcuda.so
src: /usr/lib/aarch64-linux-gnu/libdrm_nvdc.so, src_lnk: tegra/libdrm.so.2, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libdrm_nvdc.so, dst_lnk: tegra/libdrm.so.2
src: /usr/lib/aarch64-linux-gnu/libv4l2.so.0.0.999999, src_lnk: tegra/libnvv4l2.so, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libv4l2.so.0.0.999999, dst_lnk: tegra/libnvv4l2.so
src: /usr/lib/aarch64-linux-gnu/libv4lconvert.so.0.0.999999, src_lnk: tegra/libnvv4lconvert.so, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libv4lconvert.so.0.0.999999, dst_lnk: tegra/libnvv4lconvert.so
src: /usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvargus.so, src_lnk: ../../../tegra/libv4l2_nvargus.so, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvargus.so, dst_lnk: ../../../tegra/libv4l2_nvargus.so
src: /usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvvidconv.so, src_lnk: ../../../tegra/libv4l2_nvvidconv.so, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvvidconv.so, dst_lnk: ../../../tegra/libv4l2_nvvidconv.so
src: /usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvvideocodec.so, src_lnk: ../../../tegra/libv4l2_nvvideocodec.so, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvvideocodec.so, dst_lnk: ../../../tegra/libv4l2_nvvideocodec.so
src: /usr/lib/aarch64-linux-gnu/libvulkan.so.1.2.141, src_lnk: tegra/libvulkan.so.1.2.141, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/libvulkan.so.1.2.141, dst_lnk: tegra/libvulkan.so.1.2.141
src: /usr/lib/aarch64-linux-gnu/tegra/libcuda.so, src_lnk: libcuda.so.1.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/a0a75a4c6ed43de9d1191d01308c3e00b296149c4091676bbecdea5bc02cae07/rootfs/usr/lib/aarch64-linux-gnu/tegra/libcuda.so, dst_lnk: libcuda.so.1.1

And the DaemonSet fails with:

src: /usr/lib/aarch64-linux-gnu/libcudnn_static.a, src_lnk: /etc/alternatives/libcudnn_stlib, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libcudnn_static.a, dst_lnk: /etc/alternatives/libcudnn_stlib
src: /usr/lib/aarch64-linux-gnu/libnvinfer.so.8, src_lnk: libnvinfer.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvinfer.so.8, dst_lnk: libnvinfer.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.8, src_lnk: libnvinfer_plugin.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.8, dst_lnk: libnvinfer_plugin.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvparsers.so.8, src_lnk: libnvparsers.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvparsers.so.8, dst_lnk: libnvparsers.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvonnxparser.so.8, src_lnk: libnvonnxparser.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvonnxparser.so.8, dst_lnk: libnvonnxparser.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvinfer.so, src_lnk: libnvinfer.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvinfer.so, dst_lnk: libnvinfer.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so, src_lnk: libnvinfer_plugin.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so, dst_lnk: libnvinfer_plugin.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvparsers.so, src_lnk: libnvparsers.so.8.2.1, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvparsers.so, dst_lnk: libnvparsers.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvonnxparser.so, src_lnk: libnvonnxparser.so.8, dst: /containerd/run-k3s/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvonnxparser.so, dst_lnk: libnvonnxparser.so.8
, stderr: nvidia-container-cli: mount error: open failed: /sys/fs/cgroup/devices/system.slice/k3s-agent.service/kubepods-besteffort-pod541c5001_1e8f_4e6a_9976_ffd80e364373.slice/devices.allow: no such file or directory: unknown
  Warning  BackOff  8s (x8 over 107s)  kubelet  Back-off restarting failed container
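The failing path in that message is a cgroup v1 devices-controller path; older libnvidia-container releases reportedly computed these nested systemd-driver paths incorrectly, which is what the toolkit-version question below addresses. Checking which cgroup hierarchy the node is on can help when comparing notes (a sketch):

stat -fc %T /sys/fs/cgroup/                    # tmpfs => cgroup v1, cgroup2fs => cgroup v2
ls /sys/fs/cgroup/devices/ 2>/dev/null | head  # the devices controller only exists on v1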
klueska commented 1 year ago

@elezar is this most recent error fixed by the new toolkit?

@VladoPortos while we wait for Evan to confirm, can you try installing the latest RC of the nvidia-container-toolkit (I believe it’s 1.12-rc.5) to see if this resolves your issue?
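On a Jetson that typically means enabling the experimental apt channel for the toolkit packages. A sketch, assuming the stock repo list file name used by the JetPack-era packages (verify the path on your node):

# Uncomment the experimental channel in the repo list, then install the RC.
sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/libnvidia-container.list
apt update
apt install nvidia-container-toolkit=1.12.0~rc.5-1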

VladoPortos commented 1 year ago

Holy Cow! It worked! Thanks so much!

I can confirm: after switching the apt repo to experimental and installing:

nvidia-container-toolkit (1.12.0~rc.5-1) 
nvidia-container-runtime (3.11.0-1)
nvidia-docker2 (2.11.0-1)

Now the container in k3s works and returns:

root@cube01:~# kubectl logs nvidia-query
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.2 / 10.2
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3963 MBytes (4155203584 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS

Same for the NVIDIA device plugin:

root@cube01:~# kubectl logs nvidia-device-plugin-daemonset-d7zj6 -n kube-system
2023/02/03 11:21:41 Starting FS watcher.
2023/02/03 11:21:41 Starting OS watcher.
2023/02/03 11:21:41 Starting Plugins.
2023/02/03 11:21:41 Loading configuration.
2023/02/03 11:21:41 Updating config with default resource matching patterns.
2023/02/03 11:21:41 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "uuid"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
2023/02/03 11:21:41 Retreiving plugins.
2023/02/03 11:21:41 Detected non-NVML platform: could not load NVML: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
2023/02/03 11:21:41 Detected Tegra platform: /etc/nv_tegra_release found
2023/02/03 11:21:41 Starting GRPC server for 'nvidia.com/gpu'
2023/02/03 11:21:41 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2023/02/03 11:21:41 Registered device plugin for 'nvidia.com/gpu' with Kubelet
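With the plugin registered, the node should now advertise the extended resource. A quick confirmation (a sketch, using the Jetson's node name from this thread):

kubectl get node cube04 -o jsonpath='{.status.capacity.nvidia\.com/gpu}'
kubectl describe node cube04 | grep -A2 'nvidia.com/gpu'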
klueska commented 1 year ago

Great to hear. We should actually be pushing the GA release of 1.12 later today, so you don’t have to run off the RC for long.

ByerRA commented 1 year ago

> Great to hear. We should actually be pushing the GA release of 1.12 later today, so you don’t have to run off the RC for long.

Any timeline on when the 1.12 release will happen? I don't see it when I do an apt update.

klueska commented 1 year ago

It was released last Friday.

elezar commented 1 year ago

Note that, looking at the initial logs you provided, you may have been using v1.7.0 of the NVIDIA Container Toolkit. This is quite an old version, and we greatly improved our support for Tegra-based systems with the v1.10.0 release. It should also be noted that in order to use the GPU device plugin on Tegra-based systems (specifically targeting the integrated GPUs), at least v1.11.0 of the NVIDIA Container Toolkit is required.

There are no Tegra-specific changes in the v1.12.0 release, so using the v1.11.0 release should be sufficient in this case.
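To see which toolkit versions apt can actually resolve on a node, and from which repository they come, a quick sketch:

apt-cache policy nvidia-container-toolkit
apt list -a nvidia-container-toolkit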

ByerRA commented 1 year ago

It appears that I have v1.7.0 of the NVIDIA Container Toolkit, and when I do an apt upgrade I'm not seeing any newer versions. How does one get one of the newer versions of the NVIDIA Container Toolkit so that this works with a Jetson Nano?

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.