NVIDIA / gpu-feature-discovery

GPU plugin to the node feature discovery for Kubernetes
Apache License 2.0

GFD returns 'no labels generated from any source' #36

Closed MichaelJendryke closed 7 months ago

MichaelJendryke commented 1 year ago

Dear all,

I have a setup of k3s and rancher on three nodes. One node has two Tesla T4 GPUs.

Running nvidia-smi on the node directly returns

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                        On | 00000000:41:00.0 Off |                    0 |
| N/A   34C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                        On | 00000000:A1:00.0 Off |                    0 |
| N/A   34C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

This tells me that the driver is installed correctly and that I can proceed with the k3s guide.

The content of /var/lib/rancher/k3s/agent/etc/containerd/config.toml is:

version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/k3s/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/var/lib/rancher/k3s/data/630c40ff866a3db218a952ebd4fd2a5cfe1543a1a467e738cb46a2ad4012d6f1/bin"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

The default_runtime_name = "nvidia" line is the one I added.

I continue with

  1. Node Feature Discovery (NFD)

    kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-feature-discovery/v0.7.0/deployments/static/nfd.yaml

    and also

    kubectl apply -k https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.12.1

    which shows the following in the NFD logs:

    REQUEST Node: geo-node1
    NFD-version: v0.6.0
    Labels: map[
    cpu-cpuid.ADX:true
    cpu-cpuid.AESNI:true
    cpu-cpuid.AVX:true
    cpu-cpuid.AVX2:true
    cpu-cpuid.FMA3:true
    cpu-cpuid.SHA:true
    cpu-cpuid.SSE4A:true
    cpu-hardware_multithreading:true
    cpu-rdt.RDTCMT:true
    cpu-rdt.RDTL3CA:true
    cpu-rdt.RDTMBM:true
    cpu-rdt.RDTMON:true
    iommu-enabled:true
    kernel-config.NO_HZ:true
    kernel-config.NO_HZ_IDLE:true
    kernel-version.full:5.15.0-67-generic
    kernel-version.major:5
    kernel-version.minor:15
    kernel-version.revision:0
    memory-numa:true
    nvidia.com/gfd.timestamp:1679476204
    pci-102b.present:true
    pci-10de.present:true
    pci-10de.sriov.capable:true
    storage-nonrotationaldisk:true
    system-os_release.ID:ubuntu
    system-os_release.VERSION_ID:22.04
    system-os_release.VERSION_ID.major:22
    system-os_release.VERSION_ID.minor:04
    ]

    This output mentions nvidia.com and pci-10de, suggesting that discovery was successful, as I do not get these entries on my non-GPU nodes.

  2. NVIDIA GPU Feature Discovery (GFD)

    kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-feature-discovery/v0.7.0/deployments/static/gpu-feature-discovery-daemonset.yaml

    After applying the above GFD DaemonSet, the logs show:

    2023/03/22 09:10:04 Starting OS watcher.
    2023/03/22 09:10:04 Loading configuration.
    2023/03/22 09:10:04 
    Running with config:
    {
      "version": "v1",
      "flags": {
        "migStrategy": "none",
        "failOnInitError": true,
        "gdsEnabled": null,
        "mofedEnabled": null,
        "gfd": {
          "oneshot": false,
          "noTimestamp": false,
          "sleepInterval": "1m0s",
          "outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
          "machineTypeFile": "/sys/class/dmi/id/product_name"
        }
      },
      "resources": {
        "gpus": null
      },
      "sharing": {
        "timeSlicing": {}
      }
    }
    2023/03/22 09:10:04 Detected non-NVML platform: could not load NVML: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
    2023/03/22 09:10:04 Detected non-Tegra platform: /sys/devices/soc0/family file not found
    2023/03/22 09:10:04 WARNING: No valid resources detected; using empty manager.
    2023/03/22 09:10:04 Start running
    2023/03/22 09:10:04 Warning: no labels generated from any source
    2023/03/22 09:10:04 Writing labels to output file
    2023/03/22 09:10:04 Sleeping for 60000000000
    2023/03/22 09:11:04 Warning: no labels generated from any source
    2023/03/22 09:11:04 Writing labels to output file
    2023/03/22 09:11:04 Sleeping for 60000000000

    It says that no labels were generated. Is this because of the warning "No valid resources detected; using empty manager"?

As NFD seems to work but GFD does not, I exec'd into the GFD DaemonSet pod and ran gpu-feature-discovery from the command line, but got the same output.

Note: I tried this with nvidia-container-toolkit 1.12.1 and 1.13.0-rc.2.

The NVIDIA-related packages installed on the node are:

+++-==================================-==========================-============-=========================================================
un  libgldispatch0-nvidia              <none>                     <none>       (no description available)
ii  libnvidia-cfg1-530:amd64           530.30.02-0ubuntu1         amd64        NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any                 <none>                     <none>       (no description available)
un  libnvidia-common                   <none>                     <none>       (no description available)
ii  libnvidia-common-530               530.30.02-0ubuntu1         all          Shared files used by the NVIDIA libraries
un  libnvidia-compute                  <none>                     <none>       (no description available)
rc  libnvidia-compute-515-server:amd64 515.86.01-0ubuntu0.22.04.2 amd64        NVIDIA libcompute package
ii  libnvidia-compute-530:amd64        530.30.02-0ubuntu1         amd64        NVIDIA libcompute package
ii  libnvidia-container-tools          1.13.0~rc.2-1              amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64         1.13.0~rc.2-1              amd64        NVIDIA container runtime library
un  libnvidia-decode                   <none>                     <none>       (no description available)
ii  libnvidia-decode-530:amd64         530.30.02-0ubuntu1         amd64        NVIDIA Video Decoding runtime libraries
un  libnvidia-encode                   <none>                     <none>       (no description available)
ii  libnvidia-encode-530:amd64         530.30.02-0ubuntu1         amd64        NVENC Video Encoding runtime library
un  libnvidia-extra                    <none>                     <none>       (no description available)
ii  libnvidia-extra-530:amd64          530.30.02-0ubuntu1         amd64        Extra libraries for the NVIDIA driver
un  libnvidia-fbc1                     <none>                     <none>       (no description available)
ii  libnvidia-fbc1-530:amd64           530.30.02-0ubuntu1         amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
un  libnvidia-gl                       <none>                     <none>       (no description available)
ii  libnvidia-gl-530:amd64             530.30.02-0ubuntu1         amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un  libnvidia-ml1                      <none>                     <none>       (no description available)
un  nvidia-384                         <none>                     <none>       (no description available)
un  nvidia-390                         <none>                     <none>       (no description available)
un  nvidia-common                      <none>                     <none>       (no description available)
un  nvidia-compute-utils               <none>                     <none>       (no description available)
rc  nvidia-compute-utils-515-server    515.86.01-0ubuntu0.22.04.2 amd64        NVIDIA compute utilities
ii  nvidia-compute-utils-530           530.30.02-0ubuntu1         amd64        NVIDIA compute utilities
un  nvidia-container-runtime           <none>                     <none>       (no description available)
un  nvidia-container-runtime-hook      <none>                     <none>       (no description available)
ii  nvidia-container-toolkit           1.13.0~rc.2-1              amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base      1.13.0~rc.2-1              amd64        NVIDIA Container Toolkit Base
rc  nvidia-dkms-515-server             515.86.01-0ubuntu0.22.04.2 amd64        NVIDIA DKMS package
ii  nvidia-dkms-530                    530.30.02-0ubuntu1         amd64        NVIDIA DKMS package
un  nvidia-dkms-kernel                 <none>                     <none>       (no description available)
ii  nvidia-driver-530                  530.30.02-0ubuntu1         amd64        NVIDIA driver metapackage
un  nvidia-driver-binary               <none>                     <none>       (no description available)
un  nvidia-fabricmanager               <none>                     <none>       (no description available)
ii  nvidia-fabricmanager-515           515.86.01-0ubuntu0.22.04.2 amd64        Fabric Manager for NVSwitch based systems.
un  nvidia-kernel-common               <none>                     <none>       (no description available)
rc  nvidia-kernel-common-515-server    515.86.01-0ubuntu0.22.04.2 amd64        Shared files used with the kernel module
ii  nvidia-kernel-common-530           530.30.02-0ubuntu1         amd64        Shared files used with the kernel module
un  nvidia-kernel-open                 <none>                     <none>       (no description available)
un  nvidia-kernel-open-530             <none>                     <none>       (no description available)
un  nvidia-kernel-source               <none>                     <none>       (no description available)
un  nvidia-kernel-source-515-server    <none>                     <none>       (no description available)
ii  nvidia-kernel-source-530           530.30.02-0ubuntu1         amd64        NVIDIA kernel source package
ii  nvidia-modprobe                    530.30.02-0ubuntu1         amd64        Load the NVIDIA kernel driver and create device files
un  nvidia-opencl-icd                  <none>                     <none>       (no description available)
un  nvidia-persistenced                <none>                     <none>       (no description available)
ii  nvidia-prime                       0.8.17.1                   all          Tools to enable NVIDIA's Prime
ii  nvidia-settings                    530.30.02-0ubuntu1         amd64        Tool for configuring the NVIDIA graphics driver
un  nvidia-settings-binary             <none>                     <none>       (no description available)
un  nvidia-smi                         <none>                     <none>       (no description available)
un  nvidia-utils                       <none>                     <none>       (no description available)
ii  nvidia-utils-530                   530.30.02-0ubuntu1         amd64        NVIDIA driver support binaries
ii  xserver-xorg-video-nvidia-530      530.30.02-0ubuntu1         amd64        NVIDIA binary Xorg driver
klueska commented 1 year ago

On k3s you need to update /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl not /var/lib/rancher/k3s/agent/etc/containerd/config.toml, otherwise your config will get overwritten.
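
For reference, here is a minimal sketch of such a template, under the assumption (per the k3s containerd docs) that k3s regenerates config.toml from config.toml.tmpl on every start, so the template is typically a verbatim copy of the generated config plus the NVIDIA additions:

# /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
version = 2

# ... all other sections copied unchanged from the generated config.toml shown above ...

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true
  default_runtime_name = "nvidia"  # only if the NVIDIA runtime should be the cluster-wide default

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"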

MichaelJendryke commented 1 year ago

Thanks for the answer @klueska

I found /etc/containerd/config.toml:

#   Copyright 2018-2022 Docker Inc.

#   Licensed under the Apache License, Version 2.0 (the "License");
#   you may not use this file except in compliance with the License.
#   You may obtain a copy of the License at

#       http://www.apache.org/licenses/LICENSE-2.0

#   Unless required by applicable law or agreed to in writing, software
#   distributed under the License is distributed on an "AS IS" BASIS,
#   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#   See the License for the specific language governing permissions and
#   limitations under the License.

disabled_plugins = ["cri"]

#root = "/var/lib/containerd"
#state = "/run/containerd"
#subreaper = true
#oom_score = 0

#[grpc]
#  address = "/run/containerd/containerd.sock"
#  uid = 0
#  gid = 0

#[debug]
#  address = "/run/containerd/debug.sock"
#  uid = 0
#  gid = 0
#  level = "info"

and /var/lib/rancher/k3s/agent/etc/containerd/config.toml:

version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/k3s/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/var/lib/rancher/k3s/data/630c40ff866a3db218a952ebd4fd2a5cfe1543a1a467e738cb46a2ad4012d6f1/bin"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

But as I did not have /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl, described here, I took the template from this blog post.

Restarting containerd overwrites /var/lib/rancher/k3s/agent/etc/containerd/config.toml with:

[plugins.opt]
  path = "/var/lib/rancher/k3s/agent/containerd"

[plugins.cri]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins.cri.cni]
  bin_dir = "/var/lib/rancher/k3s/data/630c40ff866a3db218a952ebd4fd2a5cfe1543a1a467e738cb46a2ad4012d6f1/bin"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"

[plugins.cri.containerd.runtimes.runc]
  # ---- changed from 'io.containerd.runc.v2' for GPU support
  runtime_type = "io.containerd.runtime.v1.linux"

# ---- added for GPU support
[plugins.linux]
  runtime = "nvidia-container-runtime"

But unfortunately the GFD DaemonSet does not start after that:

Warning  FailedCreatePodSandBox  14s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: cgroups: cgroup mountpoint does not exist: unknown

I guess the tmpl I found is outdated. Could you please point me to the docs to create this file?

klueska commented 1 year ago

https://docs.k3s.io/advanced#configuring-containerd

MichaelJendryke commented 1 year ago

I have tried to follow tutorials that do not set the default runtime to nvidia (e.g. this). Instead I am trying to follow this. I have modified the NFD, GFD, and NVIDIA device plugin YAML files to use runtimeClassName: nvidia, which results in the following output of kubectl describe node geo-node1 after GFD started.

Name:               geo-node1
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=k3s
                    beta.kubernetes.io/os=linux
                    egress.k3s.io/cluster=true
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.SHA=true
                    feature.node.kubernetes.io/cpu-cpuid.SSE4A=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/cpu-rdt.RDTCMT=true
                    feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMBM=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMON=true
                    feature.node.kubernetes.io/iommu-enabled=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
                    feature.node.kubernetes.io/kernel-version.full=5.15.0-67-generic
                    feature.node.kubernetes.io/kernel-version.major=5
                    feature.node.kubernetes.io/kernel-version.minor=15
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/memory-numa=true
                    feature.node.kubernetes.io/pci-102b.present=true
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-10de.sriov.capable=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=ubuntu
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=22
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
                    has_gpu=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=geo-node1
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=k3s
                    nvidia.com/cuda.driver.major=530
                    nvidia.com/cuda.driver.minor=30
                    nvidia.com/cuda.driver.rev=02
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=1
                    nvidia.com/gfd.timestamp=1679506166
                    nvidia.com/gpu.compute.major=7
                    nvidia.com/gpu.compute.minor=5
                    nvidia.com/gpu.count=2
                    nvidia.com/gpu.family=turing
                    nvidia.com/gpu.machine=PowerEdge-R7525
                    nvidia.com/gpu.memory=15360
                    nvidia.com/gpu.product=Tesla-T4
                    nvidia.com/gpu.replicas=1
                    nvidia.com/mig.capable=false
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"8a:ec:37:1f:e5:1a"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: XXX.XXX.XXX.XXX
                    k3s.io/hostname: geo-node1
                    k3s.io/internal-ip: XXX.XXX.XXX.XXX
                    k3s.io/node-args: ["agent"]
                    k3s.io/node-config-hash: FZCHZFCL5KBSRTBWGCGIBHGDW6FDW2LIRBXCGNWFODJI3CKLHCOQ====
                    k3s.io/node-env:
                      {"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/630c40ff866a3db218a952ebd4fd2a5cfe1543a1a467e738cb46a2ad4012d6f1","K3S_TOKEN":"********","K3S_U...
                    management.cattle.io/pod-limits: {}
                    management.cattle.io/pod-requests: {"pods":"4"}
                    nfd.node.kubernetes.io/extended-resources: 
                    nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.FMA3,cpu-cpuid.SHA,cpu-cpuid.SSE4A,cpu-hardware_multithreading,cpu-rd...
                    nfd.node.kubernetes.io/master.version: v0.6.0
                    nfd.node.kubernetes.io/worker.version: v0.6.0
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 13 Mar 2023 08:47:09 +0100
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  geo-node1
  AcquireTime:     <unset>
  RenewTime:       Thu, 23 Mar 2023 08:32:08 +0100
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Thu, 23 Mar 2023 08:30:37 +0100   Wed, 22 Mar 2023 16:18:39 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Thu, 23 Mar 2023 08:30:37 +0100   Wed, 22 Mar 2023 16:18:39 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Thu, 23 Mar 2023 08:30:37 +0100   Wed, 22 Mar 2023 16:18:39 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Thu, 23 Mar 2023 08:30:37 +0100   Wed, 22 Mar 2023 16:18:39 +0100   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  XXX.XXX.XXX.XXX
  Hostname:    geo-node1
Capacity:
  cpu:                128
  ephemeral-storage:  14625108Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056410708Ki
  pods:               110
Allocatable:
  cpu:                128
  ephemeral-storage:  14227305052
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056410708Ki
  pods:               110
System Info:
  Machine ID:                 f7b72f135bcc4a0195cd924d62fd6437
  System UUID:                4c4c4544-0056-5710-8030-c4c04f4a5433
  Boot ID:                    c6549fac-8176-4c3e-95f3-8e369f793af8
  Kernel Version:             5.15.0-67-generic
  OS Image:                   Ubuntu 22.04.1 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.15-k3s1
  Kubelet Version:            v1.25.6+k3s1
  Kube-Proxy Version:         v1.25.6+k3s1
PodCIDR:                      10.42.1.0/24
PodCIDRs:                     10.42.1.0/24
ProviderID:                   k3s://geo-node1
Non-terminated Pods:          (4 in total)
  Namespace                   Name                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                    ------------  ----------  ---------------  -------------  ---
  default                     gpu-feature-discovery-d5txw             0 (0%)        0 (0%)      0 (0%)           0 (0%)         14h
  kube-system                 nvidia-device-plugin-daemonset-gwwtd    0 (0%)        0 (0%)      0 (0%)           0 (0%)         14h
  kube-system                 svclb-traefik-a0d27a00-wjvwp            0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d17h
  node-feature-discovery      nfd-spvb2                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         14h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
  hugepages-1Gi      0 (0%)    0 (0%)
  hugepages-2Mi      0 (0%)    0 (0%)
Events:              <none>

The labels are set, but Capacity and Allocatable do not mention any GPUs; I assume there should be additional entries for them.
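
As an aside, runtimeClassName: nvidia can only be resolved if a RuntimeClass object of that name exists and maps to the containerd runtime handler configured earlier; a minimal sketch of such an object (the handler must match the runtimes."nvidia" entry in the containerd config):

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia   # referenced by runtimeClassName: nvidia in the pod specs
handler: nvidia  # must match the containerd runtime name configured for k3s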

I found these issues helpful:

MichaelJendryke commented 1 year ago

After some tinkering I can report that I got it to work just fine. I had forgotten runtimeClassName: nvidia in my NVIDIA device plugin; after adding that, everything went well.

For reference and in order:

  1. Get NFD to work with

    # This template contains an example of running nfd-master and nfd-worker in the
    # same pod.
    #
    apiVersion: v1
    kind: Namespace
    metadata:
      name: node-feature-discovery # NFD namespace
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: nfd-master
      namespace: node-feature-discovery
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: nfd-master
    rules:
    - apiGroups:
      - ""
      resources:
      - nodes
      verbs:
      - get
      - patch
      - update
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: nfd-master
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: nfd-master
    subjects:
    - kind: ServiceAccount
      name: nfd-master
      namespace: node-feature-discovery
    ---
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      labels:
        app: nfd
      name: nfd
      namespace: node-feature-discovery
    spec:
      selector:
        matchLabels:
          app: nfd
      template:
        metadata:
          labels:
            app: nfd
        spec:
          serviceAccount: nfd-master
          runtimeClassName: nvidia
          containers:
            - env:
              - name: NODE_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: spec.nodeName
              image: quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
              name: nfd-master
              command:
                - "nfd-master"
              args:
                - "--extra-label-ns=nvidia.com"
            - env:
              - name: NODE_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: spec.nodeName
              image: quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
              name: nfd-worker
              command:
                - "nfd-worker"
              args:
                - "--sleep-interval=60s"
                - "--options={\"sources\": {\"pci\": { \"deviceLabelFields\": [\"vendor\"] }}}"
              volumeMounts:
                - name: host-boot
                  mountPath: "/host-boot"
                  readOnly: true
                - name: host-os-release
                  mountPath: "/host-etc/os-release"
                  readOnly: true
                - name: host-sys
                  mountPath: "/host-sys"
                - name: source-d
                  mountPath: "/etc/kubernetes/node-feature-discovery/source.d/"
                - name: features-d
                  mountPath: "/etc/kubernetes/node-feature-discovery/features.d/"
          volumes:
            - name: host-boot
              hostPath:
                path: "/boot"
            - name: host-os-release
              hostPath:
                path: "/etc/os-release"
            - name: host-sys
              hostPath:
                path: "/sys"
            - name: source-d
              hostPath:
                path: "/etc/kubernetes/node-feature-discovery/source.d/"
            - name: features-d
              hostPath:
                path: "/etc/kubernetes/node-feature-discovery/features.d/"
  2. Get GFD running with

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: gpu-feature-discovery
      labels:
        app.kubernetes.io/name: gpu-feature-discovery
        app.kubernetes.io/version: 0.7.0
        app.kubernetes.io/part-of: nvidia-gpu
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: gpu-feature-discovery
          app.kubernetes.io/part-of: nvidia-gpu
      template:
        metadata:
          labels:
            app.kubernetes.io/name: gpu-feature-discovery
            app.kubernetes.io/version: 0.7.0
            app.kubernetes.io/part-of: nvidia-gpu
        spec:
          runtimeClassName: nvidia
          containers:
            - image: nvcr.io/nvidia/gpu-feature-discovery:v0.7.0
              name: gpu-feature-discovery
              volumeMounts:
                - name: output-dir
                  mountPath: "/etc/kubernetes/node-feature-discovery/features.d"
                - name: host-sys
                  mountPath: "/sys"
              securityContext:
                privileged: true
              env:
                - name: MIG_STRATEGY
                  value: none
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  # On discrete-GPU based systems NFD adds the following label, where 10de is the NVIDIA PCI vendor ID
                  - key: feature.node.kubernetes.io/pci-10de.present
                    operator: In
                    values:
                    - "true"
                - matchExpressions:
                  # On some Tegra-based systems NFD detects the CPU vendor ID as NVIDIA
                  - key: feature.node.kubernetes.io/cpu-model.vendor_id
                    operator: In
                    values:
                    - "NVIDIA"
                - matchExpressions:
                  # We allow a GFD deployment to be forced by setting the following label to "true"
                  - key: "nvidia.com/gpu.present"
                    operator: In
                    values:
                    - "true"
          volumes:
            - name: output-dir
              hostPath:
                path: "/etc/kubernetes/node-feature-discovery/features.d"
            - name: host-sys
              hostPath:
                path: "/sys"

    If this is running you should see labels being applied to your node.

  3. Get the NVIDIA device plugin to work with

    
    # Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: nvidia-device-plugin-daemonset
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          name: nvidia-device-plugin-ds
      updateStrategy:
        type: RollingUpdate
      template:
        metadata:
          labels:
            name: nvidia-device-plugin-ds
        spec:
          runtimeClassName: nvidia
          tolerations:

With runtimeClassName: nvidia added to the device plugin as well, Capacity and Allocatable finally list the GPUs:

Capacity:
  cpu:                128
  ephemeral-storage:  14625108Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056410708Ki
  nvidia.com/gpu:     2
  pods:               110
Allocatable:
  cpu:                128
  ephemeral-storage:  14227305052
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056410708Ki
  nvidia.com/gpu:     2
  pods:               110

After that you can run a GPU pod, as documented in the k3s guide.
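
For instance, a minimal smoke-test pod along those lines (a sketch; the pod name and CUDA image tag are assumptions, any CUDA-capable image matching the installed driver works):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # hypothetical name
spec:
  runtimeClassName: nvidia        # use the NVIDIA containerd runtime configured above
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.1.0-base-ubuntu22.04   # assumed tag; pick one matching CUDA 12.1
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1       # consumes one of the two advertised Tesla T4s

If everything is wired up, kubectl logs gpu-smoke-test prints an nvidia-smi table for the single allocated GPU.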

elezar commented 7 months ago

I'm closing this issue. The use of a runtime class allowed the labels to be generated.