NVIDIA / gpu-feature-discovery

GPU plugin to the node feature discovery for Kubernetes
Apache License 2.0

GFD returns 'no labels generated from any source' #36

Closed MichaelJendryke closed 7 months ago

MichaelJendryke commented 1 year ago

Dear all,

I have a setup of k3s and rancher on three nodes. One node has two Tesla T4 GPUs.

Running nvidia-smi on the node directly returns

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                        On | 00000000:41:00.0 Off |                    0 |
| N/A   34C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                        On | 00000000:A1:00.0 Off |                    0 |
| N/A   34C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

This tells me that the driver is installed correctly and that I can proceed with the k3s guide.

The content of /var/lib/rancher/k3s/agent/etc/containerd/config.toml is:

version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/k3s/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/var/lib/rancher/k3s/data/630c40ff866a3db218a952ebd4fd2a5cfe1543a1a467e738cb46a2ad4012d6f1/bin"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

The default_runtime_name = "nvidia" line is the one I added.

I continue with

  1. Node Feature Discovery (NFD)

    kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-feature-discovery/v0.7.0/deployments/static/nfd.yaml

    and also

    kubectl apply -k https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.12.1

    which shows the following in the NFD logs:

    REQUEST Node: geo-node1
    NFD-version: v0.6.0
    Labels: map[
    cpu-cpuid.ADX:true
    cpu-cpuid.AESNI:true
    cpu-cpuid.AVX:true
    cpu-cpuid.AVX2:true
    cpu-cpuid.FMA3:true
    cpu-cpuid.SHA:true
    cpu-cpuid.SSE4A:true
    cpu-hardware_multithreading:true
    cpu-rdt.RDTCMT:true
    cpu-rdt.RDTL3CA:true
    cpu-rdt.RDTMBM:true
    cpu-rdt.RDTMON:true
    iommu-enabled:true
    kernel-config.NO_HZ:true
    kernel-config.NO_HZ_IDLE:true
    kernel-version.full:5.15.0-67-generic
    kernel-version.major:5
    kernel-version.minor:15
    kernel-version.revision:0
    memory-numa:true
    nvidia.com/gfd.timestamp:1679476204
    pci-102b.present:true
    pci-10de.present:true
    pci-10de.sriov.capable:true
    storage-nonrotationaldisk:true
    system-os_release.ID:ubuntu
    system-os_release.VERSION_ID:22.04
    system-os_release.VERSION_ID.major:22
    system-os_release.VERSION_ID.minor:04
    ]

    This output mentions nvidia.com and pci-10de, suggesting that discovery was successful, as I do not get these entries on my non-GPU nodes.

  2. NVIDIA GPU Feature Discovery (GFD)

    kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-feature-discovery/v0.7.0/deployments/static/gpu-feature-discovery-daemonset.yaml

    After applying the above GFD DaemonSet, the logs show:

    2023/03/22 09:10:04 Starting OS watcher.
    2023/03/22 09:10:04 Loading configuration.
    2023/03/22 09:10:04 
    Running with config:
    {
      "version": "v1",
      "flags": {
        "migStrategy": "none",
        "failOnInitError": true,
        "gdsEnabled": null,
        "mofedEnabled": null,
        "gfd": {
          "oneshot": false,
          "noTimestamp": false,
          "sleepInterval": "1m0s",
          "outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
          "machineTypeFile": "/sys/class/dmi/id/product_name"
        }
      },
      "resources": {
        "gpus": null
      },
      "sharing": {
        "timeSlicing": {}
      }
    }
    2023/03/22 09:10:04 Detected non-NVML platform: could not load NVML: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
    2023/03/22 09:10:04 Detected non-Tegra platform: /sys/devices/soc0/family file not found
    2023/03/22 09:10:04 WARNING: No valid resources detected; using empty manager.
    2023/03/22 09:10:04 Start running
    2023/03/22 09:10:04 Warning: no labels generated from any source
    2023/03/22 09:10:04 Writing labels to output file
    2023/03/22 09:10:04 Sleeping for 60000000000
    2023/03/22 09:11:04 Warning: no labels generated from any source
    2023/03/22 09:11:04 Writing labels to output file
    2023/03/22 09:11:04 Sleeping for 60000000000

    It says that no labels were generated. Is this because of the warning "No valid resources detected; using empty manager"?

As NFD seems to work but GFD does not, I exec'd into the GFD DaemonSet pod and ran gpu-feature-discovery from the command line, but got the same output.

Note: I tried this with nvidia-container-toolkit 1.12.1 and 1.13.0-rc.2.

The NVIDIA-related packages installed on the node are:

+++-==================================-==========================-============-=========================================================
un  libgldispatch0-nvidia              <none>                     <none>       (no description available)
ii  libnvidia-cfg1-530:amd64           530.30.02-0ubuntu1         amd64        NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any                 <none>                     <none>       (no description available)
un  libnvidia-common                   <none>                     <none>       (no description available)
ii  libnvidia-common-530               530.30.02-0ubuntu1         all          Shared files used by the NVIDIA libraries
un  libnvidia-compute                  <none>                     <none>       (no description available)
rc  libnvidia-compute-515-server:amd64 515.86.01-0ubuntu0.22.04.2 amd64        NVIDIA libcompute package
ii  libnvidia-compute-530:amd64        530.30.02-0ubuntu1         amd64        NVIDIA libcompute package
ii  libnvidia-container-tools          1.13.0~rc.2-1              amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64         1.13.0~rc.2-1              amd64        NVIDIA container runtime library
un  libnvidia-decode                   <none>                     <none>       (no description available)
ii  libnvidia-decode-530:amd64         530.30.02-0ubuntu1         amd64        NVIDIA Video Decoding runtime libraries
un  libnvidia-encode                   <none>                     <none>       (no description available)
ii  libnvidia-encode-530:amd64         530.30.02-0ubuntu1         amd64        NVENC Video Encoding runtime library
un  libnvidia-extra                    <none>                     <none>       (no description available)
ii  libnvidia-extra-530:amd64          530.30.02-0ubuntu1         amd64        Extra libraries for the NVIDIA driver
un  libnvidia-fbc1                     <none>                     <none>       (no description available)
ii  libnvidia-fbc1-530:amd64           530.30.02-0ubuntu1         amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
un  libnvidia-gl                       <none>                     <none>       (no description available)
ii  libnvidia-gl-530:amd64             530.30.02-0ubuntu1         amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un  libnvidia-ml1                      <none>                     <none>       (no description available)
un  nvidia-384                         <none>                     <none>       (no description available)
un  nvidia-390                         <none>                     <none>       (no description available)
un  nvidia-common                      <none>                     <none>       (no description available)
un  nvidia-compute-utils               <none>                     <none>       (no description available)
rc  nvidia-compute-utils-515-server    515.86.01-0ubuntu0.22.04.2 amd64        NVIDIA compute utilities
ii  nvidia-compute-utils-530           530.30.02-0ubuntu1         amd64        NVIDIA compute utilities
un  nvidia-container-runtime           <none>                     <none>       (no description available)
un  nvidia-container-runtime-hook      <none>                     <none>       (no description available)
ii  nvidia-container-toolkit           1.13.0~rc.2-1              amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base      1.13.0~rc.2-1              amd64        NVIDIA Container Toolkit Base
rc  nvidia-dkms-515-server             515.86.01-0ubuntu0.22.04.2 amd64        NVIDIA DKMS package
ii  nvidia-dkms-530                    530.30.02-0ubuntu1         amd64        NVIDIA DKMS package
un  nvidia-dkms-kernel                 <none>                     <none>       (no description available)
ii  nvidia-driver-530                  530.30.02-0ubuntu1         amd64        NVIDIA driver metapackage
un  nvidia-driver-binary               <none>                     <none>       (no description available)
un  nvidia-fabricmanager               <none>                     <none>       (no description available)
ii  nvidia-fabricmanager-515           515.86.01-0ubuntu0.22.04.2 amd64        Fabric Manager for NVSwitch based systems.
un  nvidia-kernel-common               <none>                     <none>       (no description available)
rc  nvidia-kernel-common-515-server    515.86.01-0ubuntu0.22.04.2 amd64        Shared files used with the kernel module
ii  nvidia-kernel-common-530           530.30.02-0ubuntu1         amd64        Shared files used with the kernel module
un  nvidia-kernel-open                 <none>                     <none>       (no description available)
un  nvidia-kernel-open-530             <none>                     <none>       (no description available)
un  nvidia-kernel-source               <none>                     <none>       (no description available)
un  nvidia-kernel-source-515-server    <none>                     <none>       (no description available)
ii  nvidia-kernel-source-530           530.30.02-0ubuntu1         amd64        NVIDIA kernel source package
ii  nvidia-modprobe                    530.30.02-0ubuntu1         amd64        Load the NVIDIA kernel driver and create device files
un  nvidia-opencl-icd                  <none>                     <none>       (no description available)
un  nvidia-persistenced                <none>                     <none>       (no description available)
ii  nvidia-prime                       0.8.17.1                   all          Tools to enable NVIDIA's Prime
ii  nvidia-settings                    530.30.02-0ubuntu1         amd64        Tool for configuring the NVIDIA graphics driver
un  nvidia-settings-binary             <none>                     <none>       (no description available)
un  nvidia-smi                         <none>                     <none>       (no description available)
un  nvidia-utils                       <none>                     <none>       (no description available)
ii  nvidia-utils-530                   530.30.02-0ubuntu1         amd64        NVIDIA driver support binaries
ii  xserver-xorg-video-nvidia-530      530.30.02-0ubuntu1         amd64        NVIDIA binary Xorg driver
klueska commented 1 year ago

On k3s you need to update /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl not /var/lib/rancher/k3s/agent/etc/containerd/config.toml, otherwise your config will get overwritten.
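
For reference, here is a minimal sketch of such a template, under the assumption (per the k3s containerd docs) that k3s regenerates config.toml from config.toml.tmpl on every start, so the template is typically a verbatim copy of the generated config plus the NVIDIA additions:

# /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
version = 2

# ... all other sections copied unchanged from the generated config.toml shown above ...

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true
  default_runtime_name = "nvidia"  # only if the NVIDIA runtime should be the cluster-wide default

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"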

MichaelJendryke commented 1 year ago

Thanks for the answer @klueska

I found /etc/containerd/config.toml:

#   Copyright 2018-2022 Docker Inc.

#   Licensed under the Apache License, Version 2.0 (the "License");
#   you may not use this file except in compliance with the License.
#   You may obtain a copy of the License at

#       http://www.apache.org/licenses/LICENSE-2.0

#   Unless required by applicable law or agreed to in writing, software
#   distributed under the License is distributed on an "AS IS" BASIS,
#   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#   See the License for the specific language governing permissions and
#   limitations under the License.

disabled_plugins = ["cri"]

#root = "/var/lib/containerd"
#state = "/run/containerd"
#subreaper = true
#oom_score = 0

#[grpc]
#  address = "/run/containerd/containerd.sock"
#  uid = 0
#  gid = 0

#[debug]
#  address = "/run/containerd/debug.sock"
#  uid = 0
#  gid = 0
#  level = "info"

and /var/lib/rancher/k3s/agent/etc/containerd/config.toml:

version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/k3s/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/var/lib/rancher/k3s/data/630c40ff866a3db218a952ebd4fd2a5cfe1543a1a467e738cb46a2ad4012d6f1/bin"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

But as I did not have /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl, described here, I took the template from this blog post.

Restarting containerd overwrites /var/lib/rancher/k3s/agent/etc/containerd/config.toml with:

[plugins.opt]
  path = "/var/lib/rancher/k3s/agent/containerd"

[plugins.cri]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins.cri.cni]
  bin_dir = "/var/lib/rancher/k3s/data/630c40ff866a3db218a952ebd4fd2a5cfe1543a1a467e738cb46a2ad4012d6f1/bin"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"

[plugins.cri.containerd.runtimes.runc]
  # ---- changed from 'io.containerd.runc.v2' for GPU support
  runtime_type = "io.containerd.runtime.v1.linux"

# ---- added for GPU support
[plugins.linux]
  runtime = "nvidia-container-runtime"

But unfortunately the GFD DaemonSet does not start after that:

Warning  FailedCreatePodSandBox  14s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: cgroups: cgroup mountpoint does not exist: unknown

I guess the tmpl I found is outdated. Could you please point me to the docs to create this file?

klueska commented 1 year ago

https://docs.k3s.io/advanced#configuring-containerd

MichaelJendryke commented 1 year ago

I have tried to follow tutorials that do not set the default runtime to nvidia (e.g. this). Instead I am trying to follow this. I have modified the NFD, GFD, and NVIDIA device plugin YAML files to use runtimeClassName: nvidia, which results in the following output of kubectl describe node geo-node1 after GFD started.

Name:               geo-node1
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=k3s
                    beta.kubernetes.io/os=linux
                    egress.k3s.io/cluster=true
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.SHA=true
                    feature.node.kubernetes.io/cpu-cpuid.SSE4A=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/cpu-rdt.RDTCMT=true
                    feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMBM=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMON=true
                    feature.node.kubernetes.io/iommu-enabled=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
                    feature.node.kubernetes.io/kernel-version.full=5.15.0-67-generic
                    feature.node.kubernetes.io/kernel-version.major=5
                    feature.node.kubernetes.io/kernel-version.minor=15
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/memory-numa=true
                    feature.node.kubernetes.io/pci-102b.present=true
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-10de.sriov.capable=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=ubuntu
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=22
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
                    has_gpu=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=geo-node1
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=k3s
                    nvidia.com/cuda.driver.major=530
                    nvidia.com/cuda.driver.minor=30
                    nvidia.com/cuda.driver.rev=02
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=1
                    nvidia.com/gfd.timestamp=1679506166
                    nvidia.com/gpu.compute.major=7
                    nvidia.com/gpu.compute.minor=5
                    nvidia.com/gpu.count=2
                    nvidia.com/gpu.family=turing
                    nvidia.com/gpu.machine=PowerEdge-R7525
                    nvidia.com/gpu.memory=15360
                    nvidia.com/gpu.product=Tesla-T4
                    nvidia.com/gpu.replicas=1
                    nvidia.com/mig.capable=false
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"8a:ec:37:1f:e5:1a"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: XXX.XXX.XXX.XXX
                    k3s.io/hostname: geo-node1
                    k3s.io/internal-ip: XXX.XXX.XXX.XXX
                    k3s.io/node-args: ["agent"]
                    k3s.io/node-config-hash: FZCHZFCL5KBSRTBWGCGIBHGDW6FDW2LIRBXCGNWFODJI3CKLHCOQ====
                    k3s.io/node-env:
                      {"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/630c40ff866a3db218a952ebd4fd2a5cfe1543a1a467e738cb46a2ad4012d6f1","K3S_TOKEN":"********","K3S_U...
                    management.cattle.io/pod-limits: {}
                    management.cattle.io/pod-requests: {"pods":"4"}
                    nfd.node.kubernetes.io/extended-resources: 
                    nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.FMA3,cpu-cpuid.SHA,cpu-cpuid.SSE4A,cpu-hardware_multithreading,cpu-rd...
                    nfd.node.kubernetes.io/master.version: v0.6.0
                    nfd.node.kubernetes.io/worker.version: v0.6.0
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 13 Mar 2023 08:47:09 +0100
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  geo-node1
  AcquireTime:     <unset>
  RenewTime:       Thu, 23 Mar 2023 08:32:08 +0100
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Thu, 23 Mar 2023 08:30:37 +0100   Wed, 22 Mar 2023 16:18:39 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Thu, 23 Mar 2023 08:30:37 +0100   Wed, 22 Mar 2023 16:18:39 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Thu, 23 Mar 2023 08:30:37 +0100   Wed, 22 Mar 2023 16:18:39 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Thu, 23 Mar 2023 08:30:37 +0100   Wed, 22 Mar 2023 16:18:39 +0100   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  XXX.XXX.XXX.XXX
  Hostname:    geo-node1
Capacity:
  cpu:                128
  ephemeral-storage:  14625108Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056410708Ki
  pods:               110
Allocatable:
  cpu:                128
  ephemeral-storage:  14227305052
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056410708Ki
  pods:               110
System Info:
  Machine ID:                 f7b72f135bcc4a0195cd924d62fd6437
  System UUID:                4c4c4544-0056-5710-8030-c4c04f4a5433
  Boot ID:                    c6549fac-8176-4c3e-95f3-8e369f793af8
  Kernel Version:             5.15.0-67-generic
  OS Image:                   Ubuntu 22.04.1 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.15-k3s1
  Kubelet Version:            v1.25.6+k3s1
  Kube-Proxy Version:         v1.25.6+k3s1
PodCIDR:                      10.42.1.0/24
PodCIDRs:                     10.42.1.0/24
ProviderID:                   k3s://geo-node1
Non-terminated Pods:          (4 in total)
  Namespace                   Name                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                    ------------  ----------  ---------------  -------------  ---
  default                     gpu-feature-discovery-d5txw             0 (0%)        0 (0%)      0 (0%)           0 (0%)         14h
  kube-system                 nvidia-device-plugin-daemonset-gwwtd    0 (0%)        0 (0%)      0 (0%)           0 (0%)         14h
  kube-system                 svclb-traefik-a0d27a00-wjvwp            0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d17h
  node-feature-discovery      nfd-spvb2                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         14h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
  hugepages-1Gi      0 (0%)    0 (0%)
  hugepages-2Mi      0 (0%)    0 (0%)
Events:              <none>

The labels are set, but Capacity and Allocatable do not mention any GPUs; I assume there should be additional entries for them.
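
As an aside, runtimeClassName: nvidia can only be resolved if a RuntimeClass object of that name exists and maps to the containerd runtime handler configured earlier; a minimal sketch of such an object (the handler must match the runtimes."nvidia" entry in the containerd config):

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia   # referenced by runtimeClassName: nvidia in the pod specs
handler: nvidia  # must match the containerd runtime name configured for k3s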

I found these issues helpful:

MichaelJendryke commented 1 year ago

After some tinkering I can report that I got it to work just fine. I had forgotten runtimeClassName: nvidia in my NVIDIA device plugin; after adding that, everything went well.

For reference and in order:

  1. Get NFD to work with

    # This template contains an example of running nfd-master and nfd-worker in the
    # same pod.
    #
    apiVersion: v1
    kind: Namespace
    metadata:
      name: node-feature-discovery # NFD namespace
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: nfd-master
      namespace: node-feature-discovery
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: nfd-master
    rules:
    - apiGroups:
      - ""
      resources:
      - nodes
      verbs:
      - get
      - patch
      - update
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: nfd-master
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: nfd-master
    subjects:
    - kind: ServiceAccount
      name: nfd-master
      namespace: node-feature-discovery
    ---
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      labels:
        app: nfd
      name: nfd
      namespace: node-feature-discovery
    spec:
      selector:
        matchLabels:
          app: nfd
      template:
        metadata:
          labels:
            app: nfd
        spec:
          serviceAccount: nfd-master
          runtimeClassName: nvidia
          containers:
            - env:
              - name: NODE_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: spec.nodeName
              image: quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
              name: nfd-master
              command:
                - "nfd-master"
              args:
                - "--extra-label-ns=nvidia.com"
            - env:
              - name: NODE_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: spec.nodeName
              image: quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
              name: nfd-worker
              command:
                - "nfd-worker"
              args:
                - "--sleep-interval=60s"
                - "--options={\"sources\": {\"pci\": { \"deviceLabelFields\": [\"vendor\"] }}}"
              volumeMounts:
                - name: host-boot
                  mountPath: "/host-boot"
                  readOnly: true
                - name: host-os-release
                  mountPath: "/host-etc/os-release"
                  readOnly: true
                - name: host-sys
                  mountPath: "/host-sys"
                - name: source-d
                  mountPath: "/etc/kubernetes/node-feature-discovery/source.d/"
                - name: features-d
                  mountPath: "/etc/kubernetes/node-feature-discovery/features.d/"
          volumes:
            - name: host-boot
              hostPath:
                path: "/boot"
            - name: host-os-release
              hostPath:
                path: "/etc/os-release"
            - name: host-sys
              hostPath:
                path: "/sys"
            - name: source-d
              hostPath:
                path: "/etc/kubernetes/node-feature-discovery/source.d/"
            - name: features-d
              hostPath:
                path: "/etc/kubernetes/node-feature-discovery/features.d/"
  2. Get GFD running with

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: gpu-feature-discovery
      labels:
        app.kubernetes.io/name: gpu-feature-discovery
        app.kubernetes.io/version: 0.7.0
        app.kubernetes.io/part-of: nvidia-gpu
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: gpu-feature-discovery
          app.kubernetes.io/part-of: nvidia-gpu
      template:
        metadata:
          labels:
            app.kubernetes.io/name: gpu-feature-discovery
            app.kubernetes.io/version: 0.7.0
            app.kubernetes.io/part-of: nvidia-gpu
        spec:
          runtimeClassName: nvidia
          containers:
            - image: nvcr.io/nvidia/gpu-feature-discovery:v0.7.0
              name: gpu-feature-discovery
              volumeMounts:
                - name: output-dir
                  mountPath: "/etc/kubernetes/node-feature-discovery/features.d"
                - name: host-sys
                  mountPath: "/sys"
              securityContext:
                privileged: true
              env:
                - name: MIG_STRATEGY
                  value: none
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  # On discrete-GPU based systems NFD adds the following label, where 10de is the NVIDIA PCI vendor ID
                  - key: feature.node.kubernetes.io/pci-10de.present
                    operator: In
                    values:
                    - "true"
                - matchExpressions:
                  # On some Tegra-based systems NFD detects the CPU vendor ID as NVIDIA
                  - key: feature.node.kubernetes.io/cpu-model.vendor_id
                    operator: In
                    values:
                    - "NVIDIA"
                - matchExpressions:
                  # We allow a GFD deployment to be forced by setting the following label to "true"
                  - key: "nvidia.com/gpu.present"
                    operator: In
                    values:
                    - "true"
          volumes:
            - name: output-dir
              hostPath:
                path: "/etc/kubernetes/node-feature-discovery/features.d"
            - name: host-sys
              hostPath:
                path: "/sys"

    If this is running you should see labels being applied to your node.

  3. Get the NVIDIA device plugin to work with

    
    # Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: nvidia-device-plugin-daemonset
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          name: nvidia-device-plugin-ds
      updateStrategy:
        type: RollingUpdate
      template:
        metadata:
          labels:
            name: nvidia-device-plugin-ds
        spec:
          runtimeClassName: nvidia
          tolerations:

With runtimeClassName: nvidia added to the device plugin as well, Capacity and Allocatable finally list the GPUs:

Capacity:
  cpu:                128
  ephemeral-storage:  14625108Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056410708Ki
  nvidia.com/gpu:     2
  pods:               110
Allocatable:
  cpu:                128
  ephemeral-storage:  14227305052
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056410708Ki
  nvidia.com/gpu:     2
  pods:               110

After that you can run a GPU pod, as documented in the k3s guide.
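
For instance, a minimal smoke-test pod along those lines (a sketch; the pod name and CUDA image tag are assumptions, any CUDA-capable image matching the installed driver works):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # hypothetical name
spec:
  runtimeClassName: nvidia        # use the NVIDIA containerd runtime configured above
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.1.0-base-ubuntu22.04   # assumed tag; pick one matching CUDA 12.1
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1       # consumes one of the two advertised Tesla T4s

If everything is wired up, kubectl logs gpu-smoke-test prints an nvidia-smi table for the single allocated GPU.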

elezar commented 7 months ago

I'm closing this issue. The use of a runtime class allowed the labels to be generated.