NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Pods are not scheduled in all GPUs of a physical server. #328

Closed shan100github closed 2 years ago

shan100github commented 2 years ago

Description

With the hardware configuration below, while trying to deploy the NVIDIA Triton service with 4 replicas on this server (1 GPU per replica), 3 pods were running but the 4th pod did not spin up, and the following error was displayed.

FailedScheduling    pod/model-2-79d7d6786c-bprm8    0/1 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {protect: no_schedule}, that the pod didn't tolerate, 1 node(s) didn't match Pod's node affinity/selector.

Information about the environment

Deployment file used:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: model-infr
  name: model-2
spec:
  replicas: 4
  selector:
    matchLabels:
      app: model-infr
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: model-infr
    spec:
      containers:
      - args:
        - pip3 install opencv-python-headless && tritonserver --model-store=s3://model-infr/
        command:
        - /bin/sh
        - -c
        image: nvcr.io/nvidia/tritonserver:22.06-py3
        imagePullPolicy: IfNotPresent
        name: tritonserver
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        - containerPort: 8001
          name: grpc
          protocol: TCP
        - containerPort: 8002
          name: metrics
          protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: "1"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-RTX-A5000
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir:
          medium: Memory
        name: dshm

While checking nvidia-smi on the actual system, I got the output below, which clearly shows that GPU 0 was free to schedule.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54       Driver Version: 510.54       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    Off  | 00000000:31:00.0 Off |                  Off |
| 30%   46C    P8    22W / 230W |     10MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    Off  | 00000000:4B:00.0 Off |                  Off |
| 48%   76C    P2   149W / 230W |  13222MiB / 24564MiB |     29%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000    Off  | 00000000:B1:00.0 Off |                  Off |
| 52%   81C    P2   166W / 230W |  13222MiB / 24564MiB |     57%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000    Off  | 00000000:CA:00.0 Off |                  Off |
| 52%   80C    P2   176W / 230W |  13222MiB / 24564MiB |     66%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A    911402      C   tritonserver                    13209MiB |
|    2   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A    911401      C   tritonserver                    13209MiB |
|    3   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A    911223      C   tritonserver                    13209MiB |
+-----------------------------------------------------------------------------+

Expected behavior

Expecting the Triton pods to be scheduled across all 4 GPUs.

Common error checking:

 - [x] The k8s-device-plugin container logs

2022/08/11 06:57:10 Starting Plugins.
2022/08/11 06:57:10 Loading configuration.
2022/08/11 06:57:10 Initializing NVML.
2022/08/11 06:57:10 Updating config with default resource matching patterns.
2022/08/11 06:57:10 Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "uuid"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
2022/08/11 06:57:10 Retreiving plugins.
2022/08/11 06:57:10 No MIG devices found. Falling back to mig.strategy=none
2022/08/11 06:57:10 Starting GRPC server for 'nvidia.com/gpu'
2022/08/11 06:57:10 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/08/11 06:57:10 Registered device plugin for 'nvidia.com/gpu' with Kubelet



 - [x] The kubelet logs on the node (e.g: `sudo journalctl -r -u kubelet`)
No specific errors are logged with respect to this pod schedule or NVIDIA.

Additional information that might help better understand your environment and reproduce the bug:
 - [x] Kubelet version from `kubelet version`: Kubernetes v1.21.14+rke2r1
 - [x] Containerd version: containerd github.com/k3s-io/containerd v1.4.13-k3s1 04203d2174f8b8d05bcec98000fba67c0aa69223
klueska commented 2 years ago

From the error message, it appears that you have a taint set up on one of your nodes that your pod isn't tolerating, so the Kubernetes scheduler is blocking it from being scheduled there.

You should check what taints are applied on the node where the pod is not starting and make sure you either (1) add a toleration for that taint in your pod spec, or (2) remove the taint from the node.
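
For reference, a minimal toleration for that taint would look something like the following in the pod template spec (a sketch; it assumes the taint's effect is NoSchedule, which the scheduler message above does not spell out):

      tolerations:
      # Tolerate the taint reported by the scheduler: key "protect", value "no_schedule".
      - key: "protect"
        operator: "Equal"
        value: "no_schedule"
        effect: "NoSchedule"

Alternatively, the taint can be removed from the node (note the trailing "-"):

kubectl taint nodes <node-name> protect=no_schedule:NoSchedule-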

shan100github commented 2 years ago

I'm wondering: this server already has 1 GPU that is not being used, so why isn't the pod being scheduled on this server itself?

klueska commented 2 years ago

Because of the taint, as I mentioned before. Apparently your other nodes don’t have this taint set, but the one where the GPU is not being scheduled does.

shan100github commented 2 years ago

The tainted node is the RTX 4000 GPU node, whereas this deployment specifies the following node selector, and from the nvidia-smi output it's evident there is 1 RTX A5000 GPU that has not yet been scheduled with any workload.

      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-RTX-A5000

I have posted the output of nvidia-smi -a for GPU 0. Is there anything suspect with this GPU or its settings? Also, I have noticed that even when only 1 pod is scheduled on this node, it gets scheduled on GPU 1, even though GPU 0 is free.

klueska commented 2 years ago

As far as I can tell from your pod spec you don’t have a toleration set though. Pods will only land on nodes with taints set if they have a toleration for that taint (independent of their node selector).

shan100github commented 2 years ago

There is no taint set on this node with 4x RTX A5000 GPUs. From the same deployment with 4 replicas, 3 pods are scheduled on this node successfully through the nodeSelector. I'm wondering why the remaining pod is not scheduled on GPU 0.

That's why I was looking for information on anything suspect in the settings of GPU 0.

klueska commented 2 years ago

Sorry I misunderstood. I thought you had 4 machines each with 1 GPU and one of them wasn’t getting the pod scheduled on it.

So backing up…

What does the output of `kubectl describe node` show for this node in terms of how many GPUs it thinks it has (both in Capacity and Allocatable)?
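
A quick way to pull just those fields (a sketch, assuming the node name node-agent-4 that appears later in this thread):

kubectl describe node node-agent-4 | grep -A 8 -E '^(Capacity|Allocatable):'
kubectl get node node-agent-4 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}{"\n"}'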

shan100github commented 2 years ago

On the actual system, executing nvidia-smi shows 4 GPUs, and in `kubectl describe nodes node-agent-4` the label shows nvidia.com/gpu.count=4.

Name:               node-agent-4
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                   beta.kubernetes.io/instance-type=rke2
                   beta.kubernetes.io/os=linux
                   feature.node.kubernetes.io/cpu-cpuid.ADX=true
                   feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512BITALG=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512IFMA=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI2=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512VPOPCNTDQ=true
                   feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                   feature.node.kubernetes.io/cpu-cpuid.GFNI=true
                   feature.node.kubernetes.io/cpu-cpuid.IBPB=true
                   feature.node.kubernetes.io/cpu-cpuid.SHA=true
                   feature.node.kubernetes.io/cpu-cpuid.STIBP=true
                   feature.node.kubernetes.io/cpu-cpuid.VAES=true
                   feature.node.kubernetes.io/cpu-cpuid.VMX=true
                   feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ=true
                   feature.node.kubernetes.io/cpu-cpuid.WBNOINVD=true
                   feature.node.kubernetes.io/cpu-hardware_multithreading=true
                   feature.node.kubernetes.io/cpu-rdt.RDTCMT=true
                   feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true
                   feature.node.kubernetes.io/cpu-rdt.RDTMBA=true
                   feature.node.kubernetes.io/cpu-rdt.RDTMBM=true
                   feature.node.kubernetes.io/cpu-rdt.RDTMON=true
                   feature.node.kubernetes.io/custom-rdma.available=true
                   feature.node.kubernetes.io/kernel-config.NO_HZ=true
                   feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
                   feature.node.kubernetes.io/kernel-version.full=5.4.0-124-generic
                   feature.node.kubernetes.io/kernel-version.major=5
                   feature.node.kubernetes.io/kernel-version.minor=4
                   feature.node.kubernetes.io/kernel-version.revision=0
                   feature.node.kubernetes.io/memory-numa=true
                   feature.node.kubernetes.io/network-sriov.capable=true
                   feature.node.kubernetes.io/pci-10de.present=true
                   feature.node.kubernetes.io/pci-1a03.present=true
                   feature.node.kubernetes.io/pci-8086.present=true
                   feature.node.kubernetes.io/pci-8086.sriov.capable=true
                   feature.node.kubernetes.io/storage-nonrotationaldisk=true
                   feature.node.kubernetes.io/system-os_release.ID=ubuntu
                   feature.node.kubernetes.io/system-os_release.VERSION_ID=20.04
                   feature.node.kubernetes.io/system-os_release.VERSION_ID.major=20
                   feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
                   feature.node.kubernetes.io/usb-ef_0b1f_03ee.present=true
                   kubernetes.io/arch=amd64
                   kubernetes.io/hostname=sbyo-cube-pro-4u-1
                   kubernetes.io/os=linux
                   node.kubernetes.io/instance-type=rke2
                   nvidia.com/cuda.driver.major=510
                   nvidia.com/cuda.driver.minor=54
                   nvidia.com/cuda.driver.rev=
                   nvidia.com/cuda.runtime.major=11
                   nvidia.com/cuda.runtime.minor=7
                   nvidia.com/gfd.timestamp=1660123937
                   nvidia.com/gpu.compute.major=8
                   nvidia.com/gpu.compute.minor=6
                   nvidia.com/gpu.count=4
                   nvidia.com/gpu.deploy.container-toolkit=true
                   nvidia.com/gpu.deploy.dcgm=true
                   nvidia.com/gpu.deploy.dcgm-exporter=true
                   nvidia.com/gpu.deploy.device-plugin=true
                   nvidia.com/gpu.deploy.driver=true
                   nvidia.com/gpu.deploy.gpu-feature-discovery=true
                   nvidia.com/gpu.deploy.node-status-exporter=true
                   nvidia.com/gpu.deploy.operator-validator=true
                   nvidia.com/gpu.family=ampere
                   nvidia.com/gpu.machine=SYS-740GP-TNRT
                   nvidia.com/gpu.memory=25757220864
                   nvidia.com/gpu.present=true
                   nvidia.com/gpu.product=NVIDIA-RTX-A5000
                   nvidia.com/gpu.replicas=1
                   nvidia.com/mig.strategy=single
klueska commented 2 years ago

Those are just the labels applied by GFD. I want to know what the plugin has advertised to the kubelet and what the kubelet currently sees as the Capacity and Allocatable of the `nvidia.com/gpu` resource type. Also the currently allocated GPUs, which should be available if you run `kubectl describe node` on the node.

shan100github commented 2 years ago

Following is the information about Allocated resources.

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                16910m (49%)   8400m (24%)
  memory             18910Mi (14%)  18810Mi (14%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
  nvidia.com/gpu     4              4
klueska commented 2 years ago

This still isn't showing me "Capacity" and "Allocatable" of the resource type.

shan100github commented 2 years ago
Capacity:
  cpu:                34
  ephemeral-storage:  1921208612Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131619000Ki
  nvidia.com/gpu:     4
  pods:               110
Allocatable:
  cpu:                34
  ephemeral-storage:  1868951736288
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131619000Ki
  nvidia.com/gpu:     4
  pods:               110
klueska commented 2 years ago

What it is showing me, though, is that all 4 GPUs are currently assigned to pods. Can you show me the set of pods you have running? Is there a rogue one consuming a GPU somewhere that isn't part of your deployment?
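
To see every pod currently scheduled on that node, something like the following should work (a sketch, assuming the node name node-agent-4):

kubectl get pods -A --field-selector spec.nodeName=node-agent-4 -o wide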

shan100github commented 2 years ago

nvidia-smi output.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54       Driver Version: 510.54       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    Off  | 00000000:31:00.0 Off |                  Off |
| 30%   44C    P8    20W / 230W |     10MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    Off  | 00000000:4B:00.0 Off |                  Off |
| 30%   42C    P8    18W / 230W |  13222MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000    Off  | 00000000:B1:00.0 Off |                  Off |
| 30%   44C    P8    16W / 230W |  13222MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000    Off  | 00000000:CA:00.0 Off |                  Off |
| 30%   44C    P8    18W / 230W |  13222MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A    923258      C   tritonserver                    13209MiB |
|    2   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A    923768      C   tritonserver                    13209MiB |
|    3   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A    926416      C   tritonserver                    13209MiB |
+-----------------------------------------------------------------------------+

I don't see the tritonserver process on GPU 0.

shan100github commented 2 years ago

The following pods are running on this server.

  Namespace                   Name                                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                       ------------  ----------  ---------------  -------------  ---
  cattle-fleet-system         gitjob-cc9948fd7-qlrbc                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  cattle-monitoring-system    loki-0                                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         28h
  cattle-monitoring-system    loki-promtail-7mdgj                                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         28h
  cattle-monitoring-system    pushprox-kube-controller-manager-proxy-58f5d844c6-x29m6    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  cattle-monitoring-system    pushprox-kube-etcd-proxy-57df468748-zrmbx                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  cattle-monitoring-system    pushprox-kube-proxy-client-sdkv2                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  cattle-monitoring-system    pushprox-kube-proxy-proxy-78b4b985d4-b8d9g                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  cattle-monitoring-system    rancher-monitoring-kube-state-metrics-5bc8bb48bd-w22xl     100m (0%)     100m (0%)   130Mi (0%)       200Mi (0%)     2d8h
  cattle-monitoring-system    rancher-monitoring-prometheus-node-exporter-mbqrx          100m (0%)     200m (0%)   30Mi (0%)        50Mi (0%)      3d5h
  cattle-system               rancher-644bc45f4c-6tsv2                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  default                     model-0-5fb7c59b5c-b779l                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         33h
  default                     model-0-686f46547c-nthpd                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         36h
  gpu-operator                gpu-feature-discovery-zfq4r                                0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  gpu-operator                gpu-operator-node-feature-discovery-worker-zvhvn           0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  gpu-operator                nvidia-container-toolkit-daemonset-krrgh                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  gpu-operator                nvidia-dcgm-exporter-9jk7t                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  gpu-operator                nvidia-device-plugin-daemonset-k9cvz                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  gpu-operator                nvidia-operator-validator-sl5jz                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  kube-system                 cilium-dtkhx                                               100m (0%)     0 (0%)      100Mi (0%)       0 (0%)         3d5h
  kube-system                 cilium-node-init-ks925                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  kube-system                 kube-proxy-sbyo-cube-pro-4u-1                              250m (0%)     0 (0%)      0 (0%)           0 (0%)         2d7h
  kube-system                 kube-vip-ds-c7wbk                                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d12h
  kube-system                 rke2-coredns-rke2-coredns-6775f768c8-kwzf8                 100m (0%)     100m (0%)   128Mi (0%)       128Mi (0%)     2d8h
  kube-system                 rke2-ingress-nginx-controller-fkqzh                        100m (0%)     0 (0%)      90Mi (0%)        0 (0%)         3d5h
  kube-system                 rke2-metrics-server-8574659c85-wmtxh                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  locust                      locust-master-67568cdf46-59xw7                             1 (2%)        1 (2%)      4Gi (3%)         4Gi (3%)       36h
  locust                      locust-worker-f9b59d8fb-4lkg7                              1 (2%)        1 (2%)      2Gi (1%)         2Gi (1%)       30h
  locust                      locust-worker-f9b59d8fb-6vmrr                              1 (2%)        1 (2%)      2Gi (1%)         2Gi (1%)       30h
  locust                      locust-worker-f9b59d8fb-bznjq                              1 (2%)        1 (2%)      2Gi (1%)         2Gi (1%)       30h
  locust                      locust-worker-f9b59d8fb-d7qpx                              1 (2%)        1 (2%)      2Gi (1%)         2Gi (1%)       35h
  locust                      locust-worker-f9b59d8fb-fhkkv                              1 (2%)        1 (2%)      2Gi (1%)         2Gi (1%)       30h
  locust                      locust-worker-f9b59d8fb-lsn9k                              1 (2%)        1 (2%)      2Gi (1%)         2Gi (1%)       30h
  locust                      locust-worker-f9b59d8fb-qdqqh                              1 (2%)        1 (2%)      2Gi (1%)         2Gi (1%)       30h
  locust                      model-0-54f8d6c9bd-466d9                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         29h
  locust                      model-0-54f8d6c9bd-m4g2v                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         29h
  locust                      model-0-54f8d6c9bd-scpvb                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         29h
  longhorn-system             csi-attacher-8b4cc9cf6-6xx8j                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  longhorn-system             csi-provisioner-59b7b8b7b8-dmrln                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  longhorn-system             csi-resizer-68ccff94-5m5jk                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  longhorn-system             csi-snapshotter-6d7d679c98-np7vk                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  longhorn-system             engine-image-ei-d474e07c-vv5rr                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  longhorn-system             instance-manager-e-8f9d237c                                4080m (12%)   0 (0%)      0 (0%)           0 (0%)         3d5h
  longhorn-system             instance-manager-r-280a2608                                4080m (12%)   0 (0%)      0 (0%)           0 (0%)         3d5h
  longhorn-system             longhorn-csi-plugin-rmjsr                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  longhorn-system             longhorn-manager-5cqr8                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  longhorn-system             longhorn-ui-556866b6bb-6jrl4                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
klueska commented 2 years ago

Regardless of whether the triton server is running on the GPU or not, some pod must have requested / been given access to all 4 GPUs, otherwise we wouldn't see all 4 of them as Allocated in the output of describe node.

What does this show for that node:

kubectl describe pod -A | grep "nvidia.com/gpu"
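
A variant that also shows which pod each GPU limit belongs to (a sketch, assuming the node name node-agent-4 and that GPUs are requested via limits, as in the deployment above):

kubectl get pods -A --field-selector spec.nodeName=node-agent-4 \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{": "}{.spec.containers[*].resources.limits.nvidia\.com/gpu}{"\n"}{end}'

Any line with a non-empty value after the colon is a pod holding a GPU, whether or not a process is currently running on it.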
shan100github commented 2 years ago

Following is the output

kubectl describe pod -A | grep 5000 
Node-Selectors:              nvidia.com/gpu.product=NVIDIA-RTX-A5000
--
Node-Selectors:              nvidia.com/gpu.product=NVIDIA-RTX-A5000
--
Node-Selectors:              nvidia.com/gpu.product=NVIDIA-RTX-A5000
--
Node-Selectors:              nvidia.com/gpu.product=NVIDIA-RTX-A5000
shan100github commented 2 years ago

Also, 1 pod from the above `kubectl describe pod` output has the following event logged, whereas the other 3 pods are scheduled on the expected server through the nodeSelector.

Node-Selectors:              nvidia.com/gpu.product=NVIDIA-RTX-A5000
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  4m41s (x4 over 8m2s)  default-scheduler  0/6 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {protect: no_schedule}, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
klueska commented 2 years ago

I'm not worried about the node selector, I want to see which pods have nvidia.com/gpu resources attached to them.

From all of the evidence I see so far, nothing is operating incorrectly. You just seem to have 1 GPU already allocated to some other pod on that node, so only 3 of them get assigned to your triton-server deployment.

shan100github commented 2 years ago

I think the following output will give better info:

kubectl describe pod -n locust | egrep "nvidia.com|Node:"
Node:         agent-node-4/192.142.122.4
      nvidia.com/gpu:  1
      nvidia.com/gpu:  1
Node-Selectors:              nvidia.com/gpu.product=NVIDIA-RTX-A5000
Node:           <none>
      nvidia.com/gpu:  1
      nvidia.com/gpu:  1
Node-Selectors:              nvidia.com/gpu.product=NVIDIA-RTX-A5000
  Warning  FailedScheduling  23m                default-scheduler  0/6 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {protect: no_schedule}, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
Node:         agent-node-4/192.142.122.4
      nvidia.com/gpu:  1
      nvidia.com/gpu:  1
Node-Selectors:              nvidia.com/gpu.product=NVIDIA-RTX-A5000
Node:         agent-node-4/192.142.122.4
      nvidia.com/gpu:  1
      nvidia.com/gpu:  1
Node-Selectors:              nvidia.com/gpu.product=NVIDIA-RTX-A5000
Node:         agent-node-4/192.142.122.4
      nvidia.com/gpu:  1
      nvidia.com/gpu:  1

Also, https://github.com/NVIDIA/k8s-device-plugin/issues/328#issuecomment-1214162181 shows all the pods scheduled on the node. Per your assumption, if 1 GPU is allocated to some other process, why is it not displayed in the nvidia-smi output?

shan100github commented 2 years ago

Also, the nvidia-smi -a output for GPU 0 displays only the following processes.

    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1580
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 4 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2231
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 4 MiB

whereas for GPU 1 it shows the following:

    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1580
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 4 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2231
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 4 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 923258
            Type                          : C
            Name                          : tritonserver
            Used GPU Memory               : 13209 MiB
klueska commented 2 years ago

Just because a GPU has been allocated to a container doesn't mean it is running anything on it, in which case nvidia-smi won't help.

In your query above you limited the output to the locust namespace, but what are these pods:

  default                     model-0-5fb7c59b5c-b779l                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         33h
  default                     model-0-686f46547c-nthpd                                   0 (0%)        0 (0%) 

Is it possible that one of them has grabbed hold of a GPU on this node?

shan100github commented 2 years ago

The pods in the default namespace were crash-looping because the S3 creds were not passed. Even if a pod is crash-looping, is it possible for it to grab a GPU?

~Let me check the hardware by next week.~ Thanks for commenting @klueska

klueska commented 2 years ago

Yes. Allocation of the GPU happens at scheduling time. So if it's crash-looping then it's already been scheduled. And if it asked for a GPU then it is reserved for that pod and not available for anyone else (even if it's crashing).
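
So a Pending or crash-looping pod that requested nvidia.com/gpu still counts against the node's Allocatable until it is removed. A sketch of the cleanup, using the two default-namespace pods listed above (deleting a Deployment-managed pod only recreates it, so scaling down or deleting the owning Deployment is usually what actually frees the GPU; the Deployment name below is a placeholder):

# The crash-looping pods seen earlier in the default namespace.
kubectl -n default get pods model-0-5fb7c59b5c-b779l model-0-686f46547c-nthpd

# Find the owning Deployment, then scale it to 0 (or delete it) to release the GPU.
kubectl -n default get deploy
kubectl -n default scale deploy/<model-0-deployment> --replicas=0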

shan100github commented 2 years ago

Thank you for your comments and for sharing, @klueska. After deleting those crash-looping pods, the remaining pod was allocated to GPU 0.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54       Driver Version: 510.54       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    Off  | 00000000:31:00.0 Off |                  Off |
| 30%   45C    P8    25W / 230W |  12626MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    Off  | 00000000:4B:00.0 Off |                  Off |
| 30%   42C    P8    17W / 230W |  13222MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000    Off  | 00000000:B1:00.0 Off |                  Off |
| 30%   44C    P8    16W / 230W |  13222MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000    Off  | 00000000:CA:00.0 Off |                  Off |
| 30%   44C    P8    23W / 230W |  13222MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A   2463377      C   tritonserver                    12613MiB |
|    1   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A    923258      C   tritonserver                    13209MiB |
|    2   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A    923768      C   tritonserver                    13209MiB |
|    3   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A    926416      C   tritonserver                    13209MiB |
+-----------------------------------------------------------------------------+