NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Pods are not scheduled in all GPUs of a physical server. #328

Closed shan100github closed 2 years ago

shan100github commented 2 years ago

Description

With the hardware configuration below, while trying to deploy the NVIDIA Triton service with 4 replicas on this server (1 GPU per replica), 3 pods were running but the 4th pod did not spin up, and the following error was displayed.

FailedScheduling    pod/model-2-79d7d6786c-bprm8    0/1 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {protect: no_schedule}, that the pod didn't tolerate, 1 node(s) didn't match Pod's node affinity/selector.

Information about the environment

Deployment file used:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: model-infr
  name: model-2
spec:
  replicas: 4
  selector:
    matchLabels:
      app: model-infr
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: model-infr
    spec:
      containers:
      - args:
        - pip3 install opencv-python-headless && tritonserver --model-store=s3://model-infr/
        command:
        - /bin/sh
        - -c
        image: nvcr.io/nvidia/tritonserver:22.06-py3
        imagePullPolicy: IfNotPresent
        name: tritonserver
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        - containerPort: 8001
          name: grpc
          protocol: TCP
        - containerPort: 8002
          name: metrics
          protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: "1"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-RTX-A5000
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir:
          medium: Memory
        name: dshm

While checking nvidia-smi on the actual system, I got the output below, which clearly shows that GPU 0 was free to schedule.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54       Driver Version: 510.54       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    Off  | 00000000:31:00.0 Off |                  Off |
| 30%   46C    P8    22W / 230W |     10MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    Off  | 00000000:4B:00.0 Off |                  Off |
| 48%   76C    P2   149W / 230W |  13222MiB / 24564MiB |     29%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000    Off  | 00000000:B1:00.0 Off |                  Off |
| 52%   81C    P2   166W / 230W |  13222MiB / 24564MiB |     57%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000    Off  | 00000000:CA:00.0 Off |                  Off |
| 52%   80C    P2   176W / 230W |  13222MiB / 24564MiB |     66%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A    911402      C   tritonserver                    13209MiB |
|    2   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A    911401      C   tritonserver                    13209MiB |
|    3   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A    911223      C   tritonserver                    13209MiB |
+-----------------------------------------------------------------------------+

Expected behavior

Expecting the Triton pods to be scheduled across all 4 GPUs.

Common error checking:

 - [x] The k8s-device-plugin container logs

2022/08/11 06:57:10 Starting Plugins.
2022/08/11 06:57:10 Loading configuration.
2022/08/11 06:57:10 Initializing NVML.
2022/08/11 06:57:10 Updating config with default resource matching patterns.
2022/08/11 06:57:10 Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "uuid"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
2022/08/11 06:57:10 Retreiving plugins.
2022/08/11 06:57:10 No MIG devices found. Falling back to mig.strategy=none
2022/08/11 06:57:10 Starting GRPC server for 'nvidia.com/gpu'
2022/08/11 06:57:10 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/08/11 06:57:10 Registered device plugin for 'nvidia.com/gpu' with Kubelet



 - [x] The kubelet logs on the node (e.g: `sudo journalctl -r -u kubelet`)
No specific errors are logged with respect to this pod schedule or NVIDIA.

Additional information that might help better understand your environment and reproduce the bug:
 - [x] Kubelet version from `kubelet version`: Kubernetes v1.21.14+rke2r1
 - [x] Containerd version: containerd github.com/k3s-io/containerd v1.4.13-k3s1 04203d2174f8b8d05bcec98000fba67c0aa69223
klueska commented 2 years ago

From the error message, it appears that you have a taint set up on one of your nodes that your pod isn't tolerating, so the Kubernetes scheduler is blocking it from being scheduled there.

You should check what taints are applied on the node where the pod is not starting and make sure you either (1) add a toleration for that taint in your pod spec, or (2) remove the taint from the node.
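
For reference, a minimal toleration for that taint would look something like the following in the pod template spec (a sketch; it assumes the taint's effect is NoSchedule, which the scheduler message above does not spell out):

      tolerations:
      # Tolerate the taint reported by the scheduler: key "protect", value "no_schedule".
      - key: "protect"
        operator: "Equal"
        value: "no_schedule"
        effect: "NoSchedule"

Alternatively, the taint can be removed from the node (note the trailing "-"):

kubectl taint nodes <node-name> protect=no_schedule:NoSchedule-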

shan100github commented 2 years ago

I'm wondering: this server already has 1 GPU that is not being used, so why isn't the pod being scheduled on this server itself?

klueska commented 2 years ago

Because of the taint, as I mentioned before. Apparently your other nodes don’t have this taint set, but the one where the GPU is not being scheduled does.

shan100github commented 2 years ago

The tainted node is the RTX 4000 GPU node, whereas this deployment specifies the following node selector, and from the nvidia-smi output it's evident there is 1 RTX A5000 GPU that has not yet been scheduled with any workload.

      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-RTX-A5000

I have posted the output of nvidia-smi -a for GPU 0. Is there anything suspect with this GPU or its settings? Also, I have noticed that even when only 1 pod is scheduled on this node, it gets scheduled on GPU 1, even though GPU 0 is free.

klueska commented 2 years ago

As far as I can tell from your pod spec you don’t have a toleration set though. Pods will only land on nodes with taints set if they have a toleration for that taint (independent of their node selector).

shan100github commented 2 years ago

There is no taint set on this node with 4x RTX A5000 GPUs. From the same deployment with 4 replicas, 3 pods are scheduled on this node successfully through the nodeSelector. I'm wondering why the remaining pod is not scheduled on GPU 0.

That's why I was looking for information on anything suspect in the settings of GPU 0.

klueska commented 2 years ago

Sorry I misunderstood. I thought you had 4 machines each with 1 GPU and one of them wasn’t getting the pod scheduled on it.

So backing up…

What does the output of `kubectl describe node` show for this node in terms of how many GPUs it thinks it has (both in Capacity and Allocatable)?
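
A quick way to pull just those fields (a sketch, assuming the node name node-agent-4 that appears later in this thread):

kubectl describe node node-agent-4 | grep -A 8 -E '^(Capacity|Allocatable):'
kubectl get node node-agent-4 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}{"\n"}'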

shan100github commented 2 years ago

On the actual system, executing nvidia-smi shows 4 GPUs, and in `kubectl describe nodes node-agent-4` the label shows nvidia.com/gpu.count=4.

Name:               node-agent-4
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                   beta.kubernetes.io/instance-type=rke2
                   beta.kubernetes.io/os=linux
                   feature.node.kubernetes.io/cpu-cpuid.ADX=true
                   feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512BITALG=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512IFMA=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI2=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
                   feature.node.kubernetes.io/cpu-cpuid.AVX512VPOPCNTDQ=true
                   feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                   feature.node.kubernetes.io/cpu-cpuid.GFNI=true
                   feature.node.kubernetes.io/cpu-cpuid.IBPB=true
                   feature.node.kubernetes.io/cpu-cpuid.SHA=true
                   feature.node.kubernetes.io/cpu-cpuid.STIBP=true
                   feature.node.kubernetes.io/cpu-cpuid.VAES=true
                   feature.node.kubernetes.io/cpu-cpuid.VMX=true
                   feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ=true
                   feature.node.kubernetes.io/cpu-cpuid.WBNOINVD=true
                   feature.node.kubernetes.io/cpu-hardware_multithreading=true
                   feature.node.kubernetes.io/cpu-rdt.RDTCMT=true
                   feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true
                   feature.node.kubernetes.io/cpu-rdt.RDTMBA=true
                   feature.node.kubernetes.io/cpu-rdt.RDTMBM=true
                   feature.node.kubernetes.io/cpu-rdt.RDTMON=true
                   feature.node.kubernetes.io/custom-rdma.available=true
                   feature.node.kubernetes.io/kernel-config.NO_HZ=true
                   feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
                   feature.node.kubernetes.io/kernel-version.full=5.4.0-124-generic
                   feature.node.kubernetes.io/kernel-version.major=5
                   feature.node.kubernetes.io/kernel-version.minor=4
                   feature.node.kubernetes.io/kernel-version.revision=0
                   feature.node.kubernetes.io/memory-numa=true
                   feature.node.kubernetes.io/network-sriov.capable=true
                   feature.node.kubernetes.io/pci-10de.present=true
                   feature.node.kubernetes.io/pci-1a03.present=true
                   feature.node.kubernetes.io/pci-8086.present=true
                   feature.node.kubernetes.io/pci-8086.sriov.capable=true
                   feature.node.kubernetes.io/storage-nonrotationaldisk=true
                   feature.node.kubernetes.io/system-os_release.ID=ubuntu
                   feature.node.kubernetes.io/system-os_release.VERSION_ID=20.04
                   feature.node.kubernetes.io/system-os_release.VERSION_ID.major=20
                   feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
                   feature.node.kubernetes.io/usb-ef_0b1f_03ee.present=true
                   kubernetes.io/arch=amd64
                   kubernetes.io/hostname=sbyo-cube-pro-4u-1
                   kubernetes.io/os=linux
                   node.kubernetes.io/instance-type=rke2
                   nvidia.com/cuda.driver.major=510
                   nvidia.com/cuda.driver.minor=54
                   nvidia.com/cuda.driver.rev=
                   nvidia.com/cuda.runtime.major=11
                   nvidia.com/cuda.runtime.minor=7
                   nvidia.com/gfd.timestamp=1660123937
                   nvidia.com/gpu.compute.major=8
                   nvidia.com/gpu.compute.minor=6
                   nvidia.com/gpu.count=4
                   nvidia.com/gpu.deploy.container-toolkit=true
                   nvidia.com/gpu.deploy.dcgm=true
                   nvidia.com/gpu.deploy.dcgm-exporter=true
                   nvidia.com/gpu.deploy.device-plugin=true
                   nvidia.com/gpu.deploy.driver=true
                   nvidia.com/gpu.deploy.gpu-feature-discovery=true
                   nvidia.com/gpu.deploy.node-status-exporter=true
                   nvidia.com/gpu.deploy.operator-validator=true
                   nvidia.com/gpu.family=ampere
                   nvidia.com/gpu.machine=SYS-740GP-TNRT
                   nvidia.com/gpu.memory=25757220864
                   nvidia.com/gpu.present=true
                   nvidia.com/gpu.product=NVIDIA-RTX-A5000
                   nvidia.com/gpu.replicas=1
                   nvidia.com/mig.strategy=single
klueska commented 2 years ago

Those are just the labels applied by GFD. I want to know what the plugin has advertised to the kubelet and what the kubelet currently sees as the Capacity and Allocatable of the `nvidia.com/gpu` resource type. Also the currently allocated GPUs, which should be available if you run `kubectl describe node` on the node.

shan100github commented 2 years ago

Following is the information about Allocated resources.

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                16910m (49%)   8400m (24%)
  memory             18910Mi (14%)  18810Mi (14%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
  nvidia.com/gpu     4              4
klueska commented 2 years ago

This still isn't showing me "Capacity" and "Allocatable" of the resource type.

shan100github commented 2 years ago
Capacity:
  cpu:                34
  ephemeral-storage:  1921208612Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131619000Ki
  nvidia.com/gpu:     4
  pods:               110
Allocatable:
  cpu:                34
  ephemeral-storage:  1868951736288
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131619000Ki
  nvidia.com/gpu:     4
  pods:               110
klueska commented 2 years ago

What it is showing me, though, is that all 4 GPUs are currently assigned to pods. Can you show me the set of pods you have running? Is there a rogue one consuming a GPU somewhere that isn't part of your deployment?
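
To see every pod currently scheduled on that node, something like the following should work (a sketch, assuming the node name node-agent-4):

kubectl get pods -A --field-selector spec.nodeName=node-agent-4 -o wide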

shan100github commented 2 years ago

nvidia-smi output.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54       Driver Version: 510.54       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    Off  | 00000000:31:00.0 Off |                  Off |
| 30%   44C    P8    20W / 230W |     10MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    Off  | 00000000:4B:00.0 Off |                  Off |
| 30%   42C    P8    18W / 230W |  13222MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000    Off  | 00000000:B1:00.0 Off |                  Off |
| 30%   44C    P8    16W / 230W |  13222MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000    Off  | 00000000:CA:00.0 Off |                  Off |
| 30%   44C    P8    18W / 230W |  13222MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A    923258      C   tritonserver                    13209MiB |
|    2   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A    923768      C   tritonserver                    13209MiB |
|    3   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A    926416      C   tritonserver                    13209MiB |
+-----------------------------------------------------------------------------+

I don't see the tritonserver process on GPU 0.

shan100github commented 2 years ago

The following pods are running on this server.

  Namespace                   Name                                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                       ------------  ----------  ---------------  -------------  ---
  cattle-fleet-system         gitjob-cc9948fd7-qlrbc                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  cattle-monitoring-system    loki-0                                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         28h
  cattle-monitoring-system    loki-promtail-7mdgj                                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         28h
  cattle-monitoring-system    pushprox-kube-controller-manager-proxy-58f5d844c6-x29m6    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  cattle-monitoring-system    pushprox-kube-etcd-proxy-57df468748-zrmbx                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  cattle-monitoring-system    pushprox-kube-proxy-client-sdkv2                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  cattle-monitoring-system    pushprox-kube-proxy-proxy-78b4b985d4-b8d9g                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  cattle-monitoring-system    rancher-monitoring-kube-state-metrics-5bc8bb48bd-w22xl     100m (0%)     100m (0%)   130Mi (0%)       200Mi (0%)     2d8h
  cattle-monitoring-system    rancher-monitoring-prometheus-node-exporter-mbqrx          100m (0%)     200m (0%)   30Mi (0%)        50Mi (0%)      3d5h
  cattle-system               rancher-644bc45f4c-6tsv2                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  default                     model-0-5fb7c59b5c-b779l                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         33h
  default                     model-0-686f46547c-nthpd                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         36h
  gpu-operator                gpu-feature-discovery-zfq4r                                0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  gpu-operator                gpu-operator-node-feature-discovery-worker-zvhvn           0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  gpu-operator                nvidia-container-toolkit-daemonset-krrgh                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  gpu-operator                nvidia-dcgm-exporter-9jk7t                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  gpu-operator                nvidia-device-plugin-daemonset-k9cvz                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  gpu-operator                nvidia-operator-validator-sl5jz                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  kube-system                 cilium-dtkhx                                               100m (0%)     0 (0%)      100Mi (0%)       0 (0%)         3d5h
  kube-system                 cilium-node-init-ks925                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  kube-system                 kube-proxy-sbyo-cube-pro-4u-1                              250m (0%)     0 (0%)      0 (0%)           0 (0%)         2d7h
  kube-system                 kube-vip-ds-c7wbk                                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d12h
  kube-system                 rke2-coredns-rke2-coredns-6775f768c8-kwzf8                 100m (0%)     100m (0%)   128Mi (0%)       128Mi (0%)     2d8h
  kube-system                 rke2-ingress-nginx-controller-fkqzh                        100m (0%)     0 (0%)      90Mi (0%)        0 (0%)         3d5h
  kube-system                 rke2-metrics-server-8574659c85-wmtxh                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  locust                      locust-master-67568cdf46-59xw7                             1 (2%)        1 (2%)      4Gi (3%)         4Gi (3%)       36h
  locust                      locust-worker-f9b59d8fb-4lkg7                              1 (2%)        1 (2%)      2Gi (1%)         2Gi (1%)       30h
  locust                      locust-worker-f9b59d8fb-6vmrr                              1 (2%)        1 (2%)      2Gi (1%)         2Gi (1%)       30h
  locust                      locust-worker-f9b59d8fb-bznjq                              1 (2%)        1 (2%)      2Gi (1%)         2Gi (1%)       30h
  locust                      locust-worker-f9b59d8fb-d7qpx                              1 (2%)        1 (2%)      2Gi (1%)         2Gi (1%)       35h
  locust                      locust-worker-f9b59d8fb-fhkkv                              1 (2%)        1 (2%)      2Gi (1%)         2Gi (1%)       30h
  locust                      locust-worker-f9b59d8fb-lsn9k                              1 (2%)        1 (2%)      2Gi (1%)         2Gi (1%)       30h
  locust                      locust-worker-f9b59d8fb-qdqqh                              1 (2%)        1 (2%)      2Gi (1%)         2Gi (1%)       30h
  locust                      model-0-54f8d6c9bd-466d9                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         29h
  locust                      model-0-54f8d6c9bd-m4g2v                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         29h
  locust                      model-0-54f8d6c9bd-scpvb                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         29h
  longhorn-system             csi-attacher-8b4cc9cf6-6xx8j                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  longhorn-system             csi-provisioner-59b7b8b7b8-dmrln                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  longhorn-system             csi-resizer-68ccff94-5m5jk                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  longhorn-system             csi-snapshotter-6d7d679c98-np7vk                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
  longhorn-system             engine-image-ei-d474e07c-vv5rr                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  longhorn-system             instance-manager-e-8f9d237c                                4080m (12%)   0 (0%)      0 (0%)           0 (0%)         3d5h
  longhorn-system             instance-manager-r-280a2608                                4080m (12%)   0 (0%)      0 (0%)           0 (0%)         3d5h
  longhorn-system             longhorn-csi-plugin-rmjsr                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  longhorn-system             longhorn-manager-5cqr8                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  longhorn-system             longhorn-ui-556866b6bb-6jrl4                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d8h
klueska commented 2 years ago

Regardless of whether the triton server is running on the GPU or not, some pod must have requested / been given access to all 4 GPUs, otherwise we wouldn't see all 4 of them as Allocated in the output of describe node.

What does this show for that node:

kubectl describe pod -A | grep "nvidia.com/gpu"
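
A variant that also shows which pod each GPU limit belongs to (a sketch, assuming the node name node-agent-4 and that GPUs are requested via limits, as in the deployment above):

kubectl get pods -A --field-selector spec.nodeName=node-agent-4 \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{": "}{.spec.containers[*].resources.limits.nvidia\.com/gpu}{"\n"}{end}'

Any line with a non-empty value after the colon is a pod holding a GPU, whether or not a process is currently running on it.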
shan100github commented 2 years ago

Following is the output

kubectl describe pod -A | grep 5000 
Node-Selectors:              nvidia.com/gpu.product=NVIDIA-RTX-A5000
--
Node-Selectors:              nvidia.com/gpu.product=NVIDIA-RTX-A5000
--
Node-Selectors:              nvidia.com/gpu.product=NVIDIA-RTX-A5000
--
Node-Selectors:              nvidia.com/gpu.product=NVIDIA-RTX-A5000
shan100github commented 2 years ago

Also, 1 pod from the above `kubectl describe pod` output has the following event logged, whereas the other 3 pods are scheduled on the expected server through the nodeSelector.

Node-Selectors:              nvidia.com/gpu.product=NVIDIA-RTX-A5000
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  4m41s (x4 over 8m2s)  default-scheduler  0/6 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {protect: no_schedule}, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
klueska commented 2 years ago

I'm not worried about the node selector, I want to see which pods have nvidia.com/gpu resources attached to them.

From all of the evidence I see so far, nothing is operating incorrectly. You just seem to have 1 GPU already allocated to some other pod on that node, so only 3 of them get assigned to your triton-server deployment.

shan100github commented 2 years ago

I think the following output will give better info:

kubectl describe pod -n locust | egrep "nvidia.com|Node:"
Node:         agent-node-4/192.142.122.4
      nvidia.com/gpu:  1
      nvidia.com/gpu:  1
Node-Selectors:              nvidia.com/gpu.product=NVIDIA-RTX-A5000
Node:           <none>
      nvidia.com/gpu:  1
      nvidia.com/gpu:  1
Node-Selectors:              nvidia.com/gpu.product=NVIDIA-RTX-A5000
  Warning  FailedScheduling  23m                default-scheduler  0/6 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {protect: no_schedule}, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
Node:         agent-node-4/192.142.122.4
      nvidia.com/gpu:  1
      nvidia.com/gpu:  1
Node-Selectors:              nvidia.com/gpu.product=NVIDIA-RTX-A5000
Node:         agent-node-4/192.142.122.4
      nvidia.com/gpu:  1
      nvidia.com/gpu:  1
Node-Selectors:              nvidia.com/gpu.product=NVIDIA-RTX-A5000
Node:         agent-node-4/192.142.122.4
      nvidia.com/gpu:  1
      nvidia.com/gpu:  1

Also, https://github.com/NVIDIA/k8s-device-plugin/issues/328#issuecomment-1214162181 shows all the pods scheduled on the node. Per your assumption, if 1 GPU is allocated to some other process, why is it not displayed in the nvidia-smi output?

shan100github commented 2 years ago

Also, the nvidia-smi -a output for GPU 0 displays only the following processes.

    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1580
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 4 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2231
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 4 MiB

whereas for GPU 1 it shows the following:

    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1580
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 4 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2231
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 4 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 923258
            Type                          : C
            Name                          : tritonserver
            Used GPU Memory               : 13209 MiB
klueska commented 2 years ago

Just because a GPU has been allocated to a container doesn't mean it is running anything on it, in which case nvidia-smi won't help.

In your query above you limited the output to the locust namespace, but what are these pods:

  default                     model-0-5fb7c59b5c-b779l                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         33h
  default                     model-0-686f46547c-nthpd                                   0 (0%)        0 (0%) 

Is it possible that one of them has grabbed hold of a GPU on this node?

shan100github commented 2 years ago

The pods in the default namespace were crash-looping because the S3 creds were not passed. Even if a pod is crash-looping, is it possible for it to grab a GPU?

~Let me check the hardware by next week.~ Thanks for commenting @klueska

klueska commented 2 years ago

Yes. Allocation of the GPU happens at scheduling time. So if it's crash-looping then it's already been scheduled. And if it asked for a GPU then it is reserved for that pod and not available for anyone else (even if it's crashing).
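
So a Pending or crash-looping pod that requested nvidia.com/gpu still counts against the node's Allocatable until it is removed. A sketch of the cleanup, using the two default-namespace pods listed above (deleting a Deployment-managed pod only recreates it, so scaling down or deleting the owning Deployment is usually what actually frees the GPU; the Deployment name below is a placeholder):

# The crash-looping pods seen earlier in the default namespace.
kubectl -n default get pods model-0-5fb7c59b5c-b779l model-0-686f46547c-nthpd

# Find the owning Deployment, then scale it to 0 (or delete it) to release the GPU.
kubectl -n default get deploy
kubectl -n default scale deploy/<model-0-deployment> --replicas=0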

shan100github commented 2 years ago

Thank you for your comments and for sharing, @klueska. After deleting those crash-looping pods, the remaining pod was allocated to GPU 0.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54       Driver Version: 510.54       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    Off  | 00000000:31:00.0 Off |                  Off |
| 30%   45C    P8    25W / 230W |  12626MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    Off  | 00000000:4B:00.0 Off |                  Off |
| 30%   42C    P8    17W / 230W |  13222MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000    Off  | 00000000:B1:00.0 Off |                  Off |
| 30%   44C    P8    16W / 230W |  13222MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000    Off  | 00000000:CA:00.0 Off |                  Off |
| 30%   44C    P8    23W / 230W |  13222MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A   2463377      C   tritonserver                    12613MiB |
|    1   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A    923258      C   tritonserver                    13209MiB |
|    2   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A    923768      C   tritonserver                    13209MiB |
|    3   N/A  N/A      1580      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      2231      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A    926416      C   tritonserver                    13209MiB |
+-----------------------------------------------------------------------------+