From the error message, it appears that you have a taint set up on one of your nodes that your pod isn't tolerating, so the Kubernetes scheduler is blocking it from being scheduled there.
You should check what taints are applied on the node where the pod is not starting and make sure you either (1) add a toleration for that taint in your pod spec, or (2) remove the taint from the node.
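For example, assuming the taint turns out to be something like protect=no_schedule with effect NoSchedule (check the actual key, value, and effect under "Taints:" in kubectl describe node), a toleration in the pod spec would look roughly like this:

tolerations:
- key: "protect"        # assumed taint key; use whatever the node actually reports
  operator: "Equal"
  value: "no_schedule"  # assumed taint value
  effect: "NoSchedule"  # assumed taint effect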
I wonder, though: this server already has 1 GPU that is not used, so why isn't the pod being scheduled on this server itself?
Because of the taint, as I mentioned before. Apparently your other nodes don’t have this taint set, but the one where the GPU is not being scheduled does.
The tainted node is the RTX4000 GPU node, whereas in this deployment I have specified the following node selector, and from the nvidia-smi output it's evident we have 1 RTX A5000 GPU that is not yet scheduled with any workload.
nodeSelector:
nvidia.com/gpu.product: NVIDIA-RTX-A5000
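For context, this selector sits in the deployment's pod template together with a 1-GPU limit per replica, roughly like the sketch below (container name and image tag are placeholders, not the actual manifest):

spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-RTX-A5000
  containers:
  - name: tritonserver                             # placeholder container name
    image: nvcr.io/nvidia/tritonserver:22.07-py3   # placeholder image tag
    resources:
      limits:
        nvidia.com/gpu: 1                          # each replica requests one whole GPU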
I have posted the output of nvidia-smi -a for GPU 0. Is there anything suspect with this GPU or its settings?
Also, I have noticed that even if only 1 pod is scheduled on this node, it gets scheduled on GPU 1, though GPU 0 is free.
As far as I can tell from your pod spec you don’t have a toleration set though. Pods will only land on nodes with taints set if they have a toleration for that taint (independent of their node selector).
On this node with 4 x RTX A5000 GPUs, there is no taint set.
And from the same deployment with 4 replicas, 3 pods are scheduled on this node successfully through the nodeSelector.
I wonder why the remaining pod is not scheduled on GPU 0.
That's the reason I was looking for anything suspect in the GPU settings of GPU 0.
Sorry I misunderstood. I thought you had 4 machines each with 1 GPU and one of them wasn’t getting the pod scheduled on it.
So backing up…
What does the output of 'kubectl describe node' show for this node in terms of how many GPUs it thinks it has (both in Capacity and Allocatable)?
On the actual system, when I execute nvidia-smi it shows 4 GPUs, and in kubectl describe node node-agent-4 the label shows nvidia.com/gpu.count=4:
Name: node-agent-4
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=rke2
beta.kubernetes.io/os=linux
feature.node.kubernetes.io/cpu-cpuid.ADX=true
feature.node.kubernetes.io/cpu-cpuid.AESNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX=true
feature.node.kubernetes.io/cpu-cpuid.AVX2=true
feature.node.kubernetes.io/cpu-cpuid.AVX512BITALG=true
feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
feature.node.kubernetes.io/cpu-cpuid.AVX512IFMA=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI2=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VPOPCNTDQ=true
feature.node.kubernetes.io/cpu-cpuid.FMA3=true
feature.node.kubernetes.io/cpu-cpuid.GFNI=true
feature.node.kubernetes.io/cpu-cpuid.IBPB=true
feature.node.kubernetes.io/cpu-cpuid.SHA=true
feature.node.kubernetes.io/cpu-cpuid.STIBP=true
feature.node.kubernetes.io/cpu-cpuid.VAES=true
feature.node.kubernetes.io/cpu-cpuid.VMX=true
feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ=true
feature.node.kubernetes.io/cpu-cpuid.WBNOINVD=true
feature.node.kubernetes.io/cpu-hardware_multithreading=true
feature.node.kubernetes.io/cpu-rdt.RDTCMT=true
feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true
feature.node.kubernetes.io/cpu-rdt.RDTMBA=true
feature.node.kubernetes.io/cpu-rdt.RDTMBM=true
feature.node.kubernetes.io/cpu-rdt.RDTMON=true
feature.node.kubernetes.io/custom-rdma.available=true
feature.node.kubernetes.io/kernel-config.NO_HZ=true
feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
feature.node.kubernetes.io/kernel-version.full=5.4.0-124-generic
feature.node.kubernetes.io/kernel-version.major=5
feature.node.kubernetes.io/kernel-version.minor=4
feature.node.kubernetes.io/kernel-version.revision=0
feature.node.kubernetes.io/memory-numa=true
feature.node.kubernetes.io/network-sriov.capable=true
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-1a03.present=true
feature.node.kubernetes.io/pci-8086.present=true
feature.node.kubernetes.io/pci-8086.sriov.capable=true
feature.node.kubernetes.io/storage-nonrotationaldisk=true
feature.node.kubernetes.io/system-os_release.ID=ubuntu
feature.node.kubernetes.io/system-os_release.VERSION_ID=20.04
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=20
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
feature.node.kubernetes.io/usb-ef_0b1f_03ee.present=true
kubernetes.io/arch=amd64
kubernetes.io/hostname=sbyo-cube-pro-4u-1
kubernetes.io/os=linux
node.kubernetes.io/instance-type=rke2
nvidia.com/cuda.driver.major=510
nvidia.com/cuda.driver.minor=54
nvidia.com/cuda.driver.rev=
nvidia.com/cuda.runtime.major=11
nvidia.com/cuda.runtime.minor=7
nvidia.com/gfd.timestamp=1660123937
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.compute.minor=6
nvidia.com/gpu.count=4
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=true
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.family=ampere
nvidia.com/gpu.machine=SYS-740GP-TNRT
nvidia.com/gpu.memory=25757220864
nvidia.com/gpu.present=true
nvidia.com/gpu.product=NVIDIA-RTX-A5000
nvidia.com/gpu.replicas=1
nvidia.com/mig.strategy=single
Those are just the labels applied by GFD. I want to know what the plugin has advertised to the kubelet and what the kubelet currently sees as the Capacity and Allocatable of the 'nvidia.com/gpu' resource type. Also the currently allocated GPUs, which should be available if you run 'kubectl describe node' on the node.
The following is the information about Allocated resources:
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 16910m (49%) 8400m (24%)
memory 18910Mi (14%) 18810Mi (14%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 4 4
This still isn’t showing me „Capacity“ and „Allocatable“ of the resource type.
Capacity:
cpu: 34
ephemeral-storage: 1921208612Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 131619000Ki
nvidia.com/gpu: 4
pods: 110
Allocatable:
cpu: 34
ephemeral-storage: 1868951736288
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 131619000Ki
nvidia.com/gpu: 4
pods: 110
What it is showing me, though, is that all 4 GPUs are currently assigned to pods. Can you show me the set of pods you have running? Is there a rogue one consuming a GPU somewhere that isn't part of your deployment?
nvidia-smi output.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54 Driver Version: 510.54 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A5000 Off | 00000000:31:00.0 Off | Off |
| 30% 44C P8 20W / 230W | 10MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 Off | 00000000:4B:00.0 Off | Off |
| 30% 42C P8 18W / 230W | 13222MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A5000 Off | 00000000:B1:00.0 Off | Off |
| 30% 44C P8 16W / 230W | 13222MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A5000 Off | 00000000:CA:00.0 Off | Off |
| 30% 44C P8 18W / 230W | 13222MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1580 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2231 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 1580 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2231 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 923258 C tritonserver 13209MiB |
| 2 N/A N/A 1580 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 2231 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 923768 C tritonserver 13209MiB |
| 3 N/A N/A 1580 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 2231 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 926416 C tritonserver 13209MiB |
+-----------------------------------------------------------------------------+
I don't see the tritonserver process on GPU 0.
The following pods are running on this server:
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
cattle-fleet-system gitjob-cc9948fd7-qlrbc 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d8h
cattle-monitoring-system loki-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28h
cattle-monitoring-system loki-promtail-7mdgj 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28h
cattle-monitoring-system pushprox-kube-controller-manager-proxy-58f5d844c6-x29m6 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d8h
cattle-monitoring-system pushprox-kube-etcd-proxy-57df468748-zrmbx 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d8h
cattle-monitoring-system pushprox-kube-proxy-client-sdkv2 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d5h
cattle-monitoring-system pushprox-kube-proxy-proxy-78b4b985d4-b8d9g 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d8h
cattle-monitoring-system rancher-monitoring-kube-state-metrics-5bc8bb48bd-w22xl 100m (0%) 100m (0%) 130Mi (0%) 200Mi (0%) 2d8h
cattle-monitoring-system rancher-monitoring-prometheus-node-exporter-mbqrx 100m (0%) 200m (0%) 30Mi (0%) 50Mi (0%) 3d5h
cattle-system rancher-644bc45f4c-6tsv2 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d8h
default model-0-5fb7c59b5c-b779l 0 (0%) 0 (0%) 0 (0%) 0 (0%) 33h
default model-0-686f46547c-nthpd 0 (0%) 0 (0%) 0 (0%) 0 (0%) 36h
gpu-operator gpu-feature-discovery-zfq4r 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d5h
gpu-operator gpu-operator-node-feature-discovery-worker-zvhvn 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d5h
gpu-operator nvidia-container-toolkit-daemonset-krrgh 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d5h
gpu-operator nvidia-dcgm-exporter-9jk7t 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d5h
gpu-operator nvidia-device-plugin-daemonset-k9cvz 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d5h
gpu-operator nvidia-operator-validator-sl5jz 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d5h
kube-system cilium-dtkhx 100m (0%) 0 (0%) 100Mi (0%) 0 (0%) 3d5h
kube-system cilium-node-init-ks925 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d5h
kube-system kube-proxy-sbyo-cube-pro-4u-1 250m (0%) 0 (0%) 0 (0%) 0 (0%) 2d7h
kube-system kube-vip-ds-c7wbk 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d12h
kube-system rke2-coredns-rke2-coredns-6775f768c8-kwzf8 100m (0%) 100m (0%) 128Mi (0%) 128Mi (0%) 2d8h
kube-system rke2-ingress-nginx-controller-fkqzh 100m (0%) 0 (0%) 90Mi (0%) 0 (0%) 3d5h
kube-system rke2-metrics-server-8574659c85-wmtxh 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d8h
locust locust-master-67568cdf46-59xw7 1 (2%) 1 (2%) 4Gi (3%) 4Gi (3%) 36h
locust locust-worker-f9b59d8fb-4lkg7 1 (2%) 1 (2%) 2Gi (1%) 2Gi (1%) 30h
locust locust-worker-f9b59d8fb-6vmrr 1 (2%) 1 (2%) 2Gi (1%) 2Gi (1%) 30h
locust locust-worker-f9b59d8fb-bznjq 1 (2%) 1 (2%) 2Gi (1%) 2Gi (1%) 30h
locust locust-worker-f9b59d8fb-d7qpx 1 (2%) 1 (2%) 2Gi (1%) 2Gi (1%) 35h
locust locust-worker-f9b59d8fb-fhkkv 1 (2%) 1 (2%) 2Gi (1%) 2Gi (1%) 30h
locust locust-worker-f9b59d8fb-lsn9k 1 (2%) 1 (2%) 2Gi (1%) 2Gi (1%) 30h
locust locust-worker-f9b59d8fb-qdqqh 1 (2%) 1 (2%) 2Gi (1%) 2Gi (1%) 30h
locust model-0-54f8d6c9bd-466d9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 29h
locust model-0-54f8d6c9bd-m4g2v 0 (0%) 0 (0%) 0 (0%) 0 (0%) 29h
locust model-0-54f8d6c9bd-scpvb 0 (0%) 0 (0%) 0 (0%) 0 (0%) 29h
longhorn-system csi-attacher-8b4cc9cf6-6xx8j 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d8h
longhorn-system csi-provisioner-59b7b8b7b8-dmrln 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d8h
longhorn-system csi-resizer-68ccff94-5m5jk 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d8h
longhorn-system csi-snapshotter-6d7d679c98-np7vk 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d8h
longhorn-system engine-image-ei-d474e07c-vv5rr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d5h
longhorn-system instance-manager-e-8f9d237c 4080m (12%) 0 (0%) 0 (0%) 0 (0%) 3d5h
longhorn-system instance-manager-r-280a2608 4080m (12%) 0 (0%) 0 (0%) 0 (0%) 3d5h
longhorn-system longhorn-csi-plugin-rmjsr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d5h
longhorn-system longhorn-manager-5cqr8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d5h
longhorn-system longhorn-ui-556866b6bb-6jrl4 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d8h
Regardless of whether the triton server is running on the GPU or not, some pod must have requested / been given access to all 4 GPUs, otherwise we wouldn't see all 4 of them as Allocated in the output of describe node.
What does this show for that node:
kubectl describe pod -A | grep "nvidia.com/gpu"
The following is the output:
kubectl describe pod -A | grep 5000
Node-Selectors: nvidia.com/gpu.product=NVIDIA-RTX-A5000
--
Node-Selectors: nvidia.com/gpu.product=NVIDIA-RTX-A5000
--
Node-Selectors: nvidia.com/gpu.product=NVIDIA-RTX-A5000
--
Node-Selectors: nvidia.com/gpu.product=NVIDIA-RTX-A5000
Also, 1 pod from the above pod describe has the following event logged, whereas the other 3 pods are scheduled on the expected server through the nodeSelector.
Node-Selectors: nvidia.com/gpu.product=NVIDIA-RTX-A5000
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 4m41s (x4 over 8m2s) default-scheduler 0/6 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {protect: no_schedule}, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
I'm not worried about the node selector; I want to see which pods have nvidia.com/gpu resources attached to them.
From all of the evidence I see so far, nothing is operating incorrectly. You just seem to have 1 GPU already allocated to some other pod on that node, so only 3 of the 4 get assigned to your triton-server deployment.
I think the following output would give better info:
kubectl describe pod -n locust | egrep "nvidia.com|Node:"
Node: agent-node-4/192.142.122.4
nvidia.com/gpu: 1
nvidia.com/gpu: 1
Node-Selectors: nvidia.com/gpu.product=NVIDIA-RTX-A5000
Node: <none>
nvidia.com/gpu: 1
nvidia.com/gpu: 1
Node-Selectors: nvidia.com/gpu.product=NVIDIA-RTX-A5000
Warning FailedScheduling 23m default-scheduler 0/6 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {protect: no_schedule}, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
Node: agent-node-4/192.142.122.4
nvidia.com/gpu: 1
nvidia.com/gpu: 1
Node-Selectors: nvidia.com/gpu.product=NVIDIA-RTX-A5000
Node: agent-node-4/192.142.122.4
nvidia.com/gpu: 1
nvidia.com/gpu: 1
Node-Selectors: nvidia.com/gpu.product=NVIDIA-RTX-A5000
Node: agent-node-4/192.142.122.4
nvidia.com/gpu: 1
nvidia.com/gpu: 1
Also, in https://github.com/NVIDIA/k8s-device-plugin/issues/328#issuecomment-1214162181 all the pods scheduled on the node are listed. I wonder, as per your assumption, if 1 GPU is allocated to some other process, why is it not displayed in the nvidia-smi output?
Also, the nvidia-smi -a output for GPU 0 displays only the following processes,
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 1580
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 4 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 2231
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 4 MiB
whereas for GPU 1 it displays the following:
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 1580
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 4 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 2231
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 4 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 923258
Type : C
Name : tritonserver
Used GPU Memory : 13209 MiB
Just because a GPU has been allocated to a container doesn't mean it is running anything on it, in which case nvidia-smi won't help.
In your query above you limited the output to the locust namespace, but what are these pods:
default model-0-5fb7c59b5c-b779l 0 (0%) 0 (0%) 0 (0%) 0 (0%) 33h
default model-0-686f46547c-nthpd 0 (0%) 0 (0%)
Is it possible that one of them has grabbed hold of a GPU on this node?
The pods in the default namespace were crashlooping because the s3 creds were not passed. Even if a pod is crashlooping, is it possible for it to grab a GPU?
~Let me check the hardware by next week.~ Thanks for commenting @klueska
Yes. Allocation of the GPU happens at scheduling time. So if it's crash-looping then it's already been scheduled. And if it asked for a GPU then it is reserved for that pod and not available for anyone else (even if it's crashing).
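In other words, any pod whose spec carries a GPU limit holds that device from the moment it is admitted, and only releases it when the pod is deleted. A minimal illustrative spec (not the actual model-0 manifest) would be:

apiVersion: v1
kind: Pod
metadata:
  name: model-0-example                        # illustrative name
spec:
  containers:
  - name: model
    image: registry.example.com/model:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1                      # stays reserved while the pod exists,
                                               # even if the container is in CrashLoopBackOff

Deleting such a pod frees the device for the scheduler again.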
Thank you for your comments & sharing, @klueska. After deleting those crashlooping pods, GPU 0 got the pod allocation.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54 Driver Version: 510.54 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A5000 Off | 00000000:31:00.0 Off | Off |
| 30% 45C P8 25W / 230W | 12626MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 Off | 00000000:4B:00.0 Off | Off |
| 30% 42C P8 17W / 230W | 13222MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A5000 Off | 00000000:B1:00.0 Off | Off |
| 30% 44C P8 16W / 230W | 13222MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A5000 Off | 00000000:CA:00.0 Off | Off |
| 30% 44C P8 23W / 230W | 13222MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1580 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2231 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2463377 C tritonserver 12613MiB |
| 1 N/A N/A 1580 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2231 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 923258 C tritonserver 13209MiB |
| 2 N/A N/A 1580 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 2231 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 923768 C tritonserver 13209MiB |
| 3 N/A N/A 1580 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 2231 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 926416 C tritonserver 13209MiB |
+-----------------------------------------------------------------------------+
Description
In the hardware configuration below, while trying to deploy the NVIDIA Triton service with 4 replicas on this server, with 1 GPU each, 3 pods were running, the 4th pod was not spinning up, and the following error was displayed.
Information about the environment
Deployment file used:
While checking nvidia-smi on the actual system, I was able to get the output below, which clearly shows GPU 0 was free to schedule.
Expected behavior
Expecting Triton pods to be scheduled on all 4 GPUs.
Common error checking:
nvidia-smi -a on your host

k8s-device-plugin logs:
2022/08/11 06:57:10 Starting Plugins.
2022/08/11 06:57:10 Loading configuration.
2022/08/11 06:57:10 Initializing NVML.
2022/08/11 06:57:10 Updating config with default resource matching patterns.
2022/08/11 06:57:10 Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "uuid"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
2022/08/11 06:57:10 Retreiving plugins.
2022/08/11 06:57:10 No MIG devices found. Falling back to mig.strategy=none
2022/08/11 06:57:10 Starting GRPC server for 'nvidia.com/gpu'
2022/08/11 06:57:10 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/08/11 06:57:10 Registered device plugin for 'nvidia.com/gpu' with Kubelet