Open shashiranjan84 opened 8 months ago
@shashiranjan84 in your time-slicing config you have set replicas: 12, hence that many GPUs are reported as allocatable. Not sure why you think it should be 8?
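For context, with time slicing each physical GPU is advertised replicas times, so the node's nvidia.com/gpu capacity and allocatable both come straight from that setting. A minimal sketch for inspecting the config, assuming the ConfigMap is named time-slicing-config in the gpu-operator namespace (substitute whatever names your cluster actually uses):

# Show the time-slicing config consumed by the device plugin.
kubectl get configmap time-slicing-config -n gpu-operator -o yaml

# Look for sharing.timeSlicing.resources[].replicas in the output; with
# replicas: 12, the node advertises 12 nvidia.com/gpu devices per physical GPU.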
Shouldn't it be showing how many GPUs are left? Then what is the difference between capacity and allocatable? Since 4 GPUs are already allocated, I thought allocatable should be 8, no?
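For reference, capacity is what the device plugin advertises to the kubelet, allocatable is capacity minus any system reservations, and in stock Kubernetes neither figure changes as pods are scheduled; the amount currently requested by pods is reported separately under "Allocated resources" in kubectl describe node. A sketch to view all three, assuming jq is installed and <node-name> is replaced with the real node name:

# Capacity vs. allocatable for the GPU extended resource (both static per node).
kubectl get node <node-name> -o json \
  | jq '.status | {capacity: .capacity["nvidia.com/gpu"], allocatable: .allocatable["nvidia.com/gpu"]}'

# The amount requested by running pods is tracked here instead.
kubectl describe node <node-name> | grep -A 12 'Allocated resources:'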
@shashiranjan84 sorry, I missed that you are running pods using GPUs. Yes, it should have been reflected. @klueska any thoughts?
Assuming the pods below are the ones using GPUs?
eu-west-1-dd-datadog-9fjwr 3/3 Running 0 4h2m
eu-west-1-dd-datadog-cluster-agent-79dbdcdd75-mt97s 1/1 Running 0 4h2m
eu-west-1-dd-datadog-s7527 3/3 Running 0 4h1m
eu-west-1-dd-kube-state-metrics-5b7b7bb44-mz4zm 1/1 Running 0 7d5h
Yes, there is one pod running with GPUs, which can be seen under allocated resources:
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                250m (3%)      0 (0%)
  memory             10310Mi (33%)  10410Mi (33%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
  nvidia.com/gpu     4
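To cross-check which pods account for those 4 requested GPUs, here is a sketch that lists per-pod nvidia.com/gpu requests on the node (jq assumed; <node-name> is a placeholder):

# Sum nvidia.com/gpu requests per pod scheduled on the node and
# print only the pods that request at least one GPU.
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(.spec.nodeName == "<node-name>")
      | {name: .metadata.name,
         gpus: ([.spec.containers[].resources.requests["nvidia.com/gpu"] // "0" | tonumber] | add)}
      | select(.gpus > 0)
      | "\(.name)\t\(.gpus)"'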
I am seeing this issue as well; the allocated GPU count is not being accounted for properly on this version of the GPU Operator.

Kernel Version: 4.18.0-513.24.1.el8_9.x86_64
OS Image: RHEL 9.3
Operating System: linux
Architecture: x86_64
Container Runtime Version: containerd://1.7.11
Kubelet Version: v1.26.15+rke2r1
Kube-Proxy Version: v1.26.15+rke2r1
gpu-operator v24.6.0
NVIDIA-SMI 560.28.03, Driver Version: 560.28.03, CUDA Version: 12.6
I have confirmed beyond a doubt that the pod is using the GPU, but the allocated resources count is not updating to reflect the usage. This is the first time I have seen this issue; in GPU Operator 23.9.1 I did not see this problem.
Capacity:
  cpu:                8
  ephemeral-storage:  209702892Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32166136Ki
  nvidia.com/gpu:     4
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  203998973178
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32166136Ki
  nvidia.com/gpu:     4
  pods:               110
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                3075m (38%)    1100m (13%)
  memory             11214Mi (35%)  28202Mi (89%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
  nvidia.com/gpu     0              0

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:1E.0 Off |                    0 |
| N/A   34C    P0             27W /  70W  |   1467MiB / 15360MiB   |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    1708723      C   python3                                     1464MiB |
+-----------------------------------------------------------------------------------------+
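One caveat worth noting (not stated above): the kubelet's "Allocated resources" table counts scheduler requests and limits, not actual GPU utilization, so a container that reaches the GPU without declaring nvidia.com/gpu in its resources will show 0 there even while nvidia-smi reports a process. A sketch to check what the pod actually requests (pod and namespace names are placeholders; jq assumed):

# Print each container's declared resource requests and limits.
kubectl get pod <pod-name> -n <namespace> -o json \
  | jq '.spec.containers[] | {name, requests: .resources.requests, limits: .resources.limits}'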
Allocatable GPU values not correct after configuring time slicing
Relevant node labels
Pod status
1. Quick Debug Information
Kernel Version: 5.10.210-201.852.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.11
Kubelet Version: v1.29.0-eks-5e0fdde
Kube-Proxy Version: v1.29.0-eks-5e0fdde
gpu-operator v23.9.2
2. Issue or feature description
The allocatable GPU count should be 8.