4paradigm / k8s-vgpu-scheduler

The OpenAIOS vGPU device plugin for Kubernetes originated from the OpenAIOS project to virtualize GPU device memory, allowing applications to access a larger memory space than the physical capacity. It is designed to make extended device memory easy to use for AI workloads.
Apache License 2.0

parameter devicePlugin.deviceSplitCount does not work #35

Open 2232729885 opened 6 months ago

2232729885 commented 6 months ago

I used Helm to install k8s-vgpu-scheduler with devicePlugin.deviceSplitCount=5. After the deployment succeeded, I ran 'kubectl describe node' and the allocatable 'nvidia.com/gpu' count was 40 (the machine has 8 A40 cards). I then created 6 pods, each requesting 1 'nvidia.com/gpu'. But when I create a pod that requests 3 'nvidia.com/gpu', Kubernetes says the pod cannot be scheduled.
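
For reference, a minimal pod spec matching the multi-GPU request described above might look like the sketch below. The pod name, image, and command are assumptions for illustration; only the 'nvidia.com/gpu: 3' request comes from the description.

```shell
# Hypothetical reproduction of the failing request: one pod asking for 3 vGPU slices.
# Pod name, image, and command are placeholders, not taken from the issue.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: multi-gpu-pod
spec:
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 3   # three vGPU slices requested from the scheduler
EOF
```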

The vgpu-scheduler logs are shown below; they seem to say that only 2 GPU cards are usable:

I0313 00:58:35.594437 1 score.go:65] "devices status"
I0313 00:58:35.594467 1 score.go:67] "device status" device id="GPU-0707087e-8264-4ba4-bc45-30c70272ec4a" device detail={"Id":"GPU-0707087e-8264-4ba4-bc45-30c70272ec4a","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true}
I0313 00:58:35.594519 1 score.go:67] "device status" device id="GPU-b3e35ad4-81ee-0aee-9865-4787748b93ce" device detail={"Id":"GPU-b3e35ad4-81ee-0aee-9865-4787748b93ce","Index":1,"Used":0,"Count":10,"Usedmem":0,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true}
I0313 00:58:35.594542 1 score.go:67] "device status" device id="GPU-d38a391c-9f2f-395e-2f91-1785a648f6c4" device detail={"Id":"GPU-d38a391c-9f2f-395e-2f91-1785a648f6c4","Index":2,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true}
I0313 00:58:35.594568 1 score.go:67] "device status" device id="GPU-7099a282-5a75-55f8-0cd0-a4b48098ae1e" device detail={"Id":"GPU-7099a282-5a75-55f8-0cd0-a4b48098ae1e","Index":3,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true}
I0313 00:58:35.594600 1 score.go:67] "device status" device id="GPU-56967eb2-30b7-c808-367a-225b8bd8a12e" device detail={"Id":"GPU-56967eb2-30b7-c808-367a-225b8bd8a12e","Index":4,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true}
I0313 00:58:35.594639 1 score.go:67] "device status" device id="GPU-54191405-e5a9-2f7b-8ac4-f4e86c6669cb" device detail={"Id":"GPU-54191405-e5a9-2f7b-8ac4-f4e86c6669cb","Index":5,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true}
I0313 00:58:35.594671 1 score.go:67] "device status" device id="GPU-e731cd15-879f-6d00-485d-d1b468589de9" device detail={"Id":"GPU-e731cd15-879f-6d00-485d-d1b468589de9","Index":6,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true}
I0313 00:58:35.594693 1 score.go:67] "device status" device id="GPU-865edbf8-5d63-8e57-5e14-36682179eaf6" device detail={"Id":"GPU-865edbf8-5d63-8e57-5e14-36682179eaf6","Index":7,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true}
I0313 00:58:35.594725 1 score.go:90] "Allocating device for container request" pod="default/gpu-pod-2" card request={"Nums":5,"Type":"NVIDIA","Memreq":0,"MemPercentagereq":100,"Coresreq":0}
I0313 00:58:35.594757 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=5 device index=7 device="GPU-b3e35ad4-81ee-0aee-9865-4787748b93ce"
I0313 00:58:35.594800 1 score.go:140] "first fitted" pod="default/gpu-pod-2" device="GPU-b3e35ad4-81ee-0aee-9865-4787748b93ce"
I0313 00:58:35.594829 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=4 device index=6 device="GPU-0707087e-8264-4ba4-bc45-30c70272ec4a"
I0313 00:58:35.594850 1 score.go:140] "first fitted" pod="default/gpu-pod-2" device="GPU-0707087e-8264-4ba4-bc45-30c70272ec4a"
I0313 00:58:35.594869 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=5 device="GPU-865edbf8-5d63-8e57-5e14-36682179eaf6"
I0313 00:58:35.594889 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=4 device="GPU-e731cd15-879f-6d00-485d-d1b468589de9"
I0313 00:58:35.594911 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=3 device="GPU-54191405-e5a9-2f7b-8ac4-f4e86c6669cb"
I0313 00:58:35.594929 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=2 device="GPU-56967eb2-30b7-c808-367a-225b8bd8a12e"
I0313 00:58:35.594948 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=1 device="GPU-7099a282-5a75-55f8-0cd0-a4b48098ae1e"
I0313 00:58:35.594966 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=0 device="GPU-d38a391c-9f2f-395e-2f91-1785a648f6c4"
I0313 00:58:35.594989 1 score.go:211] "calcScore:node not fit pod" pod="default/gpu-pod-2" node="gpu-230"
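
In the device list above, six of the eight cards report Usedmem equal to Totalmem (46068), so only the two cards with Usedmem 0 still have free memory, which appears to match the final "node not fit pod" line. One way to cross-check this against the cluster state is sketched below; the node name gpu-230 comes from the log, and the jsonpath query is a generic Kubernetes pattern rather than anything specific to this project.

```shell
# List every pod's nvidia.com/gpu and nvidia.com/gpumem limits
# (dots inside the resource name must be escaped in jsonpath).
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.containers[*].resources.limits.nvidia\.com/gpu}{"\t"}{.spec.containers[*].resources.limits.nvidia\.com/gpumem}{"\n"}{end}'

# Compare with what the node reports as already allocated.
kubectl describe node gpu-230 | grep -A 10 "Allocated resources"
```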

The output of 'kubectl describe node gpu-230': (screenshot)

The output of 'nvidia-smi': (screenshot)

Can somebody help solve this issue? Thanks.

2232729885 commented 6 months ago

The Helm install command used is:

# add vgpu repo
helm repo add vgpu-charts https://4paradigm.github.io/k8s-vgpu-scheduler

# helm install chart
helm upgrade --install vgpu vgpu-charts/vgpu \
  -n kube-system \
  --set scheduler.kubeScheduler.imageTag=v1.27.2 \
  --set devicePlugin.deviceSplitCount=5 \
  --set devicePlugin.deviceMemoryScaling=1 \
  --set devicePlugin.migStrategy=none \
  --set resourceName=nvidia.com/gpu \
  --set resourceMem=nvidia.com/gpumem \
  --set resourceMemPercentage=nvidia.com/gpumem-percentage \
  --set resourceCores=nvidia.com/gpucores \
  --set resourcePriority=nvidia.com/priority \
  --set devicePlugin.tolerations[0].key=nvidia.com/gpu \
  --set devicePlugin.tolerations[0].operator=Exists \
  --set devicePlugin.tolerations[0].effect=NoSchedule
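
To double-check that the deviceSplitCount value was actually applied, the release values and the node's allocatable count can be inspected. A sketch, reusing the release name, namespace, and node name from above:

```shell
# Values the vgpu release was installed with; devicePlugin.deviceSplitCount
# should show up as 5 here.
helm get values vgpu -n kube-system

# Allocatable vGPUs on the node; expected to be
# <physical GPUs> x deviceSplitCount = 8 x 5 = 40.
kubectl get node gpu-230 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```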

2232729885 commented 6 months ago

(screenshot)

2232729885 commented 6 months ago

(screenshot)