AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

The GPU count is detected, but the GPU memory is 0 #159

Open gxwangit opened 2 years ago

gxwangit commented 2 years ago

Capacity:
  aliyun.com/gpu-count: 8
  aliyun.com/gpu-mem:   0

GPU: Tesla V100
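The capacity above is read from the node status; roughly like this (`<node-name>` is just a placeholder):

# Placeholder node name; prints the extended resources the device plugin advertises on the node
kubectl get node <node-name> -o jsonpath='{.status.capacity}'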

The logs are as follows:

[root@localhost ~]# kubectl logs -f -n kube-system gpushare-device-plugin-ds-qjltc
I1012 05:08:46.374978 1 main.go:18] Start gpushare device plugin
I1012 05:08:46.375045 1 gpumanager.go:28] Loading NVML
I1012 05:08:46.379478 1 gpumanager.go:37] Fetching devices.
I1012 05:08:46.379497 1 gpumanager.go:43] Starting FS watcher.
I1012 05:08:46.379930 1 gpumanager.go:51] Starting OS watcher.
I1012 05:08:46.389438 1 nvidia.go:64] Deivce GPU-60805828-8ab0-6124-67c4-9baff56d087b's Path is /dev/nvidia0
I1012 05:08:46.389549 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.389564 1 nvidia.go:40] set gpu memory: 32510
I1012 05:08:46.389577 1 nvidia.go:76] # Add first device ID: GPU-60805828-8ab0-6124-67c4-9baff56d087b--0
I1012 05:08:46.453844 1 nvidia.go:79] # Add last device ID: GPU-60805828-8ab0-6124-67c4-9baff56d087b--32509
I1012 05:08:46.461774 1 nvidia.go:64] Deivce GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01's Path is /dev/nvidia1
I1012 05:08:46.461816 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.461827 1 nvidia.go:76] # Add first device ID: GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01--0
I1012 05:08:46.559867 1 nvidia.go:79] # Add last device ID: GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01--32509
I1012 05:08:46.567541 1 nvidia.go:64] Deivce GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a's Path is /dev/nvidia2
I1012 05:08:46.567574 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.567583 1 nvidia.go:76] # Add first device ID: GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a--0
I1012 05:08:46.658328 1 nvidia.go:79] # Add last device ID: GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a--32509
I1012 05:08:46.666367 1 nvidia.go:64] Deivce GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5's Path is /dev/nvidia3
I1012 05:08:46.666393 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.666399 1 nvidia.go:76] # Add first device ID: GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5--0
I1012 05:08:46.676851 1 nvidia.go:79] # Add last device ID: GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5--32509
I1012 05:08:46.683786 1 nvidia.go:64] Deivce GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991's Path is /dev/nvidia4
I1012 05:08:46.683802 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.683809 1 nvidia.go:76] # Add first device ID: GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991--0
I1012 05:08:46.948055 1 nvidia.go:79] # Add last device ID: GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991--32509
I1012 05:08:46.956435 1 nvidia.go:64] Deivce GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf's Path is /dev/nvidia5
I1012 05:08:46.956486 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.956504 1 nvidia.go:76] # Add first device ID: GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf--0
I1012 05:08:46.972438 1 nvidia.go:79] # Add last device ID: GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf--32509
I1012 05:08:46.980775 1 nvidia.go:64] Deivce GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415's Path is /dev/nvidia6
I1012 05:08:46.980797 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.980805 1 nvidia.go:76] # Add first device ID: GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415--0
I1012 05:08:46.990545 1 nvidia.go:79] # Add last device ID: GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415--32509
I1012 05:08:46.997877 1 nvidia.go:64] Deivce GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2's Path is /dev/nvidia7
I1012 05:08:46.997891 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.997895 1 nvidia.go:76] # Add first device ID: GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2--0
I1012 05:08:47.249585 1 nvidia.go:79] # Add last device ID: GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2--32509
I1012 05:08:47.249606 1 server.go:43] Device Map: map[GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415:6 GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2:7 GPU-60805828-8ab0-6124-67c4-9baff56d087b:0 GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01:1 GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a:2 GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5:3 GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991:4 GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf:5]
I1012 05:08:47.249644 1 server.go:44] Device List: [GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5 GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991 GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415 GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2 GPU-60805828-8ab0-6124-67c4-9baff56d087b GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01 GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a]
I1012 05:08:47.265532 1 podmanager.go:68] No need to update Capacity aliyun.com/gpu-count
I1012 05:08:47.266863 1 server.go:222] Starting to serve on /var/lib/kubelet/device-plugins/aliyungpushare.sock
I1012 05:08:47.267431 1 server.go:230] Registered device plugin with Kubelet

Has anyone run into this? k8s 1.16.3, nvidia-runtime 1.1-dev

wsxiaozhang commented 2 years ago

It shows you have 8 GPU cards, each with 32 GiB of memory. You should keep the device plugin DaemonSet YAML at its default, shown below, which sets --memory-unit=GiB. With that setting, each GPU card is reported as 32 virtual devices (each virtual device stands for 1 GiB of GPU memory). If --memory-unit=MiB is set instead, the plugin reports one virtual device per MiB, i.e. 32510 devices for each GPU card and 8 × 32510 in total, just as your log shows (device IDs 0 through 32509). Transferring device info at that volume is very likely to get the connection between the device plugin and the kubelet stuck.

command:
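  # Sketch of the default command block from the gpushare device plugin DaemonSet
  # (assumed to match device-plugin-ds.yaml shipped with the AliyunContainerService
  # gpushare device plugin; the point confirmed above is that --memory-unit stays GiB)
  - gpushare-device-plugin-v2
  - -logtostderr
  - --v=5
  - --memory-unit=GiB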