Two GPUs, but only one card is recognized - Githubissues

4paradigm / k8s-vgpu-scheduler

The OpenAIOS vGPU device plugin for Kubernetes originated in the OpenAIOS project. It virtualizes GPU device memory so that applications can access a larger memory space than the physical capacity, and it is designed to make extended device memory easy to use for AI workloads.

Apache License 2.0

489 stars 93 forks

Two GPUs, but only one card is recognized #16

Closed absolutelyZero closed 2 years ago

absolutelyZero commented 2 years ago

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   33C    P0    24W / 250W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:68:00.0 Off |                    0 |
| N/A   27C    P0    23W / 250W |      4MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

As shown above, the server has two V100 cards in total, but after deploying the device plugin, only GPU 0 is split into vGPUs.

The split parameters are as follows:

args:
  - '--fail-on-init-error=false'
  - '--device-split-count=4'
  - '--device-memory-scaling=2'
  - '--device-cores-scaling=4'

Describing the GPU node also shows only 4 vGPUs instead of the expected 8:

Capacity:
  cpu:                36
  ephemeral-storage:  3478455808Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131500528Ki
  nvidia.com/gpu:     4
  pods:               110
Allocatable:
  cpu:                35600m
  ephemeral-storage:  3478455808Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             127399425539
  nvidia.com/gpu:     4
  pods:               110

docker version: 20.10.12

k8s version: v1.19.9
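The expected total of 8 vGPUs in this report is simply the number of physical cards multiplied by the value of `--device-split-count`. The following is a minimal Python sketch of that arithmetic, not code from the plugin itself. The helper name `expected_vgpu_count` is hypothetical, and falling back to a split count of 1 when the flag is absent is an assumption made only for illustration.

```python
from typing import Sequence


def expected_vgpu_count(physical_gpus: int, args: Sequence[str]) -> int:
    """Return physical_gpus multiplied by the --device-split-count value.

    Hypothetical helper for illustration; it only parses the flag string
    and does not query the device plugin or the cluster.
    """
    split_count = 1  # assumption: treat a missing flag as "no splitting"
    for arg in args:
        if arg.startswith("--device-split-count="):
            split_count = int(arg.split("=", 1)[1])
    return physical_gpus * split_count


# Values taken from the issue: two V100 cards and the reported plugin args.
plugin_args = [
    "--fail-on-init-error=false",
    "--device-split-count=4",
    "--device-memory-scaling=2",
    "--device-cores-scaling=4",
]
expected = expected_vgpu_count(2, plugin_args)
reported = 4  # nvidia.com/gpu capacity reported for the node

print(f"expected={expected} reported={reported}")
if reported < expected:
    # With a split count of 4, a capacity of 4 corresponds to one card.
    print("capacity matches", reported // 4, "physical GPU(s)")
```

With the values from this issue, the reported capacity of 4 corresponds to exactly one split card, which is consistent with only GPU 0 being split.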

absolutelyZero commented 2 years ago

I have switched to the vgpu-scheduler approach, which correctly recognizes both physical cards and creates the corresponding vGPUs.