4paradigm / k8s-vgpu-scheduler

The OpenAIOS vGPU device plugin for Kubernetes originates from the OpenAIOS project. It virtualizes GPU device memory so that applications can access a larger memory space than the physical capacity, and it is designed to make extended device memory easy to use for AI workloads.
Apache License 2.0
513 stars 93 forks

Split into 10 shares, but vGPU memory is unchanged (NVIDIA A100) #19

Closed Dripman closed 2 years ago

Dripman commented 2 years ago

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
  annotations:
    deprecated.daemonset.template.generation: '2'
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: nvidia-device-plugin-ds
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      volumes:

Thu May 5 09:50:57 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02    Driver Version: 510.60.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:0C.0 Off |                    0 |
| N/A   28C    P0    51W / 400W |    413MiB / 40960MiB |      3%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     77977      C                                     411MiB |
+-----------------------------------------------------------------------------+

The GPU memory still shows 40 GB.
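For anyone who wants to script this check, here is a minimal Python sketch that pulls the used/total figures out of the `Memory-Usage` field in `nvidia-smi` output. The helper names `parse_memory_usage` and `looks_split` are hypothetical, and the even-split rule (total memory divided by the share count) is only an assumption about what a split into 10 shares should look like; the thread does not state how the plugin sizes each vGPU.

```python
import re


def parse_memory_usage(smi_text):
    """Return (used_mib, total_mib) from the first 'NNNMiB / NNNMiB' field."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", smi_text)
    if match is None:
        raise ValueError("no Memory-Usage field found in the given text")
    return int(match.group(1)), int(match.group(2))


def looks_split(total_mib, physical_mib, shares):
    """Hypothetical check: assumes each share gets at most physical/shares MiB."""
    return total_mib <= physical_mib // shares


# Memory line copied from the nvidia-smi output in this issue.
sample = "| N/A 28C P0 51W / 400W | 413MiB / 40960MiB | 3% Default |"
used, total = parse_memory_usage(sample)
print(used, total, looks_split(total, physical_mib=40960, shares=10))
```

With the output posted above, the reported total equals the full physical memory of the card, so the check reports that no split is visible.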

archlitchi commented 2 years ago

This has been fixed; please update.

Dripman commented 2 years ago

That issue is resolved now. However, dmesg shows the following error: nvidia-smi[25708]: segfault at 0 ip 00007f2f72d3a14a sp 00007ffe3a9005b8 error 4 in libc-2.27.so[7f2f72bbb000+1e7000]. I am not sure whether this will affect usage.
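As a reading aid for that dmesg line, the following Python sketch decodes its fields. It assumes the usual Linux x86 segfault message layout and the x86 page-fault error-code bits (bit 0: page present, bit 1: write access, bit 2: user mode); the function name `decode_segfault` is made up for illustration. It only describes where the fault happened, not why.

```python
import re

# Pattern for the kernel's "segfault at ... ip ... sp ... error N in obj[base+size]" line.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>\d+) in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Decode a dmesg segfault line into its address, object, and access details."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a recognizable segfault line")
    err = int(m["err"])
    return {
        "fault_addr": int(m["addr"], 16),   # address that was accessed
        "object": m["obj"],                 # mapped file containing the faulting code
        "offset": int(m["ip"], 16) - int(m["base"], 16),  # offset of ip within that mapping
        "page_present": bool(err & 1),      # x86 error-code bit 0
        "write": bool(err & 2),             # x86 error-code bit 1
        "user_mode": bool(err & 4),         # x86 error-code bit 2
    }


line = ("nvidia-smi[25708]: segfault at 0 ip 00007f2f72d3a14a sp 00007ffe3a9005b8 "
        "error 4 in libc-2.27.so[7f2f72bbb000+1e7000]")
info = decode_segfault(line)
print(info["object"], hex(info["offset"]), info)
```

Under those assumptions, the line describes a user-mode read of address 0 (a null-pointer read) at offset 0x17f14a inside libc-2.27.so, raised by the nvidia-smi process.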

Dripman commented 2 years ago

"This has been fixed; please update."

Thank you very much!

archlitchi commented 2 years ago

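That should not affect usage. If you do run into problems, open an issue or add me on WeChat: xuanzong4493

That said, if you plan to go to production, I recommend using this: https://github.com/4paradigm/k8s-vgpu-scheduler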