Open viknana opened 5 years ago
Hi! I've run into this problem too. I tried nvidia-device-plugin-daemonset, but it still didn't work. If I only run nvidia-device-plugin-daemonset without gpushare-device-plugin, running create -f 1.yaml (to create the binpack pod) produces no output at all.
So, could you provide more detailed step-by-step instructions, along with your environment configuration (GPU model, etc.)? Thanks a lot.
I've run into the same problem before. In my case the cluster's scheduler was not running with the gpushare-scheduler-extender enabled, so the pod's annotations never got the device ID that should be allocated to it; when the device plugin then performs the actual allocation, it reports an unknown device id error. You can describe your pod and check whether the ALIYUN_COM_GPU_MEM_IDX annotation has a value.
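A quick way to check this (just a sketch; the pod name binpack-1 is an assumption here, replace it with your own pod name and namespace):

kubectl describe pod binpack-1 | grep ALIYUN_COM_GPU_MEM_IDX
# if this annotation is missing or empty, the gpushare-scheduler-extender did not
# assign a device, and the device plugin will fail at container start with
# "unknown device id: no-gpu-has-...MiB-to-run"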
Hi, when I run the example I get this error: nvidia-container-cli: device error: unknown device id: no-gpu-has-1024MiB-to-run. How do I fix it? Is it related to the GPU driver? Thanks.
Hi, I installed Kubernetes 1.14 with kubespray 2.10. The nvidia device plugin beta2 works fine, but I want multiple containers to share a GPU, so I switched to this plugin. Now I'm seeing the error below as well. Why does it say here that there isn't even 1 GB available?
Error: failed to start container "binpack-1": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-1MiB-to-run\\\\n\\\"\"": unknown
Also, my GPUs are a 1080 Ti plus a 960, so this output doesn't look right, does it?
[bing@k8s-demo-master1-phycial aliyun_shared_gpu_demo]$ kubectl inspect gpushare
NAME             IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  PENDING(Allocated)  GPU Memory(GiB)
k8s-demo-slave2  192.168.2.140  0/1                    0/1                    1                   1/2
[bing@k8s-demo-master1-phycial aliyun_shared_gpu_demo]$ kubectl-inspect-gpushare
NAME             IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU Memory(GiB)
k8s-demo-slave2  192.168.2.140  0/1                    0/1                    0/2
--------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
0/2 (0%)
nvidia-smi
Thu Oct 10 15:03:38 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 960 Off | 00000000:17:00.0 Off | N/A |
| 36% 29C P8 7W / 120W | 0MiB / 2002MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:66:00.0 Off | N/A |
| 14% 37C P8 25W / 270W | 0MiB / 11175MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
@pan87232494, could you paste your yaml so we can take a look?
I'm hitting the same problem too; my yaml is modified from the official demo. Has anyone here solved this?

apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-1
  labels:
    app: binpack-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: binpack-1
  template:
    metadata:
      labels:
        app: binpack-1
    spec:
      nodeName: worker2.testgpu.testgpu.com
      containers:
        # (container name/image not included in the comment; only the GPU memory limit was shown)
        - resources:
            limits:
              aliyun.com/gpu-mem: 1
Allocated/Total GPU Memory In Cluster: 10/22 (45%)
Could someone take a look and help out?
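For anyone still stuck, a basic sanity check is to confirm the gpushare components are actually running and that the node advertises the shared GPU memory resource (a sketch; the node name k8s-demo-slave2 is taken from the output above, replace it with your own):

kubectl get pods -n kube-system | grep gpushare
# the gpushare device plugin pod should be Running on every GPU node
kubectl describe node k8s-demo-slave2 | grep aliyun.com/gpu-mem
# aliyun.com/gpu-mem should appear under both Capacity and Allocatable

If the resource is missing, or the pod never gets the ALIYUN_COM_GPU_MEM_IDX annotation mentioned earlier, the no-gpu-has-...MiB-to-run error is what you will see.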
Has anyone solved this yet? It's pretty urgent.
Having the same issue.
same issue
Warning  Failed  40s (x4 over 85s)  kubelet  Error: failed to start container "binpack-1": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: device error: no-gpu-has-6025MiB-to-run: unknown device: unknown
Normal   Pulled  40s (x3 over 85s)  kubelet  Container image "reg.deeproute.ai/deeproute-simulation-services/gpu-player:v2" already present on machine
When I run it I get the error nvidia-container-cli: device error: unknown device id: no-gpu-has-1024MiB-to-run, but the nvidia-device-plugin-daemonset passes its test normally.