AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

Runtime error: OCI runtime create failed #58

Open viknana opened 5 years ago

viknana commented 5 years ago

When I run it, I get the error nvidia-container-cli: device error: unknown device id: no-gpu-has-1024MiB-to-run, yet the nvidia-device-plugin-daemonset passes its tests normally.
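
For reference, the 1024MiB in the error corresponds to a request of 1 GiB of GPU memory through the aliyun.com/gpu-mem extended resource (assuming the device plugin reports in GiB). A minimal deployment of that shape looks roughly like the sketch below; the image and request size are placeholders, not necessarily the exact spec used here.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: binpack-1
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: binpack-1
      template:
        metadata:
          labels:
            app: binpack-1
        spec:
          containers:
          - name: binpack-1
            image: cheyang/gpu-player:v2   # placeholder image
            resources:
              limits:
                # GPU memory request handled by the gpushare scheduler extender
                aliyun.com/gpu-mem: 1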

wzdutd commented 5 years ago

Hi! I ran into this problem too. I tried nvidia-device-plugin-daemonset and it still didn't work. If I only run nvidia-device-plugin-daemonset without gpushare-device-plugin, running create -f 1.yaml (to create the binpack pod) produces no output at all.

So could you share more detailed step-by-step instructions along with your environment configuration (GPU model, etc.)? Many thanks.

Sakuralbj commented 5 years ago

I ran into the same problem before. In my case the cluster's scheduler did not have gpushare-scheduler-extender enabled, so the pod's annotations never received the device ID it should be allocated; when the device plugin then performs the actual allocation, it reports the unknown device id error. You can describe your pod and check whether the annotation ALIYUN_COM_GPU_MEM_IDX has a value.
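
A minimal way to run that check, assuming a pod name such as binpack-1-xxxxx (substitute your own):

    # Print the GPU index the extender's bind step should write onto the pod;
    # an empty result suggests the scheduler extender never handled the pod.
    kubectl get pod binpack-1-xxxxx -o jsonpath='{.metadata.annotations.ALIYUN_COM_GPU_MEM_IDX}'

    # Or scan the full annotation list
    kubectl describe pod binpack-1-xxxxx | grep -i ALIYUN_COM_GPU_MEM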

cicijohn1983 commented 5 years ago

Hi, when I run the example I get the error nvidia-container-cli: device error: unknown device id: no-gpu-has-1024MiB-to-run. How can I fix this? Is it related to the GPU driver? Thanks.

pan87232494 commented 5 years ago

Hi, I installed Kubernetes 1.14 with kubespray 2.10. The nvidia device plugin beta2 works fine, but I want multiple containers to share a GPU, so I switched to this plugin. I'm also seeing the error below, but why does it say there is no 1 GB available?

    Error: failed to start container "binpack-1": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-1MiB-to-run\\\\n\\\"\"": unknown

Also, my GPUs are a 1080 Ti and a 960, so this output doesn't look right, does it?

    [bing@k8s-demo-master1-phycial aliyun_shared_gpu_demo]$ kubectl inspect gpushare
    NAME             IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  PENDING(Allocated)  GPU Memory(GiB)
    k8s-demo-slave2  192.168.2.140  0/1                    0/1                    1                   1/2

    [bing@k8s-demo-master1-phycial aliyun_shared_gpu_demo]$ kubectl-inspect-gpushare 
    NAME             IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU Memory(GiB)
    k8s-demo-slave2  192.168.2.140  0/1                    0/1                    0/2
    --------------------------------------------------------------
    Allocated/Total GPU Memory In Cluster:
    0/2 (0%)  

    nvidia-smi 
    Thu Oct 10 15:03:38 2019       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 960     Off  | 00000000:17:00.0 Off |                  N/A |
    | 36%   29C    P8     7W / 120W |      0MiB /  2002MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GTX 108...  Off  | 00000000:66:00.0 Off |                  N/A |
    | 14%   37C    P8    25W / 270W |      0MiB / 11175MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

illusion202 commented 5 years ago

@pan87232494, could you paste your YAML so we can take a look?

HistoryGift commented 4 years ago

I ran into the same problem. I modified the official demo. Has anyone solved this?

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: binpack-1
      labels:
        app: binpack-1
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: binpack-1
      template:
        metadata:
          labels:
            app: binpack-1
        spec:
          nodeName: worker2.testgpu.testgpu.com
          containers:

HistoryGift commented 4 years ago

I started gpushare-scheduler-extender on the master node. Since kube-scheduler is launched from the command line, I modified its service:

    ExecStart=/usr/local/bin/kube-scheduler \
      --address=0.0.0.0 \
      --master=http://127.0.0.1:8080 \
      --leader-elect=true \
      --v=2 \
      --use-legacy-policy-config=true \
      --policy-config-file=/etc/kubernetes/scheduler-policy-config.json

and changed 127.0.0.1 in the JSON to the master's IP:

    {
      "kind": "Policy",
      "apiVersion": "v1",
      "extenders": [
        {
          "urlPrefix": "http://masterIP:32766/gpushare-scheduler",
          "filterVerb": "filter",
          "bindVerb": "bind",
          "enableHttps": false,
          "nodeCacheCapable": true,
          "managedResources": [
            {
              "name": "aliyun.com/gpu-mem",
              "ignoredByScheduler": false
            }
          ],
          "ignorable": false
        }
      ]
    }

All the other plugins were deployed according to install.md, but deploying the pod still fails with:

    Error: failed to start container "binpack-1": Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-10MiB-to-run\\n\\"\"": unknown

However, kubectl-inspect-gpushare-v2 shows the resources were handed out. Where could the problem be?

    NAME                         IPADDRESS                    GPU0(Allocated/Total)  GPU1(Allocated/Total)  PENDING(Allocated)  GPU Memory(GiB)
    worker2.testgpu.testgpu.com  worker2.testgpu.testgpu.com  0/11                   0/11                   10                  10/22

    Allocated/Total GPU Memory In Cluster: 10/22 (45%)

Can anyone help take a look?
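
A quick sanity check for this kind of setup (a sketch only; the unit name, file path, and pod name are taken from the comments above and may differ on other systems):

    # Confirm the running kube-scheduler was actually restarted with the policy flag
    systemctl cat kube-scheduler | grep -- --policy-config-file

    # Look for extender/bind errors in the scheduler logs
    journalctl -u kube-scheduler --since "10 minutes ago" | grep -i extender

    # Verify the extender's bind step wrote the device index onto the failing pod
    kubectl describe pod binpack-1-xxxxx | grep -i ALIYUN_COM_GPU_MEM

If the scheduler never loaded the policy, or cannot reach the extender at masterIP:32766, the bind step will not set the pod's ALIYUN_COM_GPU_MEM_IDX annotation, which matches the unknown device id failure described earlier in this thread.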

baozhiming commented 3 years ago

Has anyone solved this problem? It's quite urgent.

debMan commented 3 years ago

Having the same issue.

zhichenghe commented 2 years ago

Same issue.

zhichenghe commented 2 years ago

    Warning  Failed  40s (x4 over 85s)  kubelet  Error: failed to start container "binpack-1": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: device error: no-gpu-has-6025MiB-to-run: unknown device: unknown
    Normal   Pulled  40s (x3 over 85s)  kubelet  Container image "reg.deeproute.ai/deeproute-simulation-services/gpu-player:v2" already present on machine