AliyunContainerService / gpushare-device-plugin

GPU Sharing Device Plugin for Kubernetes Cluster

nvidia-container-cli: device error: unknown device id: no-gpu-has-256MiB-to-run #23

Open zhaogaolong opened 4 years ago

zhaogaolong commented 4 years ago

Version information:

k8s: 1.17
gpushare-device-plugin: v2-1.11-aff8a23
nvidia-smi: 440.36
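Before digging into the events, a quick sanity check is to confirm that both gpushare components are actually running; the names below assume the standard kube-system deployment from the gpushare installation guide, so adjust them to your cluster:

  # List the device plugin and scheduler extender pods (assumed names).
  kubectl -n kube-system get pods -o wide | grep -E 'gpushare-device-plugin|gpushare-schd-extender'
  # If the kubectl-inspect-gpushare plugin is installed, show per-node GPU memory allocation.
  kubectl inspect gpushare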

kubectl describe pod <pod name> -n zhaogaolong shows the following error events for the pod:

Events:
  Type     Reason     Age                From                      Message
  ----     ------     ----               ----                      -------
  Normal   Scheduled  <unknown>          default-scheduler         Successfully assigned zhaogaolong/gpu-demo-gpushare-659fd6cbb7-6fc8v to gpu-node
  Normal   Pulling    32s (x4 over 70s)  kubelet, gpu-node  Pulling image "hub.xxxx.com/zhaogaolong/gpu-demo.build.build:bccfcbe43f43280d-1584070500-dac37f2c12024544a6cc2871440dc94a577a7ff3"
  Normal   Pulled     32s (x4 over 70s)  kubelet, gpu-node  Successfully pulled image "hub.xxx.com/zhaogaolong/gpu-demo.build.build:bccfcbe43f43280d-1584070500-dac37f2c12024544a6cc2871440dc94a577a7ff3"
  Normal   Created    31s (x4 over 70s)  kubelet, gpu-node  Created container gpu
  Warning  Failed     31s (x4 over 70s)  kubelet, gpu-node  Error: failed to start container "gpu": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-256MiB-to-run\\\\n\\\"\"": unknown
  Warning  BackOff    10s (x5 over 68s)  kubelet, gpu-node  Back-off restarting failed container
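For context: the id in the error, no-gpu-has-256MiB-to-run, does not name a real GPU. As far as I can tell it is a placeholder device id that the device plugin hands to the container when no GPU was actually assigned to the pod, with 256MiB presumably being the pod's aliyun.com/gpu-mem request. That usually means the pod was placed by the default scheduler without going through the gpushare scheduler extender. One way to check is to look for the gpushare annotations on the failing pod; the annotation prefix below is taken from the gpushare project and is an assumption about this deployment:

  # A pod bound by the gpushare scheduler extender is expected to carry
  # ALIYUN_COM_GPU_MEM_* annotations (assumed names, e.g. ALIYUN_COM_GPU_MEM_ASSIGNED).
  kubectl get pod gpu-demo-gpushare-659fd6cbb7-6fc8v -n zhaogaolong -o yaml | grep "ALIYUN_COM_GPU_MEM"

If nothing shows up, the extender never bound the pod, which matches the fix suggested further down in this thread.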

Same issue:

https://github.com/NVIDIA/nvidia-docker/issues/1042

@cheyang

Joseph516 commented 4 years ago

Has anybody fixed this? I am hitting the same problem here: https://github.com/AliyunContainerService/gpushare-scheduler-extender/issues/120#issue-665519945

vio-f commented 2 years ago

I encountered the same issue today. Can anybody help please?

Lanyujiex commented 2 years ago

Update your scheduler config to use the gpushare scheduler extender (gpushare-sch-extender) and restart the scheduler; that should fix it. @vio-f
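For anyone whose cluster never had the extender wired into kube-scheduler: the scheduler policy file from the gpushare-scheduler-extender installation guide looks roughly like the snippet below. The extender address and port, the file path, and the exact flag wiring are assumptions for your cluster, so adapt them before restarting the scheduler:

  {
    "kind": "Policy",
    "apiVersion": "v1",
    "extenders": [
      {
        "urlPrefix": "http://127.0.0.1:32766/gpushare-scheduler",
        "filterVerb": "filter",
        "bindVerb": "bind",
        "enableHttps": false,
        "nodeCacheCapable": true,
        "managedResources": [
          { "name": "aliyun.com/gpu-mem", "ignoredByScheduler": false }
        ],
        "ignorable": false
      }
    ]
  }

On k8s 1.17 this is typically passed to kube-scheduler via --policy-config-file (for example in /etc/kubernetes/manifests/kube-scheduler.yaml, path assumed); after the scheduler restarts, delete and re-create the failing pod so it gets scheduled through the extender.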