4paradigm / k8s-vgpu-scheduler

The OpenAIOS vGPU device plugin for Kubernetes originated from the OpenAIOS project. It virtualizes GPU device memory so that applications can access more memory than the GPU's physical capacity, and it is designed to make extended device memory easy to use for AI workloads.
Apache License 2.0
489 stars 93 forks

Committed image cannot run on another node. #8

Open haijohn opened 3 years ago

haijohn commented 3 years ago


1. Issue or feature description

A committed image cannot run on another node.

2. Steps to reproduce the issue

  1. Start a pod with GPU enabled.
  2. Commit the container to an image and push it to the registry.
  3. Start a pod with the committed image on another node. The container fails to start with the following error:
    Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: 
    exit status 1, stdout: , stderr: nvidia-container-cli: device error: GPU-caba9b00-6386-2c33-7834-646ef2692cb7: unknown device\\\\n\\\"\"": unknown

3. Information to attach (optional if deemed irrelevant)


archlitchi commented 3 years ago

Are you starting it with plain docker on the other node? If you can, let's talk on Slack.

haijohn commented 3 years ago

Yes. The other node is not using vGPU. If the other node also uses vGPU, the problem does not seem to occur.

archlitchi commented 3 years ago

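Right. If you start it with plain docker, you cannot request GPUs with --gpus. You have to configure it like this instead: docker run -it --runtime=nvidia -e=NVIDIA_VISIBLE_DEVICES=0,1,2,3 {image} (the numbers are the GPU indices, or use all for every GPU).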
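For reference, the command suggested above can be wrapped in a small shell function. This is only a sketch: `build_gpu_run_cmd` is a hypothetical helper (not part of docker or k8s-vgpu-scheduler), the image name is a placeholder, and it assumes the `nvidia` runtime is installed on the target node. It prints the command for review instead of running it, and it uses the `-e NAME=value` spelling of the environment flag.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the docker command for running a committed
# image on a node without vGPU, using --runtime=nvidia instead of --gpus.
build_gpu_run_cmd() {
  local image="$1"           # image to run, e.g. the committed image
  local devices="${2:-all}"  # comma-separated GPU indices, or "all"
  printf 'docker run -it --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=%s %s\n' \
    "$devices" "$image"
}

# Example: print the command for GPUs 0 and 1.
# registry.example.com/my-image:committed is a placeholder image name.
build_gpu_run_cmd registry.example.com/my-image:committed 0,1
```

Omitting the second argument falls back to `all`, which exposes every GPU on the node to the container.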