4paradigm / k8s-vgpu-scheduler

The OpenAIOS vGPU device plugin for Kubernetes originated from the OpenAIOS project. It virtualizes GPU device memory so that applications can access a larger memory space than the GPU's physical capacity, and it is designed to make extended device memory easy to use for AI workloads.
Apache License 2.0

Core dump when requesting 2 or more GPUs with Tesla T4 #24

Open ryan1051 opened 2 years ago

ryan1051 commented 2 years ago

1. Issue or feature description

Requesting 1 GPU in the YAML works fine. When requesting more than 1, the output of nvidia-smi is as shown below: [image] The output of nvidia-smi on the host machine is fine.

On another machine with a GeForce RTX 2070 SUPER, requesting 2 GPUs works: [image] However, when I run the application locally, it aborts with:

[4pdvGPU ERROR (pid:697 thread=140106827071488 context.c:189)]: cuCtxGetDevice Not Found. tid=140106827071488 ctx=0x239601906000:0x23960041a000
 home/limengxuan/work/libcuda_override/src/cuda/context.c:189: cuCtxGetDevice: Assertion `0' failed.

2. Steps to reproduce the issue

ubuntu1~20.04 + microk8s + Tesla T4 GPU + 510 driver
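
The pod spec itself was not attached to this report. As a rough sketch of the kind of request described above (more than one GPU in the container's resource limits), the script below builds a Pod manifest as JSON, which kubectl accepts in place of YAML. The pod name, container name, and image tag are hypothetical placeholders, and the `nvidia.com/gpu` resource name is an assumption about how the plugin is registered on the cluster; adjust all of them to match the real setup.

```python
import json


def build_pod_manifest(gpu_count, name="gpu-test",
                       image="nvidia/cuda:11.6.2-base-ubuntu20.04"):
    """Return a Pod manifest (as a dict) whose single container asks for gpu_count GPUs.

    `name` and `image` are hypothetical placeholders, and "nvidia.com/gpu" is
    assumed to be the resource name exposed by the device plugin.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "cuda",
                    "image": image,
                    # Run nvidia-smi once so the result shows up in the pod logs.
                    "command": ["nvidia-smi"],
                    "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Emit the manifest for a 2-GPU request, the case that fails on the Tesla T4.
    print(json.dumps(build_pod_manifest(2), indent=2))
```

Saved under a hypothetical file name such as make_pod.py, the output can be piped to the cluster with `python make_pod.py | microk8s kubectl apply -f -`. Changing the argument to 1 gives the single-GPU case that works, which makes it easy to compare the two runs.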

3. Information to attach (optional if deemed irrelevant)

Common error checking:

Additional information that might help better understand your environment and reproduce the bug:

ryan1051 commented 2 years ago

Also, are memory isolation and fault isolation provided?