Running nvidia-smi inside a pod fails with Segmentation fault (core dumped) after enabling vGPU

4paradigm / k8s-vgpu-scheduler

OpenAIOS vGPU device plugin for Kubernetes originated from the OpenAIOS project to virtualize GPU device memory, allowing applications to access a larger memory space than the device's physical capacity. It is designed to make extended device memory easy to use for AI workloads.

Apache License 2.0

489 stars, 93 forks

Hi, I'm new to Kubernetes and have a question. I'd appreciate any guidance.

1. Issue or feature description

After enabling vGPU, running nvidia-smi inside a pod fails with Segmentation fault (core dumped).

2. Steps to reproduce the issue

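The master node is an 8-core, 16 GB Tencent Cloud VM. The worker node is a 20-core, 80 GB Tencent Cloud VM with one NVIDIA T4 GPU. The operating system is Ubuntu Server 18.04. On the worker node I installed docker and nvidia-docker2 as shown below and enabled vGPU. The image used is latest, and all parameters are left at their defaults (I tried changing the parameters, but the result was the same).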

Entering the pod with the following steps and running nvidia-smi inside it gives the result below.

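Running nvidia-smi directly on the GPU host works without problems. The Docker version is 20.10. Kubernetes 1.19.0 was installed with kubeadm, and kubelet is also 1.19.0. The docker info output is as follows, and daemon.json has also been configured with nvidia as both a runtime and the default-runtime.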
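For anyone double-checking the same setting, here is a minimal Python sketch that verifies a daemon.json declares nvidia as both a runtime and the default-runtime. It is not the reporter's file or tooling: the function name `check_nvidia_runtime` is hypothetical, and the example configuration only follows the layout commonly used with nvidia-docker2 (runtime path `/usr/bin/nvidia-container-runtime`), which is an assumption about this node.

```python
import json


def check_nvidia_runtime(daemon_json_text):
    """Return a list of problems found in a Docker daemon.json string.

    Hypothetical helper: it only checks the two settings mentioned in
    the report (the runtimes entry and default-runtime).
    """
    config = json.loads(daemon_json_text)
    problems = []

    # The default runtime must be nvidia so that pods get GPU access
    # without requesting the runtime explicitly.
    if config.get("default-runtime") != "nvidia":
        problems.append("default-runtime is not set to nvidia")

    # The nvidia runtime itself must be registered with a binary path.
    runtimes = config.get("runtimes", {})
    if "nvidia" not in runtimes:
        problems.append("runtimes has no nvidia entry")
    elif not runtimes["nvidia"].get("path"):
        problems.append("runtimes.nvidia has no path")

    return problems


# Illustrative configuration (assumed layout, not taken from the issue).
EXAMPLE = """
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
"""

if __name__ == "__main__":
    # An empty list means both settings are present.
    print(check_nvidia_runtime(EXAMPLE))
```

On a real node the text would typically be read from /etc/docker/daemon.json, the default location on Linux. Passing this check only confirms the Docker side of the setup; it does not explain the segmentation fault itself.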

Running nvidia-smi inside a pod fails with Segmentation fault (core dumped) after enabling vGPU #11
