4paradigm / k8s-vgpu-scheduler

The OpenAIOS vGPU device plugin for Kubernetes originated from the OpenAIOS project. It virtualizes GPU device memory so that applications can access a larger memory space than the GPU's physical capacity, and it is designed to make extended device memory easy to use for AI workloads.
Apache License 2.0

After enabling vGPU, running nvidia-smi inside a pod fails with Segmentation fault (core dumped) #11

Closed detongz closed 2 years ago

detongz commented 2 years ago

Hi, I'm new to Kubernetes and have a question. I'd appreciate some guidance.

1. Issue or feature description

After enabling vGPU, running nvidia-smi inside a pod fails with Segmentation fault (core dumped).

2. Steps to reproduce the issue

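The master node is an 8-core, 16 GB Tencent Cloud VM. The worker node is a 20-core, 80 GB Tencent Cloud VM with one NVIDIA T4 GPU. The operating system is Ubuntu Server 18.04. On the worker node I installed docker and nvidia-docker2 as shown below and enabled vGPU. The image tag is latest and all parameters are left at their defaults (I tried changing the parameters, but the result was the same). [image]

I then ran the following: [image] Running nvidia-smi inside the pod gives this result: [image]

Running nvidia-smi on the GPU host itself works fine. The Docker version is 20.10. The Kubernetes version is 1.19.0, installed with kubeadm, and kubelet is also 1.19.0. The docker info output is shown below. daemon.json is also configured with the nvidia runtime, and default-runtime is set to nvidia. [image]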
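For readers following along, the daemon.json setup described above can be sanity-checked with a short script. This is only a sketch: the helper name `nvidia_runtime_problems` is hypothetical, and the sample config is the layout commonly written by nvidia-docker2, not the reporter's actual file (which was posted only as an image).

```python
import json


def nvidia_runtime_problems(daemon_config):
    """Return a list of problems found; an empty list means the config looks right."""
    problems = []
    # The nvidia runtime must be registered under "runtimes".
    if "nvidia" not in daemon_config.get("runtimes", {}):
        problems.append('no "nvidia" entry under "runtimes"')
    # It must also be the default, so pods get it without extra flags.
    if daemon_config.get("default-runtime") != "nvidia":
        problems.append('"default-runtime" is not "nvidia"')
    return problems


# Illustrative daemon.json content (an assumption, not taken from the issue).
sample = json.loads("""
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}
  }
}
""")

print(nvidia_runtime_problems(sample))
```

On a real node, the same function could be applied to the parsed contents of /etc/docker/daemon.json. A clean result only rules out this one misconfiguration; it says nothing about the segmentation fault itself.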

archlitchi commented 2 years ago

Could you run env inside the container and share the output?
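If the full env output is long, a small filter can narrow it to the GPU-related variables. This is a minimal sketch: the function name `gpu_related_env` is hypothetical, and the prefixes `NVIDIA` and `CUDA` are an assumption about which variables matter here. The vGPU scheduler may inject variables under other names, so the full env output is still the safer thing to post.

```python
import os


def gpu_related_env(environ, prefixes=("NVIDIA", "CUDA")):
    """Return only the variables whose names start with one of the prefixes."""
    # str.startswith accepts a tuple, so one call covers every prefix.
    return {k: v for k, v in sorted(environ.items()) if k.startswith(prefixes)}


if __name__ == "__main__":
    # Print in the same NAME=value form that `env` uses.
    for name, value in gpu_related_env(os.environ).items():
        print(f"{name}={value}")
```

Run inside the container, this prints entries such as NVIDIA_VISIBLE_DEVICES, which the NVIDIA container runtime uses to decide which GPUs are exposed.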

detongz commented 2 years ago

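I retried today. When the cluster has more than one GPU, vGPU works and jobs run without problems. With only one GPU, the error above occurs. Could it be a parameter setting issue? If I can reproduce it after work, I'll post the env output. Thanks a lot!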

archlitchi commented 2 years ago

OK. Let's continue the discussion on Slack; it will be more convenient.