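I am using v0.9.0.0. After building it and deploying it as a daemon service to the GPU nodes, it reports that device-split-count and several other parameters are undefined. After removing those parameters the pod runs normally on the GPU node, but the logs show that NVML cannot be found. The GPU node has a P100. Please get in touch; any guidance is appreciated.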

4paradigm / k8s-vgpu-scheduler

The OpenAIOS vGPU device plugin for Kubernetes originated from the OpenAIOS project. It virtualizes GPU device memory so that applications can access a memory space larger than the GPU's physical capacity, and it is designed to make extended device memory easy to use for AI workloads.

Apache License 2.0

513 stars, 93 forks

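I am using v0.9.0.0. After building it and deploying it as a daemon service to the GPU nodes, it reports that device-split-count and several other parameters are undefined. After removing those parameters the pod runs normally on the GPU node, but the logs show that NVML cannot be found. The GPU node has a P100. Please get in touch; any guidance is appreciated. #9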

Closed. AlexPei closed this issue 2 years ago

AlexPei commented 3 years ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

Common error checking:

[ ] The output of nvidia-smi -a on your host
[ ] Your docker configuration file (e.g: /etc/docker/daemon.json)
[ ] The k8s-device-plugin container logs
[ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

[ ] Docker version from docker version
[ ] Docker command, image and tag used
[ ] Kernel version from uname -a
[ ] Any relevant kernel output lines from dmesg
[ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
[ ] NVIDIA container library version from nvidia-container-cli -V
[ ] NVIDIA container library logs (see troubleshooting)
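The checklist above can be gathered in one pass. The following is a minimal sketch, not part of the project: it assumes the listed tools are on the host's PATH, and the names `CHECKLIST_COMMANDS` and `collect` are hypothetical. It only runs the distro-independent commands quoted in the template and marks any missing tool as skipped instead of failing.

```python
import shutil
import subprocess

# Commands copied from the checklist above. The package queries (dpkg / rpm)
# differ per distribution and are left out of this sketch.
CHECKLIST_COMMANDS = {
    "nvidia-smi": ["nvidia-smi", "-a"],
    "docker version": ["docker", "version"],
    "kernel": ["uname", "-a"],
    "nvidia-container-cli": ["nvidia-container-cli", "-V"],
}


def collect(commands, timeout=30):
    """Run each command; return {label: combined output, or why it was skipped}."""
    report = {}
    for label, argv in commands.items():
        # Skip tools that are not installed rather than raising.
        if shutil.which(argv[0]) is None:
            report[label] = f"skipped: {argv[0]} not found on PATH"
            continue
        try:
            done = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            report[label] = f"skipped: timed out after {timeout}s"
            continue
        # Keep stderr as well, since driver errors are often printed there.
        report[label] = (done.stdout + done.stderr).strip()
    return report


if __name__ == "__main__":
    for label, output in collect(CHECKLIST_COMMANDS).items():
        print(f"===== {label} =====")
        print(output)
```

Running the script on the GPU node and pasting its output into the issue covers most of the "Common error checking" items. The device plugin container logs and kubelet logs still have to be attached separately.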

archlitchi commented 2 years ago

Don't use 0.9.0.0. It is best to use the latest version of the image.