NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

mps server error Failed to start : invalid argument #685

Open aphrodite1028 opened 2 weeks ago

aphrodite1028 commented 2 weeks ago

1. Quick Debug Information

2. Issue or feature description

I deployed k8s-device-plugin version 0.15.0 in Kubernetes. Running matrixMul in a container fails with this error:

[Matrix Multiply Using CUDA] - Starting...
CUDA error at ../../common/inc/helper_cuda.h:708 code=30(cudaErrorUnknown) "cudaGetDeviceCount(&device_count)"

dmesg -T also shows a message like this:

Cannot map memory with base addr 0x2019c00000 and size of 0x200 pages

The mps-control-daemon log shows:

[2024-04-29 11:22:10.421 Control    73] Starting new server 95 for user 0
[2024-04-29 11:22:10.425 Control    73] Accepting connection...
[2024-04-29 11:22:10.441 Control    73] Server encountered a fatal exception. Shutting down
[2024-04-29 11:22:10.446 Control    73] Server 95 exited with status 1
[2024-04-29 11:22:10.447 Control    73] Starting new server 98 for user 0
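
The log above suggests a restart loop: the control daemon starts a server, the server hits a fatal exception and exits with status 1, and a new server is started. A minimal sketch for summarizing such a log follows. It assumes every line has the `[timestamp source pid] message` shape seen in this excerpt; the names `summarize` and `SAMPLE` are hypothetical helpers for illustration, not part of any NVIDIA tool.

```python
import re

# Assumed line shape, taken from the excerpt in this issue:
# [2024-04-29 11:22:10.446 Control    73] Server 95 exited with status 1
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<source>\w+)\s+(?P<pid>\d+)\]\s+(?P<msg>.*)$"
)
START_RE = re.compile(r"Starting new server (\d+) for user (\d+)")
EXIT_RE = re.compile(r"Server (\d+) exited with status (-?\d+)")

# Sample copied from the mps-control-daemon log in this issue.
SAMPLE = """\
[2024-04-29 11:22:10.421 Control    73] Starting new server 95 for user 0
[2024-04-29 11:22:10.425 Control    73] Accepting connection...
[2024-04-29 11:22:10.441 Control    73] Server encountered a fatal exception. Shutting down
[2024-04-29 11:22:10.446 Control    73] Server 95 exited with status 1
[2024-04-29 11:22:10.447 Control    73] Starting new server 98 for user 0
"""


def summarize(log_text):
    """Collect started server IDs, exit statuses, and fatal-exception count."""
    started = []   # server IDs in the order they were started
    exits = {}     # server ID -> exit status
    fatal = 0      # number of "fatal exception" messages
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed shape
        msg = match.group("msg")
        start = START_RE.search(msg)
        exited = EXIT_RE.search(msg)
        if start:
            started.append(int(start.group(1)))
        elif exited:
            exits[int(exited.group(1))] = int(exited.group(2))
        elif "fatal exception" in msg:
            fatal += 1
    return {"started": started, "exits": exits, "fatal": fatal}


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Several started servers with non-zero exit statuses in quick succession would be consistent with the server failing during initialization rather than under load.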

The nvidia-cuda-mps-server log shows:

Other 425] Startup
Other 425] Connecting to control daemon on socket: /mps/nvidia.com/gpu.shared/pipe/control
Other 425] Initializing server process
Legacy Server 425] Failed to start : invalid argument

The output of rpm -qa | grep nvidia is:

libnvidia-container-tools-1.14.3-1.x86_64
libnvidia-container1-1.14.3-1.x86_64
nvidia-container-runtime-3.14.0-1.noarch
pcp-pmda-nvidia-gpu-5.0.2-5.el8.x86_64
nvidia-container-toolkit-1.14.3-1.x86_64
nvidia-container-toolkit-base-1.14.3-1.x86_64