I deployed k8s-device-plugin version 0.15.0 in Kubernetes. Running matrixMul in a container fails with this error:
[Matrix Multiply Using CUDA] - Starting...
CUDA error at ../../common/inc/helper_cuda.h:708 code=30(cudaErrorUnknown) "cudaGetDeviceCount(&device_count)"
On the host, dmesg -T shows a message like this:
Cannot map memory with base addr 0x2019c00000 and size of 0x200 pages
The mps-control-daemon log shows:
[2024-04-29 11:22:10.421 Control 73] Starting new server 95 for user 0
[2024-04-29 11:22:10.425 Control 73] Accepting connection...
[2024-04-29 11:22:10.441 Control 73] Server encountered a fatal exception. Shutting down
[2024-04-29 11:22:10.446 Control 73] Server 95 exited with status 1
[2024-04-29 11:22:10.447 Control 73] Starting new server 98 for user 0
The cuda-nvidia-mps-server log shows:
Other 425] Startup
Other 425] Connecting to control daemon on socket: /mps/nvidia.com/gpu.shared/pipe/control
Other 425] Initializing server process
Legacy Server 425] Failed to start : invalid argument
Output of rpm -qa | grep nvidia: