Open qingyun1989 opened 3 months ago
Can this POD start normally without using hami-device-plugin in your environment?
cuda
Can this POD start normally without using hami-device-plugin in your environment? Yes, it starts normally when launched directly with "docker run", without hami-device-plugin.
I will post the relevant information:
1、kubernetes and 4Paradigm version
2、kubernetes version: v1.20.4
4paradigm component version: 4pd-k8s-vdevice: v2.0 and kube-scheduler:v1.20.4
3、gpu node Driver info
NVIDIA-SMI 550.78 Driver Version: 550.78 CUDA Version: 12.4
docker version 20.10.11
4、GPU-related packages installed on the node
ii libnvidia-container-tools 1.7.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.7.0-1 amd64 NVIDIA container runtime library
ii nvidia-container-toolkit 1.7.0-1 amd64 NVIDIA container runtime hook
ii nvidia-docker2 2.8.0-1 all nvidia-docker CLI wrapper
5、When the pod is scheduled through kubernetes, the corresponding gpu pod fails to start and dockerd logs the following error
Jul 30 17:10:35 ai-prd-gpu-node-12-54 dockerd[66725]: time="2024-07-30T17:10:35.149294280+08:00" level=error msg="Handler for POST /v1.40/containers/d921cbe5344a1aa7bfdd43e66409c0f14dd324a0f5a04c0c0bc5b1b91001eab8/start returned error: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: GPU-db948ce3-0dad-6e8f-3113-993710c85929: unknown device: unknown"
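One way to narrow down an "unknown device" error like the one above is to pull the GPU UUID out of the log and check whether the node's driver actually reports a GPU with that UUID. The following is only a sketch: the helper names `extract_gpu_uuids` and `uuid_known` are made up for illustration, and the `docker` unit name in the usage comment is an assumption about how dockerd is run on the node.

```shell
# extract_gpu_uuids: read log text on stdin and print each distinct
# GPU UUID (GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx), one per line.
extract_gpu_uuids() {
  grep -oE 'GPU-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' | sort -u
}

# uuid_known UUID LISTING: succeed if UUID appears in LISTING, where
# LISTING is expected to be the text printed by `nvidia-smi -L`.
uuid_known() {
  printf '%s\n' "$2" | grep -qF -- "$1"
}

# Example usage on the GPU node (assumes dockerd logs to the journal
# under a unit named "docker"; adjust to your setup):
#   uuid=$(journalctl -u docker --since today | extract_gpu_uuids | head -n 1)
#   if uuid_known "$uuid" "$(nvidia-smi -L)"; then
#     echo "$uuid is present on this node"
#   else
#     echo "$uuid is NOT reported by nvidia-smi on this node"
#   fi
```

If the UUID from the error is not in the `nvidia-smi -L` output of the node where the pod landed, the UUID handed to the container runtime may belong to a different node or to a stale device registration; that is a hypothesis to check, not a confirmed diagnosis.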
1. Issue or feature description
2. Steps to reproduce the issue
Error: failed to start container "ocr-service": Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: device error: GPU-23a1320e-7dad-e76a-1d08-50ac3be37cbc: unknown device: unknown
3. Information to attach (optional if deemed irrelevant)
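Common error checking:
- The output of nvidia-smi -a on your host
- Your docker configuration file (e.g. /etc/docker/daemon.json)
- The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)
Additional information that might help better understand your environment and reproduce the bug:
- Docker version from docker version
- Kernel version from uname -a
- Any relevant kernel output lines from dmesg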
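The items in the checklist above can be gathered in one pass. This is a rough sketch rather than an official collection script: the function names `run_if_present` and `collect_gpu_diagnostics` are hypothetical, the line limits are arbitrary, and some commands (journalctl, dmesg) may need to be run as root to return anything useful.

```shell
# run_if_present LABEL CMD [ARGS...]: print a header, then run CMD if it
# is installed; otherwise note that it was skipped. Never aborts the caller.
run_if_present() {
  label="$1"; shift
  printf '===== %s =====\n' "$label"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 || printf 'command exited with status %s\n' "$?"
  else
    printf 'skipped: %s not found\n' "$1"
  fi
}

# collect_gpu_diagnostics: print everything the checklist asks for.
collect_gpu_diagnostics() {
  run_if_present "nvidia-smi -a" nvidia-smi -a
  run_if_present "/etc/docker/daemon.json" cat /etc/docker/daemon.json
  run_if_present "kubelet logs (newest first)" journalctl -r -u kubelet --no-pager -n 200
  run_if_present "docker version" docker version
  run_if_present "uname -a" uname -a
  run_if_present "dmesg (last 100 lines)" sh -c 'dmesg | tail -n 100'
}

# Example usage on the GPU node:
#   collect_gpu_diagnostics > gpu-diagnostics.txt
```

Redirecting the output to a single file keeps the attachment small enough to add to the issue and preserves the order of the checklist.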