qingyun1989 commented 3 months ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

2. Steps to reproduce the issue

Error: failed to start container "ocr-service": Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: device error: GPU-23a1320e-7dad-e76a-1d08-50ac3be37cbc: unknown device: unknown
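The "unknown device" part of this message means nvidia-container-cli was handed a GPU UUID that it could not match to any device on the node. One way to check this, sketched below, is to pull the UUID out of the error text and compare it with the UUIDs the host driver reports. The helper names `extract_gpu_uuids` and `uuid_known` are made up for illustration, and the comparison step assumes `nvidia-smi` is on the PATH of the GPU node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every GPU UUID found in the text on stdin.
extract_gpu_uuids() {
  grep -oE 'GPU-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' | sort -u
}

# Hypothetical helper: succeed if UUID $1 is one of the lines on stdin.
uuid_known() {
  grep -qxF -- "$1"
}

# Error text copied from the report above.
err="nvidia-container-cli: device error: GPU-23a1320e-7dad-e76a-1d08-50ac3be37cbc: unknown device: unknown"
uuid=$(printf '%s\n' "$err" | extract_gpu_uuids)
echo "UUID in error: $uuid"

# Only meaningful on the GPU node itself (assumption: nvidia-smi is installed there).
if command -v nvidia-smi >/dev/null 2>&1; then
  if nvidia-smi --query-gpu=uuid --format=csv,noheader | uuid_known "$uuid"; then
    echo "host driver lists this UUID"
  else
    echo "host driver does NOT list this UUID"
  fi
fi
```

If the UUID is missing from the host list, the device ID passed to the container probably does not belong to this node or this driver instance. That is one possible explanation, not a confirmed diagnosis.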

3. Information to attach (optional if deemed irrelevant)

Common error checking:

[ ] The output of nvidia-smi -a on your host
[ ] Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
[ ] The vgpu-device-plugin container logs
[ ] The vgpu-scheduler container logs
[ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

[ ] Docker version from docker version
[ ] Docker command, image and tag used
[ ] Kernel version from uname -a
[ ] Any relevant kernel output lines from dmesg

chaunceyjiang commented 3 months ago

Can this POD start normally without using hami-device-plugin in your environment?

qingyun1989 commented 3 months ago


> Can this POD start normally without using hami-device-plugin in your environment?

Yes. The container starts normally when launched directly with "docker run", without hami-device-plugin.

Here is the relevant information:

1. Kubernetes and 4Paradigm versions

2. Kubernetes version: v1.20.4
4Paradigm component versions: 4pd-k8s-vdevice: v2.0 and kube-scheduler: v1.20.4

3. GPU node driver info

NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4  
docker version  20.10.11

4. GPU-related packages installed on the node

dpkg -l | egrep "nvidia-container|nvidia-docker2"

ii  libnvidia-container-tools              1.7.0-1                           amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64             1.7.0-1                           amd64        NVIDIA container runtime library
ii  nvidia-container-toolkit               1.7.0-1                           amd64        NVIDIA container runtime hook
ii  nvidia-docker2                         2.8.0-1                           all          nvidia-docker CLI wrapper

5. When scheduled through Kubernetes, the GPU pod reports the following error event:

Jul 30 17:10:35 ai-prd-gpu-node-12-54  dockerd[66725]: time="2024-07-30T17:10:35.149294280+08:00" level=error msg="Handler for POST /v1.40/containers/d921cbe5344a1aa7bfdd43e66409c0f14dd324a0f5a04c0c0bc5b1b91001eab8/start returned error: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: GPU-db948ce3-0dad-6e8f-3113-993710c85929: unknown device: unknown"
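To see which device ID the failing container was actually given, one option is to look at its environment. This is a rough sketch that assumes the UUID reaches the NVIDIA runtime hook through the `NVIDIA_VISIBLE_DEVICES` environment variable, which is the usual mechanism but has not been confirmed for this setup. The helper name `visible_devices` and the sample environment are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of NVIDIA_VISIBLE_DEVICES
# from KEY=VALUE lines read on stdin.
visible_devices() {
  sed -n 's/^NVIDIA_VISIBLE_DEVICES=//p'
}

# Made-up environment for illustration; the UUID is the one from the log line above,
# but whether the container really carries it in this variable is an assumption.
sample_env='PATH=/usr/bin
NVIDIA_VISIBLE_DEVICES=GPU-db948ce3-0dad-6e8f-3113-993710c85929'
printf '%s\n' "$sample_env" | visible_devices

# On the node, the same filter could be applied to the failed container from the log:
# docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' d921cbe5344a | visible_devices
```

The printed value can then be compared with the output of `nvidia-smi -L` on the same node.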

Project-HAMi / HAMi

Pods scheduled with this plugin end up in RunContainerError status; the event shows "nvidia-container-cli: device error", while starting the container with docker run works normally #411
