NVIDIA / kubevirt-gpu-device-plugin

NVIDIA k8s device plugin for Kubevirt
BSD 3-Clause "New" or "Revised" License

GPU passthrough is not working #123

Open iveskim opened 2 days ago

iveskim commented 2 days ago

OS: Ubuntu 22.04.3 LTS

uname -r: 5.15.0-78-generic

nvidia-smi:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

/proc/cmdline:

BOOT_IMAGE=/boot/vmlinuz-5.15.0-78-generic root=UUID=48a7eec0-0c0f-424a-8794-67bfeaa74d64 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto intel_iommu=on vfio_pci.disable_vga=1 vfio-pci.ids=10de:2684
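One way to cross-check these boot parameters against the `lspci` listing is to extract the `vfio-pci.ids` list from the command line. The snippet below is only a sketch, not part of the plugin; `vfio_ids_from_cmdline` is a hypothetical helper name. It treats `-` and `_` in the parameter name as interchangeable, which is how the kernel matches module parameter names.

```python
def vfio_ids_from_cmdline(cmdline):
    """Return the vendor:device IDs listed in vfio-pci.ids on a kernel command line."""
    ids = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # The kernel accepts both vfio-pci.ids and vfio_pci.ids.
        if sep and key.replace("-", "_") == "vfio_pci.ids":
            ids.extend(v for v in value.split(",") if v)
    return ids


# Tail of the command line quoted in this report.
SAMPLE_CMDLINE = "intel_iommu=on vfio_pci.disable_vga=1 vfio-pci.ids=10de:2684"

print(vfio_ids_from_cmdline(SAMPLE_CMDLINE))  # ['10de:2684']
```

In this report the list contains only the GPU function (10de:2684). The audio function (10de:22ba) is not listed, which may be why it is bound to snd_hda_intel rather than vfio-pci.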

lspci -nnk -d 10de:

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:167c]
    Kernel driver in use: vfio-pci
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22ba] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:167c]
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel
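To spot functions that are not bound to vfio-pci, the `lspci -nnk` output can be parsed mechanically. The following is a minimal sketch that assumes the usual multi-line `lspci -nnk` layout without a PCI domain prefix; `parse_lspci` and `not_on_vfio` are hypothetical helper names.

```python
import re

# Device header line, e.g.
# "01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684] (rev a1)"
HEADER = re.compile(
    r"^(?P<addr>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s.*\[(?P<id>[0-9a-f]{4}:[0-9a-f]{4})\]"
)


def parse_lspci(text):
    """Return one dict per PCI function with its address, vendor:device ID and bound driver."""
    devices = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            devices.append({"addr": match.group("addr"), "id": match.group("id"), "driver": None})
        elif devices and "Kernel driver in use:" in line:
            devices[-1]["driver"] = line.split(":", 1)[1].strip()
    return devices


def not_on_vfio(devices):
    """Return the addresses of functions whose bound driver is not vfio-pci."""
    return [d["addr"] for d in devices if d["driver"] != "vfio-pci"]


# Output from the first post in this report.
SAMPLE_LSPCI = """\
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684] (rev a1)
\tSubsystem: NVIDIA Corporation Device [10de:167c]
\tKernel driver in use: vfio-pci
\tKernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22ba] (rev a1)
\tSubsystem: NVIDIA Corporation Device [10de:167c]
\tKernel driver in use: snd_hda_intel
\tKernel modules: snd_hda_intel
"""

print(not_on_vfio(parse_lspci(SAMPLE_LSPCI)))  # ['01:00.1']
```

For the listing above, only the audio function 01:00.1 is reported, since it is bound to snd_hda_intel.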

k8s pod:

kube-system nvidia-kubevirt-gpu-dp-daemonset-vt52d 0/1 CrashLoopBackOff

pod error:

`Events:
Type Reason Age From Message
Normal SandboxChanged 52m (x2 over 53m) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 50m (x4 over 52m) kubelet Container image "harbor.gcs.local/nvidia/kubevirt-gpu-device-plugin:v1.2.4" already present on machine
Normal Created 50m (x4 over 52m) kubelet Created container nvidia-kubevirt-gpu-dp-ctr
Warning Failed 50m (x4 over 52m) kubelet Error: failed to start container "nvidia-kubevirt-gpu-dp-ctr": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
Warning BackOff 49m (x11 over 52m) kubelet Back-off restarting failed container nvidia-kubevirt-gpu-dp-ctr in pod nvidia-kubevirt-gpu-dp-daemonset-vt52d_kube-system(47708979-6fa0-4947-a108-b96f01ed9eda)
Normal SandboxChanged 40m (x2 over 41m) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Created 38m (x4 over 40m) kubelet Created container nvidia-kubevirt-gpu-dp-ctr
Warning Failed 38m (x4 over 40m) kubelet Error: failed to start container "nvidia-kubevirt-gpu-dp-ctr": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
Normal Pulled 37m (x5 over 40m) kubelet Container image "harbor.gcs.local/nvidia/kubevirt-gpu-device-plugin:v1.2.4" already present on machine
Warning BackOff 66s (x172 over 40m) kubelet Back-off restarting failed container nvidia-kubevirt-gpu-dp-ctr in pod nvidia-kubevirt-gpu-dp-daemonset-vt52d_kube-system(47708979-6fa0-4947-a108-b96f01ed9eda)`

nvrm error: NVRM: The NVIDIA probe routine was not called for 1 device(s)

iveskim commented 2 days ago
  1. With two GPU cards installed, passing through one GPU works without problems.
  2. With two GPU cards installed, passing through both GPUs fails with the error below.

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3080 Ti] [10de:2208] (rev a1)
    Subsystem: LeadTek Research Inc. GA102 [GeForce RTX 3080 Ti] [107d:2208]
    Kernel driver in use: vfio-pci
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
    Subsystem: LeadTek Research Inc. GA102 High Definition Audio Controller [107d:2208]
    Kernel driver in use: vfio-pci
    Kernel modules: snd_hda_intel
a1:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3080 Ti] [10de:2208] (rev a1)
    Subsystem: LeadTek Research Inc. GA102 [GeForce RTX 3080 Ti] [107d:2208]
    Kernel driver in use: vfio-pci
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
a1:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
    Subsystem: LeadTek Research Inc. GA102 High Definition Audio Controller [107d:2208]
    Kernel driver in use: vfio-pci
    Kernel modules: snd_hda_intel

`Events:
Type Reason Age From Message
Warning NodeNotReady 9m22s node-controller Node is not ready
Normal SandboxChanged 3m49s (x2 over 4m30s) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 2m15s (x4 over 3m47s) kubelet Container image "harbor.gcs.local/nvidia/kubevirt-gpu-device-plugin:v1.2.6" already present on machine
Normal Created 2m15s (x4 over 3m46s) kubelet Created container nvidia-kubevirt-gpu-dp-ctr
Warning Failed 2m11s (x4 over 3m42s) kubelet Error: failed to start container "nvidia-kubevirt-gpu-dp-ctr": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
Warning BackOff 67s (x11 over 3m39s) kubelet Back-off restarting failed container nvidia-kubevirt-gpu-dp-ctr in pod nvidia-kubevirt-gpu-dp-daemonset-wjjb9_kube-system(563dfb3b-93a2-437a-89d3-81007d4e4e02)`

**This happens because the nvidia module is not loaded: `lsmod | grep nvidia` returns nothing.**
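The same check can be scripted against the contents of `/proc/modules`, which is the file `lsmod` reads. This is a rough sketch only; `nvidia_modules` is a hypothetical helper name, and the sample text is made up to mirror the state described above (vfio-pci present, no nvidia module).

```python
def nvidia_modules(proc_modules_text):
    """Return the sorted names of loaded nvidia kernel modules, given /proc/modules content."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name = fields[0]  # the first column of /proc/modules is the module name
        if name == "nvidia" or name.startswith("nvidia_"):
            names.add(name)
    return sorted(names)


# Illustrative /proc/modules content, not taken from the reporter's host.
SAMPLE_MODULES = """\
vfio_pci 16384 0 - Live 0x0000000000000000
snd_hda_intel 53248 0 - Live 0x0000000000000000
"""

print(nvidia_modules(SAMPLE_MODULES))  # []
```

On a real host the text would come from `open("/proc/modules").read()`. An empty result matches the `nvml error: driver not loaded` message in the pod events.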