NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Error on PCI Passthrough using new L40 Openshift 4.1x #548

Closed clrfuerst closed 1 year ago

clrfuerst commented 1 year ago

1. Quick Debug Checklist

kubevirt-hyperconfig:

```yaml
spec:
  permittedHostDevices:
    pciHostDevices:
```

```
$ oc describe node XXXX
Capacity:
  nvidia.com/26b5:  1
Allocatable:
  nvidia.com/26b5:  1
```
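For context, a full `pciHostDevices` entry in the HyperConverged CR typically looks like the sketch below. The selector and resource name are inferred from the device ID `26b5` reported above (`10DE` is NVIDIA's PCI vendor ID); verify them against your own cluster before applying.

```yaml
# Sketch of a permittedHostDevices entry for a device with id 26b5.
# Values are assumptions based on the logs in this issue, not a
# confirmed working configuration.
spec:
  permittedHostDevices:
    pciHostDevices:
      - pciDeviceSelector: "10DE:26B5"
        resourceName: nvidia.com/26b5
        # Set when the GPU Operator's device plugin (not KubeVirt)
        # advertises the resource to the kubelet.
        externalResourceProvider: true
```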

1. Issue or feature description

Getting the following error when trying to use an L40 GPU with PCI passthrough to a virtual machine: the GPU is not assigned and the VM will not start.

From the nvidia-sandbox-device-plugin-daemonset:

```
2023/07/10 19:41:03 Nvidia device 0000:e2:00.0
2023/07/10 19:41:03 Iommu Group 128
2023/07/10 19:41:03 Device Id 26b5
2023/07/10 19:41:03 Error accessing file path "/sys/bus/mdev/devices": lstat /sys/bus/mdev/devices: no such file or directory
2023/07/10 19:41:03 Iommu Map map[128:[{0000:e2:00.0}]]
2023/07/10 19:41:03 Device Map map[26b5:[128]]
2023/07/10 19:41:03 vGPU Map map[]
2023/07/10 19:41:03 GPU vGPU Map map[]
2023/07/10 19:41:03 Error: Could not find device name for device id: 26b5
2023/07/10 19:41:03 DP Name 26b5
2023/07/10 19:41:03 Devicename 26b5
2023/07/10 19:41:03 26b5 Device plugin server ready
```

The virt-launcher pod then fails while trying to allocate the device:

```
server error. command SyncVMI failed: "failed to create GPU host-devices: the number of GPU/s do not match the number of devices:\nGPU: [{26b5 nvidia.com/26b5 }]\nDevice: []"
```

```
{"component":"virt-launcher","level":"warning","msg":"PCI_RESOURCE_NVIDIA_COM_26B5 not set for resource nvidia.com/26b5","pos":"addresspool.go:50","timestamp":"2023-07-11T16:11:34.667518Z"}
```

2. Steps to reproduce the issue

Launch a VM using an L40 GPU via PCI passthrough, rather than an A40 GPU (which works with the same setup).

cdesiniotis commented 1 year ago

@clrfuerst can you try the latest kubevirt-gpu-device-plugin image, v1.2.2? Set sandboxDevicePlugin.version=v1.2.2 in ClusterPolicy. Note that the PCI ID database was updated in v1.2.2, so the L40 GPU will be advertised under its device name rather than its device ID -- you will have to update your hyperconverged configuration accordingly.
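Applied to the ClusterPolicy, the suggested change would look roughly like the fragment below. The resource name `gpu-cluster-policy` is an assumption (it is a common default on OpenShift); check the actual name with `oc get clusterpolicy` first.

```yaml
# Sketch: pin the sandbox device plugin to v1.2.2 as suggested above.
# "gpu-cluster-policy" is an assumed name and may differ in your cluster.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  sandboxDevicePlugin:
    version: v1.2.2
```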

clrfuerst commented 1 year ago

Thank you for the pointer, this seems to have done the trick.