NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

Error on PCI Passthrough using new L40 Openshift 4.1x #548

Closed · clrfuerst closed this issue 1 year ago

clrfuerst commented 1 year ago

1. Quick Debug Checklist

HyperConverged config (kubevirt-hyperconverged):

    spec:
      permittedHostDevices:
        pciHostDevices:

oc describe node XXXX:

    Capacity:
      nvidia.com/26b5:  1
    Allocatable:
      nvidia.com/26b5:  1
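
For reference, a minimal sketch of what a complete pciHostDevices entry looks like in the HyperConverged CR; the openshift-cnv namespace, the 10DE:26B5 vendor:device selector, and the externalResourceProvider flag are inferred from the device ID above rather than copied from the cluster.

```yaml
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv          # default OpenShift Virtualization namespace (assumed)
spec:
  permittedHostDevices:
    pciHostDevices:
      # vendor:device ID of the L40 (NVIDIA vendor ID 10DE, device ID 26B5) -- inferred
      - pciDeviceSelector: "10DE:26B5"
        # must match the resource name the sandbox device plugin advertises on the node
        resourceName: nvidia.com/26b5
        # the resource is published by the GPU Operator's device plugin, not by KubeVirt itself
        externalResourceProvider: true
```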

1. Issue or feature description

Getting the following errors when trying to pass an L40 GPU through to a Virtual Machine via PCI passthrough: the GPU is never assigned and the VM will not start.

From the nvidia-sandbox-device-plugin-daemonset:

    2023/07/10 19:41:03 Nvidia device 0000:e2:00.0
    2023/07/10 19:41:03 Iommu Group 128
    2023/07/10 19:41:03 Device Id 26b5
    2023/07/10 19:41:03 Error accessing file path "/sys/bus/mdev/devices": lstat /sys/bus/mdev/devices: no such file or directory
    2023/07/10 19:41:03 Iommu Map map[128:[{0000:e2:00.0}]]
    2023/07/10 19:41:03 Device Map map[26b5:[128]]
    2023/07/10 19:41:03 vGPU Map map[]
    2023/07/10 19:41:03 GPU vGPU Map map[]
    2023/07/10 19:41:03 Error: Could not find device name for device id: 26b5
    2023/07/10 19:41:03 DP Name 26b5
    2023/07/10 19:41:03 Devicename 26b5
    2023/07/10 19:41:03 26b5 Device plugin server ready

The virt-launcher pod then fails while trying to allocate the device:

    server error. command SyncVMI failed: "failed to create GPU host-devices: the number of GPU/s do not match the number of devices:\nGPU: [{26b5 nvidia.com/26b5 }]\nDevice: []"

{"component":"virt-launcher","level":"warning","msg":"PCI_RESOURCE_NVIDIA_COM_26B5 not set for resource nvidia.com/26b5","pos":"addresspool.go:50","timestamp":"2023-07-11T16:11:34.667518Z"}

2. Steps to reproduce the issue

Launch a VM with PCI passthrough of an L40 GPU; the same configuration works with an A40 GPU.

cdesiniotis commented 1 year ago

@clrfuerst can you try using the latest kubevirt-gpu-device-plugin image, v1.2.2? Set sandboxDevicePlugin.version=v1.2.2 in ClusterPolicy. Note: the PCI ID database was updated in v1.2.2, so the L40 GPU will now be advertised by its device name rather than its device ID; you will have to update your HyperConverged configuration accordingly. A sketch of both changes follows below.
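
In case it is useful, a sketch of the two changes, assuming the ClusterPolicy is named cluster-policy; the exact resource name the upgraded plugin advertises for the L40 is not shown in this thread, so take it from the node's Allocatable list after the upgrade.

```yaml
# GPU Operator ClusterPolicy: pin the sandbox device plugin to v1.2.2
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy              # assumed name; check with `oc get clusterpolicy`
spec:
  sandboxDevicePlugin:
    version: v1.2.2
---
# HyperConverged CR: switch resourceName from the raw device ID to the
# device name the upgraded plugin now reports under the node's Allocatable
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  permittedHostDevices:
    pciHostDevices:
      - pciDeviceSelector: "10DE:26B5"
        resourceName: nvidia.com/<new-device-name>   # placeholder; copy from `oc describe node`
        externalResourceProvider: true
```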

clrfuerst commented 1 year ago

Thank you for the pointer; this seems to have done the trick.