NVIDIA / kubevirt-gpu-device-plugin

NVIDIA k8s device plugin for Kubevirt
BSD 3-Clause "New" or "Revised" License

Device plugin can't detect the vgpus #78

Open esposem opened 10 months ago

esposem commented 10 months ago

I am running OpenShift 4.13 with OpenShift Virtualization (CNV) installed. I installed the NVIDIA drivers through https://github.com/vladikr/ocp-nvidia-vgpu-installer, and they work as expected.

I added the following to the HyperConverged YAML:

spec:
  mediatedDevicesConfiguration:
    mediatedDevicesTypes: 
    - nvidia-258
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: "GRID RTX6000-3Q"
      resourceName: "nvidia.com/GRID_RTX6000-3Q"
      externalResourceProvider: true

I also checked that the nvidia-258 type exists and has free instances:

$ cd /sys/bus/pci/devices/0000:05:00.0/mdev_supported_types
$ cat nvidia-258/available_instances 
8
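
For reference, the same check can be scripted across every mdev type the GPU exposes. The following is only a minimal sketch: the function name `available_mdev_instances` is hypothetical, and it assumes each type directory under `mdev_supported_types` contains an `available_instances` file and, optionally, a `name` file (the human-readable name that `mdevNameSelector` is matched against).

```python
from pathlib import Path
from typing import Dict, Optional, Tuple


def available_mdev_instances(supported_types_dir: str) -> Dict[str, Tuple[Optional[str], int]]:
    """Map each mdev type (e.g. 'nvidia-258') to (name, available_instances).

    Hypothetical helper, not part of the device plugin or mdevctl.
    """
    result: Dict[str, Tuple[Optional[str], int]] = {}
    for type_dir in sorted(Path(supported_types_dir).iterdir()):
        instances_file = type_dir / "available_instances"
        if not instances_file.is_file():
            continue  # not an mdev type directory
        name_file = type_dir / "name"
        # The 'name' attribute is optional, so fall back to None when absent.
        name = name_file.read_text().strip() if name_file.is_file() else None
        result[type_dir.name] = (name, int(instances_file.read_text().strip()))
    return result


if __name__ == "__main__":
    # Path taken from the commands above; adjust the PCI address for your GPU.
    path = "/sys/bus/pci/devices/0000:05:00.0/mdev_supported_types"
    if Path(path).is_dir():
        for mdev_type, (name, free) in available_mdev_instances(path).items():
            print(f"{mdev_type}: name={name!r} available_instances={free}")
```

A type that reports 0 available instances cannot be used to create further mdev devices.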

Then I created two mdev devices:

$ UUID=$(uuidgen);
$ echo "${UUID}" > nvidia-258/create;
$ mdevctl define --auto --uuid $UUID;
$ mdevctl list

Then I installed the kubevirt-gpu-device-plugin, but when I inspect its log on the node I see:

2023/08/29 09:36:53 Not a device, continuing
2023/08/29 09:36:53 Nvidia device 0000:05:00.0
2023/08/29 09:36:53 Not a device, continuing
2023/08/29 09:36:53 Gpu id is 0000:05:00.0
2023/08/29 09:36:53 Vgpu id is GRID_RTX6000-3Q
2023/08/29 09:36:53 Gpu id is 0000:05:00.0
2023/08/29 09:36:53 Vgpu id is GRID_RTX6000-3Q
2023/08/29 09:36:53 Iommu Map map[]
2023/08/29 09:36:53 Device Map map[]
2023/08/29 09:36:53 vGPU Map map[GRID_RTX6000-3Q:[{21ad712a-f454-498c-84d5-4116f3723c01} {43922f20-6573-4d6b-9223-a2ca02f83b29}]]
2023/08/29 09:36:53 GPU vGPU Map map[0000:05:00.0:[21ad712a-f454-498c-84d5-4116f3723c01 43922f20-6573-4d6b-9223-a2ca02f83b29]]
2023/08/29 09:36:53 Could not find NVIDIA device with id: GRID_RTX6000-3Q
2023/08/29 09:36:53 DP Name GRID_RTX6000-3Q
2023/08/29 09:36:53 Devicename GRID_RTX6000-3Q
2023/08/29 09:36:58 [GRID_RTX6000-3Q] Error registering with device plugin manager: context deadline exceeded
2023/08/29 09:36:58 Error starting GRID_RTX6000-3Q device plugin: context deadline exceeded

As a result, I can't run any VM/VMI: when I create one with the following in its YAML, it is never scheduled because no vGPU is found to be available:

spec:
      gpus:
      - deviceName: nvidia.com/GRID_RTX6000-3Q
        name: vgpu1

What did I do wrong?

esposem commented 10 months ago

@cdesiniotis could you please take a look at this?

rthallisey commented 5 months ago

@esposem which version of the device plugin were you using? This error usually appears when the pci.ids file in the device plugin is out of date. It should go away in newer versions.
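
To illustrate what "out of date" means here, below is a minimal sketch of a lookup against a pci.ids file. It is not the plugin's actual code; the function name `lookup_device_name` and the sample excerpt are made up for illustration. It relies only on the standard pci.ids layout: vendor lines start at column 0, device lines are indented with one tab, and subsystem lines with two tabs.

```python
from typing import Optional

# Illustrative excerpt in pci.ids layout; not copied from a real pci.ids file.
SAMPLE_PCI_IDS = (
    "# sample excerpt\n"
    "10de  NVIDIA Corporation\n"
    "\t1e30  TU102GL [Quadro RTX 6000/8000]\n"
    "\t\t10de 0000  Example subsystem entry\n"
    "8086  Intel Corporation\n"
    "\t0000  Example Intel device\n"
)


def _normalize(hex_id: str) -> str:
    """Lower-case an ID and drop a leading '0x' (sysfs prints IDs as 0x10de)."""
    hex_id = hex_id.strip().lower()
    return hex_id[2:] if hex_id.startswith("0x") else hex_id


def lookup_device_name(pci_ids_text: str, vendor_id: str, device_id: str) -> Optional[str]:
    """Return the device name for vendor_id:device_id, or None if it is not listed."""
    vendor_id, device_id = _normalize(vendor_id), _normalize(device_id)
    in_vendor = False
    for line in pci_ids_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("\t\t"):
            continue  # subsystem entry, not needed for this lookup
        if line.startswith("\t"):
            if in_vendor:
                dev, _, name = line.strip().partition("  ")
                if dev.lower() == device_id:
                    return name.strip()
            continue
        # Top-level line: a vendor entry (or a class entry such as "C 03").
        in_vendor = line[:4].lower() == vendor_id
    return None


if __name__ == "__main__":
    print(lookup_device_name(SAMPLE_PCI_IDS, "0x10de", "0x1e30"))
```

On the node, the IDs to look up can be read from `/sys/bus/pci/devices/0000:05:00.0/vendor` and `/sys/bus/pci/devices/0000:05:00.0/device`. If the lookup returns nothing for a GPU that is physically present, the pci.ids file being consulted predates that device.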