DualCoder / vgpu_unlock

Unlock vGPU functionality for consumer grade GPUs.
MIT License

Verify all devices in group 25 are bound to vfio-<bus> or pci-stub and not already in use - mdev question #71

Closed wvthoog closed 3 years ago

wvthoog commented 3 years ago

Hi,

I guess this is a question about how to use mdevctl properly. I followed two tutorials on how to set up a vGPU in Proxmox: one from Craft Computing and the other found on this GitHub page. The first one starts and defines the mdev devices prior to creating a VM. The latter also uses the UUID of a previously created mdev device, but assigns the mdev device through the Proxmox web GUI.

My setup is:

Output of nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63       Driver Version: 470.63       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:2B:00.0 Off |                  N/A |
|  0%   40C    P8     8W / 120W |     23MiB /  6143MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:2C:00.0 Off |                  N/A |
| 30%   30C    P8    N/A /  75W |     15MiB /  4095MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Output of dmesg | grep -i vgpu

[   24.792596] vGPU unlock patch applied.

So when i follow the first approach i create the mdev device using

mdevctl start -u 0b5fd3fb-2389-4a22-ba70-52969a26b9d5 -p 0000:2b:00.0 --type nvidia-47

and define it

mdevctl define --auto --uuid 0b5fd3fb-2389-4a22-ba70-52969a26b9d5

then in Proxmox conf i added the line

args: -device 'vfio-pci,sysfsdev=/sys/bus/mdev/devices/0b5fd3fb-2389-4a22-ba70-52969a26b9d5,display=off,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0,x-pci-vendor-id=0x10de,x-pci-device-id=0x1b38,x-pci-sub-vendor-id=0x10de,x-pci-sub-device-id=0x11A0' -uuid 0b5fd3fb-2389-4a22-ba70-52969a26b9d5

After starting up the machine (qm start 205) I got this error

kvm: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/0b5fd3fb-2389-4a22-ba70-52969a26b9d5,display=off,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0,x-pci-vendor-id=0x10de,x-pci-device-id=0x1b38,x-pci-sub-vendor-id=0x10de,x-pci-sub-device-id=0x11A0: vfio 0b5fd3fb-2389-4a22-ba70-52969a26b9d5: error getting device from group 25: Input/output error
Verify all devices in group 25 are bound to vfio-<bus> or pci-stub and not already in use
start failed: QEMU exited with code 1

output of dmesg

[67364.199643] [nvidia-vgpu-vfio] 0b5fd3fb-2389-4a22-ba70-52969a26b9d5: start failed. status: 0x1 

output of journalctl -u nvidia-vgpu-mgr

aug 23 08:58:08 pve nvidia-vgpu-mgr[1853774]: vgpu_unlock loaded.
aug 23 08:58:08 pve nvidia-vgpu-mgr[1853774]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
aug 23 08:58:08 pve nvidia-vgpu-mgr[1853774]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 0b5fd3fb-2389-4a22-ba70-52969a26b9d5 GPU PCI id >
aug 23 08:58:08 pve nvidia-vgpu-mgr[1853774]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=47
aug 23 08:58:08 pve nvidia-vgpu-mgr[1853774]: notice: vmiop_env_log: Successfully updated env symbols!
aug 23 08:58:08 pve nvidia-vgpu-mgr[1853774]: op_type: 0x20801322 failed.
aug 23 08:58:08 pve nvidia-vgpu-mgr[1853774]: op_type: 0x2080014b failed.
aug 23 08:58:08 pve nvidia-vgpu-mgr[1853774]: op_type: 0xa0810115 failed.
aug 23 08:58:08 pve nvidia-vgpu-mgr[1853774]: notice: vmiop_log: (0x0): gpu-pci-id : 0x2b00
aug 23 08:58:08 pve nvidia-vgpu-mgr[1853774]: notice: vmiop_log: (0x0): vgpu_type : Quadro
aug 23 08:58:08 pve nvidia-vgpu-mgr[1853774]: notice: vmiop_log: (0x0): Framebuffer: 0x74000000
aug 23 08:58:08 pve nvidia-vgpu-mgr[1853774]: notice: vmiop_log: (0x0): Virtual Device Id: 0x1b38:0x11e9
aug 23 08:58:08 pve nvidia-vgpu-mgr[1853774]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
aug 23 08:58:08 pve nvidia-vgpu-mgr[1853774]: notice: vmiop_log: ######## vGPU Manager Information: ########
aug 23 08:58:08 pve nvidia-vgpu-mgr[1853774]: notice: vmiop_log: Driver Version: 470.63
aug 23 08:58:08 pve nvidia-vgpu-mgr[1853774]: op_type: 0x2080012f failed.
aug 23 08:58:08 pve nvidia-vgpu-mgr[1853774]: notice: vmiop_log: (0x0): Cannot query ECC status. vGPU ECC support will be disabled.
aug 23 08:58:08 pve nvidia-vgpu-mgr[1853774]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
aug 23 08:58:14 pve nvidia-vgpu-mgr[1853774]: error: vmiop_log: (0x0): Timed out (6001 ms) trying to sync
aug 23 08:58:14 pve nvidia-vgpu-mgr[1853774]: error: vmiop_log: (0x0): failed to sync engine
aug 23 08:58:16 pve nvidia-vgpu-mgr[1853774]: error: vmiop_log: (0x0): init_device_instance failed for inst 0 with error 7 (init frame copy engine)
aug 23 08:58:16 pve nvidia-vgpu-mgr[1853774]: error: vmiop_log: (0x0): Initialization: init_device_instance failed error 7
aug 23 08:58:16 pve nvidia-vgpu-mgr[1853774]: error: vmiop_log: display_init failed for inst: 0
aug 23 08:58:16 pve nvidia-vgpu-mgr[1853774]: error: vmiop_env_log: (0x0): vmiope_process_configuration: plugin registration error
aug 23 08:58:16 pve nvidia-vgpu-mgr[1853774]: error: vmiop_env_log: (0x0): vmiope_process_configuration failed with 0x65

When i try the second approach

i create and define the mdev device like i did above and use that uuid in the Proxmox conf.

args: -uuid 0b5fd3fb-2389-4a22-ba70-52969a26b9d5

Then I assigned an mdev device to the VM using the web interface, which adds the following line to the VM's conf

hostpci0: 0000:2b:00.0,mdev=nvidia-47,pcie=1,x-vga=1

Then i start up the machine (qm start 205) which results in this output

kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:2b:00.0/00000000-0000-0000-0000-000000000205,id=hostpci0,bus=pci.0,addr=0x10: warning: vfio 00000000-0000-0000-0000-000000000205: Could not enable error recovery for the device

It works and sees the vGPU as a Tesla P40, which is correct.

But powering it down and starting it back up again gives (more or less) the same error as the first approach

mdev instance '00000000-0000-0000-0000-000000000205' already existed, using it.
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:2b:00.0/00000000-0000-0000-0000-000000000205,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0,x-vga=on: vfio 00000000-0000-0000-0000-000000000205: error getting device from group 24: Input/output error
Verify all devices in group 24 are bound to vfio-<bus> or pci-stub and not already in use
start failed: QEMU exited with code 1

I'm clearly missing something here. So my question is: which one is the correct approach, and how do I get rid of this error?

Thanks

DualCoder commented 3 years ago

Ok, first of all, this error:

Verify all devices in group 25 are bound to vfio-<bus> or pci-stub and not already in use
start failed: QEMU exited with code 1

Is reported by QEMU for almost all driver issues. In this case dmesg will also show start failed. status: 0x1. If you see any of these you should check journalctl -u nvidia-vgpu-mgr for any error reported by the driver (which you provided for approach one only).
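As a minimal sketch of that triage step, the journal output can be narrowed to the lines the vGPU manager logged at error level. The function name vgpu_errors is hypothetical, and the pattern assumes the log line layout shown in this thread.

```shell
# Print only the lines that nvidia-vgpu-mgr logged at "error:" level.
# Reads journal text on stdin, so it also works on saved output.
# Assumption: lines look like "... nvidia-vgpu-mgr[PID]: error: ...".
vgpu_errors() {
    grep -E 'nvidia-vgpu-mgr\[[0-9]+\]: error: '
}

# Typical use on the host (needs the nvidia-vgpu-mgr unit to be present):
#   journalctl -u nvidia-vgpu-mgr --no-pager | vgpu_errors
#   dmesg | grep -i 'nvidia-vgpu-vfio'
```

The "op_type: ... failed." lines are deliberately not matched, since they also show up in the successful start later in this thread.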

Second of all, you are using vGPU version 13 (470.X). I have not tested version 13 myself yet (it is quite new), so it is possible that it contains bugs or does not work properly with vgpu_unlock. If the issue persists I would recommend trying vGPU version 11 (450.X) or 12 (460.X), which are known to work.

And also, I am not very familiar with Proxmox, but I will do my best to answer your questions.

Regarding the first approach:

then in Proxmox conf i added the line

args: -device 'vfio-pci,sysfsdev=/sys/bus/mdev/devices/0b5fd3fb-2389-4a22-ba70-52969a26b9d5,display=off,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0,x-pci-vendor-id=0x10de,x-pci-device-id=0x1b38,x-pci-sub-vendor-id=0x10de,x-pci-sub-device-id=0x11A0' -uuid 0b5fd3fb-2389-4a22-ba70-52969a26b9d5

Where did you get the subsystem ID (x-pci-sub-device-id=0x11A0) from? For P40-2Q (which I believe is type-47) it should be 0x11E9. Also, compared to NVIDIA's example for RHEL you are adding a lot of parameters, so maybe try with only vfio-pci and sysfsdev: https://docs.nvidia.com/grid/13.0/grid-vgpu-user-guide/index.html#adding-vgpu-to-red-hat-el-kvm-vm-qemu-cli
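One way to catch this kind of mismatch is to compare the subsystem ID in the "Virtual Device Id" log line with the x-pci-sub-device-id passed to QEMU. The sketch below is only illustrative: the three function names are hypothetical, and it assumes the log and argument formats shown in this thread.

```shell
# Extract the subsystem ID from a "Virtual Device Id: 0xVVVV:0xSSSS" log line (stdin).
logged_subsys_id() {
    sed -n 's/.*Virtual Device Id: 0x[0-9a-fA-F]*:\(0x[0-9a-fA-F]*\).*/\1/p' | head -n 1
}

# Extract x-pci-sub-device-id from a QEMU -device argument string ($1).
configured_subsys_id() {
    printf '%s\n' "$1" | sed -n 's/.*x-pci-sub-device-id=\(0x[0-9a-fA-F]*\).*/\1/p'
}

# Succeed when both IDs are non-empty and equal, ignoring hex letter case.
subsys_ids_match() {
    a=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    b=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example on the host:
#   logged=$(journalctl -u nvidia-vgpu-mgr --no-pager | logged_subsys_id)
#   subsys_ids_match "$logged" "$(configured_subsys_id "$device_args")" || echo "subsystem ID mismatch"
```

With the values from the first post, the log reports 0x11e9 while the args line sets 0x11A0, so the comparison fails.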

Then the lowest level error seems to be this:

error: vmiop_log: (0x0): Timed out (6001 ms) trying to sync

Unfortunately I can't say what is causing this without further debugging.

Regarding the second approach:

i create and define the mdev device like i did above and use that uuid in the Proxmox conf.

args: -uuid 0b5fd3fb-2389-4a22-ba70-52969a26b9d5

This seems unnecessary or wrong since the VM ended up using UUID 00000000-0000-0000-0000-000000000205 instead of the one you provided. What is the intended purpose of adding this UUID to the Proxmox conf?
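Judging only from the output in this thread, the UUID that Proxmox picks for a web-GUI mdev looks like the VMID zero-padded into the last UUID group. A small sketch of that pattern follows. The function name proxmox_mdev_uuid is hypothetical, and the rule is inferred from the hostpci0 output for VM 205, not taken from Proxmox documentation.

```shell
# Build the mdev UUID that Proxmox appears to use for a VM: the decimal VMID,
# zero-padded to 12 digits in the last UUID group.
# Assumption: inferred from the kvm output for VM 205 (hostpci0) in this thread.
# Pass the VMID without leading zeros, since printf treats those as octal.
proxmox_mdev_uuid() {
    printf '00000000-0000-0000-0000-%012d\n' "$1"
}

# Example:
#   ls /sys/bus/mdev/devices/ | grep "$(proxmox_mdev_uuid 205)"
```

If the pattern holds, this makes it easy to tell a Proxmox-created mdev apart from one created by hand with mdevctl.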

Then i start up the machine (qm start 205) which results in this output

kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:2b:00.0/00000000-0000-0000-0000-000000000205,id=hostpci0,bus=pci.0,addr=0x10: warning: vfio 00000000-0000-0000-0000-000000000205: Could not enable error recovery for the device

It works and sees the vGPU as a Tesla P40, which is correct.

I'm assuming "It works" means that the VM booted and was able to load the driver for the vGPU, this is good since it rules out any driver issue on the guest side.

But powering it down and starting it back up again gives (more or less) the same error as the first approach

You should provide the output of journalctl -u nvidia-vgpu-mgr for this case as well.

I'm clearly missing something here. So my question is: which one is the correct approach, and how do I get rid of this error?

It seems to me that both ways are sane and I would expect both to work. However the first case allows for more in-depth control of individual parameters (such as PCI-IDs and UUIDs), while the second case seems more "user-friendly" since it relies on the web-interface. I would prefer the first case.

wvthoog commented 3 years ago

Thanks for the helpful information. I've changed a couple of things. First, as you proposed, I downgraded the driver to 460.73.02

nvidia, 460.73.02, 5.11.22-3-pve, x86_64: installed

Then i created the first mdev using

mdevctl start -u 0b5fd3fb-2389-4a22-ba70-52969a26b9d5 -p 0000:2b:00.0 --type nvidia-47
mdevctl define --auto --uuid 0b5fd3fb-2389-4a22-ba70-52969a26b9d5

In Proxmox VM's conf i've added this line using only vfio-pci and sysfsdev

args: -device 'vfio-pci,sysfsdev=/sys/bus/mdev/devices/0b5fd3fb-2389-4a22-ba70-52969a26b9d5'

This gave the following error in journalctl -u nvidia-vgpu-mgr

aug 23 16:43:15 pve nvidia-vgpu-mgr[45395]: error: vmiop_env_log: Failed to get VM UUID from QEMU command-line 0x57
aug 23 16:43:15 pve nvidia-vgpu-mgr[45395]: error: vmiop_env_log: kvm_plugin_global_init failed with error 0x57

So i added a UUID to the VM's conf

args: -device 'vfio-pci,sysfsdev=/sys/bus/mdev/devices/0b5fd3fb-2389-4a22-ba70-52969a26b9d5' -uuid 00000000-0000-0000-0000-000000000205
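The line above can be generated rather than typed by hand, which avoids UUID typos. This is a minimal sketch: vgpu_args_line is a hypothetical helper, and it only reproduces the two options used here (vfio-pci with sysfsdev, plus -uuid).

```shell
# Print a Proxmox "args:" line that attaches an existing mdev by its UUID ($1)
# and passes a VM UUID ($2) to QEMU. The earlier journal error suggests that
# nvidia-vgpu-mgr reads the VM UUID from the QEMU command line.
vgpu_args_line() {
    mdev_uuid=$1
    vm_uuid=$2
    printf "args: -device 'vfio-pci,sysfsdev=/sys/bus/mdev/devices/%s' -uuid %s\n" \
        "$mdev_uuid" "$vm_uuid"
}

# Example:
#   vgpu_args_line 0b5fd3fb-2389-4a22-ba70-52969a26b9d5 00000000-0000-0000-0000-000000000205
```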

Now it did boot up, and the guest VM (Ubuntu 20.04) sees the vGPU as a Tesla P40

00:02.0 VGA compatible controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)

And this is the successful journalctl -u nvidia-vgpu-mgr

aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: vgpu_unlock loaded.
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 0b5fd3fb-2389-4a22-ba70-52969a26b9d5 GPU PCI id 00>
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=47
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_env_log: Successfully updated env symbols!
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: op_type: 0x20801322 failed.
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: op_type: 0x2080014b failed.
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: op_type: 0xa0810115 failed.
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: (0x0): gpu-pci-id : 0x2b00
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: (0x0): vgpu_type : Quadro
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: (0x0): Framebuffer: 0x74000000
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: (0x0): Virtual Device Id: 0x1b38:0x11e9
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: ######## vGPU Manager Information: ########
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: Driver Version: 460.73.02
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: op_type: 0x2080012f failed.
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: (0x0): Cannot query ECC status. vGPU ECC support will be disabled.
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: (0x0): vGPU migration disabled
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: display_init inst: 0 successful

But then when I shut down the VM and try to boot it back up again I get the same error as in my first message

kvm: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/0b5fd3fb-2389-4a22-ba70-52969a26b9d5: vfio 0b5fd3fb-2389-4a22-ba70-52969a26b9d5: error getting device from group 24: Input/output error
Verify all devices in group 24 are bound to vfio-<bus> or pci-stub and not already in use
start failed: QEMU exited with code 1

With this journalctl -u nvidia-vgpu-mgr

aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: vgpu_unlock loaded.
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 0b5fd3fb-2389-4a22-ba70-52969a26b9d5 GPU PCI id 00>
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=47
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_env_log: Successfully updated env symbols!
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: op_type: 0x20801322 failed.
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: op_type: 0x2080014b failed.
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: op_type: 0xa0810115 failed.
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_log: (0x0): gpu-pci-id : 0x2b00
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_log: (0x0): vgpu_type : Quadro
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_log: (0x0): Framebuffer: 0x74000000
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_log: (0x0): Virtual Device Id: 0x1b38:0x11e9
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_log: ######## vGPU Manager Information: ########
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_log: Driver Version: 460.73.02
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: op_type: 0x2080012f failed.
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_log: (0x0): Cannot query ECC status. vGPU ECC support will be disabled.
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
aug 23 16:49:21 pve nvidia-vgpu-mgr[53859]: error: vmiop_log: (0x0): Timed out (6001 ms) trying to sync
aug 23 16:49:21 pve nvidia-vgpu-mgr[53859]: error: vmiop_log: (0x0): failed to sync engine
aug 23 16:49:23 pve nvidia-vgpu-mgr[53859]: error: vmiop_log: (0x0): init_device_instance failed for inst 0 with error 7 (init frame copy engine)
aug 23 16:49:23 pve nvidia-vgpu-mgr[53859]: error: vmiop_log: (0x0): Initialization: init_device_instance failed error 7
aug 23 16:49:23 pve nvidia-vgpu-mgr[53859]: error: vmiop_log: display_init failed for inst: 0
aug 23 16:49:23 pve nvidia-vgpu-mgr[53859]: error: vmiop_env_log: (0x0): vmiope_process_configuration: plugin registration error
aug 23 16:49:23 pve nvidia-vgpu-mgr[53859]: error: vmiop_env_log: (0x0): vmiope_process_configuration failed with 0x65

and dmesg

[ 1380.319062] [nvidia-vgpu-vfio] 0b5fd3fb-2389-4a22-ba70-52969a26b9d5: start failed. status: 0x1 

Should I try a different driver? 450? Another version of 460?

DualCoder commented 3 years ago

Hmm, odd that it would fail to reboot.

Should I try a different driver? 450? Another version of 460?

Yes, you can try different versions. 460.32.04, 450.124, 450.89 and 450.80 have had some success with Proxmox.

nvidia, 460.73.02, 5.11.22-3-pve, x86_64: installed

Is that the linux kernel version? 5.11? Did you make some modifications to the vGPU driver to make it work on kernels above 5.9?
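To check this without reading the line by eye, the kernel field can be pulled out of the dkms status line and compared against 5.9. The sketch below is only an illustration: both function names are hypothetical, it assumes the comma-separated dkms status format quoted above, and it relies on GNU sort -V.

```shell
# Pull the kernel field out of a dkms status line ($1) such as
# "nvidia, 460.73.02, 5.11.22-3-pve, x86_64: installed".
dkms_kernel() {
    printf '%s\n' "$1" | awk -F', ' '{ print $3 }'
}

# Succeed when kernel release $1 is newer than the 5.9 series (5.10 or later).
kernel_above_5_9() {
    release=${1%%-*}                                        # "5.11.22-3-pve" -> "5.11.22"
    major_minor=$(printf '%s' "$release" | cut -d. -f1,2)   # "5.11.22" -> "5.11"
    [ "$major_minor" != "5.9" ] &&
        [ "$(printf '%s\n' "5.9" "$major_minor" | sort -V | tail -n 1)" = "$major_minor" ]
}

# Example on the host:
#   kernel_above_5_9 "$(uname -r)" && echo "vGPU driver may need a kernel patch"
```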

wvthoog commented 3 years ago

Okay, i will try those drivers and report back to you.

Yep, it's kernel 5.11, and to get the nvidia dkms module to build successfully I used this patch

wvthoog commented 3 years ago

Well, it seems to be working now. Tried driver 460.32.04 and patched it with these two files (patch1 / patch2)

Now i can shut down and restart the VM with the vGPU attached without any issue.

smurmann commented 3 years ago

For anyone browsing through: @wvthoog's last comment, with driver 460.32.04 and the two patches, resolved the issue for me as well.

Thanks for being so detailed.

wvthoog commented 3 years ago

No problem. Made a tutorial for others that may be struggling.

https://wvthoog.nl/proxmox-7-vgpu/