Closed: wvthoog closed this issue 3 years ago
Ok, first of all, this error:
Verify all devices in group 25 are bound to vfio-<bus> or pci-stub and not already in use
start failed: QEMU exited with code 1
is reported by QEMU for almost all driver issues. In this case dmesg will also show "start failed. status: 0x1". If you see any of these, you should check journalctl -u nvidia-vgpu-mgr for any error reported by the driver (which you provided for approach one only).
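A minimal sketch of that check, assuming the error lines always contain the text "error: vmiop" as they do in the excerpts in this thread. The function name vgpu_errors is made up for illustration.

```shell
# Print only the error-level lines from nvidia-vgpu-mgr log text read on stdin.
# Assumption: error lines contain "error: vmiop" (true for the excerpts in
# this thread, not verified against every driver version).
vgpu_errors() {
    # "|| true" keeps the exit status at 0 when there are no error lines.
    grep -F 'error: vmiop' || true
}

# Typical use on the host (not executed here):
#   journalctl -u nvidia-vgpu-mgr --no-pager | vgpu_errors
#   dmesg | grep -F 'nvidia-vgpu-vfio'
```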
Second of all, you are using vGPU version 13 (470.X). I have not tested version 13 myself yet (it is quite new), so it may contain bugs or may not work properly with vgpu_unlock. If the issue persists, I would recommend trying vGPU version 11 (450.X) or 12 (460.X), which are known to work.
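The branch-to-release mapping mentioned above can be written down as a small lookup. This is only a sketch covering the three branches named in this thread; the function name is hypothetical.

```shell
# Map a host driver version to the vGPU release it belongs to.
# Only the three branches mentioned in this thread are covered.
vgpu_release_for_driver() {
    case "$1" in
        450.*) echo 11 ;;
        460.*) echo 12 ;;
        470.*) echo 13 ;;
        *)     echo unknown ;;
    esac
}

# Example: vgpu_release_for_driver 460.73.02
```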
And also, I am not very familiar with Proxmox, but I will do my best to answer your questions.
Regarding the first approach:
then in Proxmox conf i added the line
args: -device 'vfio-pci,sysfsdev=/sys/bus/mdev/devices/0b5fd3fb-2389-4a22-ba70-52969a26b9d5,display=off,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0,x-pci-vendor-id=0x10de,x-pci-device-id=0x1b38,x-pci-sub-vendor-id=0x10de,x-pci-sub-device-id=0x11A0' -uuid 0b5fd3fb-2389-4a22-ba70-52969a26b9d5
Where did you get the subsystem ID (x-pci-sub-device-id=0x11A0) from? For P40-2Q (which I believe is type-47) it should be 0x11E9. Also, compared to NVIDIA's example for RHEL you are adding a lot of parameters, so maybe try with only vfio-pci and sysfsdev:
https://docs.nvidia.com/grid/13.0/grid-vgpu-user-guide/index.html#adding-vgpu-to-red-hat-el-kvm-vm-qemu-cli
Then the lowest level error seems to be this:
error: vmiop_log: (0x0): Timed out (6001 ms) trying to sync
Unfortunately I can't say what is causing this without further debugging.
Regarding the second approach:
i create and define the mdev device like i did above and use that uuid in the Proxmox conf.
args: -uuid 0b5fd3fb-2389-4a22-ba70-52969a26b9d5
This seems unnecessary or wrong, since the VM ended up using UUID 00000000-0000-0000-0000-000000000205 instead of the one you provided. What is the intended purpose of adding this UUID to the Proxmox conf?
Then i start up the machine (qm start 205) which results in this output
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:2b:00.0/00000000-0000-0000-0000-000000000205,id=hostpci0,bus=pci.0,addr=0x10: warning: vfio 00000000-0000-0000-0000-000000000205: Could not enable error recovery for the device
It works and sees the vGPU as a Tesla P40, which is correct.
I'm assuming "It works" means that the VM booted and was able to load the driver for the vGPU, this is good since it rules out any driver issue on the guest side.
But when powering down and starting it back up again, it gives (more or less) the same error as the first approach
You should provide the output of journalctl -u nvidia-vgpu-mgr for this case as well.
I'm clearly missing something here. So my question is, which one is the correct approach and how do i get rid of this error
It seems to me that both ways are sane and I would expect both to work. However the first case allows for more in-depth control of individual parameters (such as PCI-IDs and UUIDs), while the second case seems more "user-friendly" since it relies on the web-interface. I would prefer the first case.
Thanks for the helpful information. I've changed a couple of things. First, as you proposed, I downgraded the driver to 460.73.02:
nvidia, 460.73.02, 5.11.22-3-pve, x86_64: installed
Then i created the first mdev using
mdevctl start -u 0b5fd3fb-2389-4a22-ba70-52969a26b9d5 -p 0000:2b:00.0 --type nvidia-47
mdevctl define --auto --uuid 0b5fd3fb-2389-4a22-ba70-52969a26b9d5
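Since the VM configuration refers to the device by its sysfs path, it can help to confirm that the mdev actually exists before starting the VM. A minimal sketch: the function name and the optional second argument (an alternative root directory, handy for dry runs) are made up for illustration.

```shell
# Succeed if an mdev with the given UUID is present in sysfs.
# $1 = mdev UUID
# $2 = optional root directory (hypothetical parameter, defaults to the
#      sysfs location used in the args line in this thread)
mdev_present() {
    root=${2:-/sys/bus/mdev/devices}
    [ -e "$root/$1" ]
}

# Typical use on the host (not executed here):
#   mdev_present 0b5fd3fb-2389-4a22-ba70-52969a26b9d5 || echo "mdev missing"
```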
In the Proxmox VM's conf I've added this line, using only vfio-pci and sysfsdev:
args: -device 'vfio-pci,sysfsdev=/sys/bus/mdev/devices/0b5fd3fb-2389-4a22-ba70-52969a26b9d5'
This gave the following error in journalctl -u nvidia-vgpu-mgr
aug 23 16:43:15 pve nvidia-vgpu-mgr[45395]: error: vmiop_env_log: Failed to get VM UUID from QEMU command-line 0x57
aug 23 16:43:15 pve nvidia-vgpu-mgr[45395]: error: vmiop_env_log: kvm_plugin_global_init failed with error 0x57
So i added a UUID to the VM's conf
args: -device 'vfio-pci,sysfsdev=/sys/bus/mdev/devices/0b5fd3fb-2389-4a22-ba70-52969a26b9d5' -uuid 00000000-0000-0000-0000-000000000205
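The line above can be generated from the mdev UUID and the VMID. This sketch assumes the VM UUID is the VMID zero-padded into the last UUID group, which is the pattern seen in this thread for VM 205; it is not a documented Proxmox rule. Both function names are hypothetical.

```shell
# Build the VM UUID seen in this thread: the VMID zero-padded to 12 digits
# in the last group. Assumption based on the observed value for VM 205.
vm_uuid_for_vmid() {
    printf '00000000-0000-0000-0000-%012d\n' "$1"
}

# Print the "args:" line for a VM conf.
# $1 = mdev UUID, $2 = VMID
mdev_args_line() {
    printf "args: -device 'vfio-pci,sysfsdev=/sys/bus/mdev/devices/%s' -uuid %s\n" \
        "$1" "$(vm_uuid_for_vmid "$2")"
}

# Example: mdev_args_line 0b5fd3fb-2389-4a22-ba70-52969a26b9d5 205
```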
Now it did boot up and it sees the vGPU in the guest VM (Ubuntu 20.04) as a Tesla P40
00:02.0 VGA compatible controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)
And this is the successful journalctl -u nvidia-vgpu-mgr
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: vgpu_unlock loaded.
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 0b5fd3fb-2389-4a22-ba70-52969a26b9d5 GPU PCI id 00>
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=47
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_env_log: Successfully updated env symbols!
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: op_type: 0x20801322 failed.
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: op_type: 0x2080014b failed.
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: op_type: 0xa0810115 failed.
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: (0x0): gpu-pci-id : 0x2b00
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: (0x0): vgpu_type : Quadro
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: (0x0): Framebuffer: 0x74000000
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: (0x0): Virtual Device Id: 0x1b38:0x11e9
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: ######## vGPU Manager Information: ########
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: Driver Version: 460.73.02
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: op_type: 0x2080012f failed.
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: (0x0): Cannot query ECC status. vGPU ECC support will be disabled.
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: (0x0): vGPU migration disabled
aug 23 16:47:04 pve nvidia-vgpu-mgr[50488]: notice: vmiop_log: display_init inst: 0 successful
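The "Virtual Device Id" line in the log above reports 0x1b38:0x11e9, which agrees with the 0x11E9 subsystem ID suggested earlier rather than the 0x11A0 used in the first configuration. A small sketch for pulling that value out of the log and comparing it case-insensitively; both helper names are hypothetical.

```shell
# Extract the "vendor:subsystem" pair from nvidia-vgpu-mgr log text on stdin.
virtual_device_id() {
    sed -n 's/.*Virtual Device Id: \(0x[0-9a-fA-F]*:0x[0-9a-fA-F]*\).*/\1/p'
}

# Compare two hex IDs, ignoring the case of the digits A-F.
same_hex() {
    a=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    b=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ "$a" = "$b" ]
}

# Typical use on the host (not executed here):
#   id=$(journalctl -u nvidia-vgpu-mgr --no-pager | virtual_device_id | tail -n 1)
#   same_hex "${id#*:}" 0x11E9 && echo "subsystem ID matches"
```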
But then when I shut down the VM and try to boot it back up again, I get the same error as in my first message
kvm: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/0b5fd3fb-2389-4a22-ba70-52969a26b9d5: vfio 0b5fd3fb-2389-4a22-ba70-52969a26b9d5: error getting device from group 24: Input/output error
Verify all devices in group 24 are bound to vfio-<bus> or pci-stub and not already in use
start failed: QEMU exited with code 1
With this journalctl -u nvidia-vgpu-mgr
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: vgpu_unlock loaded.
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 0b5fd3fb-2389-4a22-ba70-52969a26b9d5 GPU PCI id 00>
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=47
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_env_log: Successfully updated env symbols!
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: op_type: 0x20801322 failed.
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: op_type: 0x2080014b failed.
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: op_type: 0xa0810115 failed.
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_log: (0x0): gpu-pci-id : 0x2b00
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_log: (0x0): vgpu_type : Quadro
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_log: (0x0): Framebuffer: 0x74000000
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_log: (0x0): Virtual Device Id: 0x1b38:0x11e9
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_log: ######## vGPU Manager Information: ########
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_log: Driver Version: 460.73.02
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: op_type: 0x2080012f failed.
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_log: (0x0): Cannot query ECC status. vGPU ECC support will be disabled.
aug 23 16:49:15 pve nvidia-vgpu-mgr[53859]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
aug 23 16:49:21 pve nvidia-vgpu-mgr[53859]: error: vmiop_log: (0x0): Timed out (6001 ms) trying to sync
aug 23 16:49:21 pve nvidia-vgpu-mgr[53859]: error: vmiop_log: (0x0): failed to sync engine
aug 23 16:49:23 pve nvidia-vgpu-mgr[53859]: error: vmiop_log: (0x0): init_device_instance failed for inst 0 with error 7 (init frame copy engine)
aug 23 16:49:23 pve nvidia-vgpu-mgr[53859]: error: vmiop_log: (0x0): Initialization: init_device_instance failed error 7
aug 23 16:49:23 pve nvidia-vgpu-mgr[53859]: error: vmiop_log: display_init failed for inst: 0
aug 23 16:49:23 pve nvidia-vgpu-mgr[53859]: error: vmiop_env_log: (0x0): vmiope_process_configuration: plugin registration error
aug 23 16:49:23 pve nvidia-vgpu-mgr[53859]: error: vmiop_env_log: (0x0): vmiope_process_configuration failed with 0x65
and dmesg
[ 1380.319062] [nvidia-vgpu-vfio] 0b5fd3fb-2389-4a22-ba70-52969a26b9d5: start failed. status: 0x1
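The two logs above are identical up to "Init frame copy engine: syncing...": the successful start ends with "display_init inst: 0 successful", while the failed one ends with "display_init failed". A minimal sketch that classifies a log excerpt on that basis, assuming those two messages are stable markers; the function name is made up.

```shell
# Classify nvidia-vgpu-mgr log text read on stdin as ok / failed / unknown.
# Assumption: the display_init messages seen in this thread are reliable
# markers. If the text covers several start attempts, any failure wins.
vgpu_start_status() {
    log=$(cat)
    case "$log" in
        *'display_init failed'*)               echo failed ;;
        *'display_init inst: '*' successful'*) echo ok ;;
        *)                                     echo unknown ;;
    esac
}

# Typical use on the host (not executed here):
#   journalctl -u nvidia-vgpu-mgr -b --no-pager | vgpu_start_status
```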
Should I try a different driver? 450? Another version of 460?
Hmm, odd that it would fail to reboot.
Should I try a different driver? 450? Another version of 460?
Yes, you can try different versions: 460.32.04, 450.124, 450.89 and 450.80 have had some success with Proxmox.
nvidia, 460.73.02, 5.11.22-3-pve, x86_64: installed
Is that the Linux kernel version? 5.11? Did you make some modifications to the vGPU driver to make it work on kernels above 5.9?
Okay, I will try those drivers and report back to you.
Yep, it's kernel 5.11, and to compile the nvidia dkms module I've used this patch so that it builds successfully
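For reference, a sketch that splits a status line like the one quoted above (presumably from dkms status) into driver and kernel versions and flags kernels above 5.9, where a patched driver was needed in this thread. The helper names are hypothetical and the parsing assumes exactly that comma-separated layout.

```shell
# Parse a line such as: nvidia, 460.73.02, 5.11.22-3-pve, x86_64: installed
dkms_driver() { printf '%s\n' "$1" | awk -F', ' '{print $2}'; }
dkms_kernel() { printf '%s\n' "$1" | awk -F', ' '{print $3}'; }

# Succeed if the kernel version (e.g. 5.11.22-3-pve) is above the 5.9 series.
kernel_above_5_9() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    minor=${minor%%[!0-9]*}   # drop any non-numeric suffix
    [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -gt 9 ]; }
}

# Example:
#   line='nvidia, 460.73.02, 5.11.22-3-pve, x86_64: installed'
#   kernel_above_5_9 "$(dkms_kernel "$line")" && echo "patched driver likely needed"
```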
For anyone browsing through: @wvthoog's last comment, with driver 460.32.04 and the two patches, resolved the issue for me as well.
Thanks for being so detailed.
No problem. Made a tutorial for others that may be struggling.
Hi,
I guess this is a question about how to use mdevctl properly. I followed two tutorials on how to set up a vGPU in Proxmox: one from Craft Computing and the other found on this GitHub page. The first one starts and defines the mdev devices prior to creating a VM; the latter uses the UUID of a previously created mdev device but uses the Proxmox web GUI to assign an mdev device.
My setup is:
Output of nvidia-smi
Output of dmesg | grep -i vgpu
So when i follow the first approach i create the mdev device using
and define it
then in Proxmox conf i added the line
after starting up the machine (qm start 205) i've got this error
output of dmesg
output of journalctl -u nvidia-vgpu-mgr
When i try the second approach
i create and define the mdev device like i did above and use that uuid in the Proxmox conf.
then I've assigned an mdev device to the VM using the web interface, which adds the following line to the VM's conf
Then i start up the machine (qm start 205) which results in this output
It works and sees the vGPU as a Tesla P40, which is correct.
But when powering down and starting it back up again, it gives (more or less) the same error as the first approach
I'm clearly missing something here. So my question is, which one is the correct approach and how do i get rid of this error
Thanks