cyberus-technology / virtualbox-kvm

KVM Backend for VirtualBox. With our current development model, we cannot easily accept pull requests here. If you'd like to contribute, feel free to reach out to us, we are happy to find a solution.
GNU General Public License v3.0

Code 43 in guest when passing through NVIDIA GPU #34

Open · AnErrupTion opened this issue 3 months ago

AnErrupTion commented 3 months ago

Bug Description

When following the guide over here, adapting it to pass through a dedicated GPU, a Code 43 error can be observed in Device Manager after installing the GPU drivers in the guest system.

How to Reproduce

  1. Follow the previously linked guide (a sketch of these setup steps follows after this list), ensuring that:
    • vfio-pci is correctly bound to the GPU
    • The proper memlock modifications are made in /etc/security/limits.conf
    • The proper permissions are set on /dev/vfio/*
    • The VFIO device is attached to the guest using --attachvfio
  2. Install the GPU drivers, in this case the latest NVIDIA 560.81 drivers
  3. Reboot and observe the Code 43 error (NOTE: a fairly long ~5-6 second freeze can also be observed when booting the VM; I assume it tries to load the NVIDIA driver but fails to do so)
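
For reference, a minimal sketch of the host-side setup from step 1, assuming a Linux host. The vendor:device ID, PCI address, and VM name are taken from later in this thread; the exact VBoxManage syntax is an assumption based on the --attachvfio flag mentioned above and may differ from the guide.

```sh
# Bind the GPU to vfio-pci (10de:25a2 is this card's vendor:device ID,
# taken from the registry path later in this thread; adjust as needed)
echo "10de 25a2" | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id

# /etc/security/limits.conf -- let the VM user lock enough memory
# (placeholder values):
#   youruser  soft  memlock  unlimited
#   youruser  hard  memlock  unlimited

# Make the VFIO device nodes accessible to the VM user
sudo chown "$USER" /dev/vfio/*

# Attach the device to the VM ("Windows 11" is the VM name from the logs;
# the exact invocation is an assumption, not confirmed syntax)
VBoxManage modifyvm "Windows 11" --attachvfio 0000:01:00.0
```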

VM configuration

Guest OS configuration details:

Host OS details:

Logs

snue commented 3 months ago

I see the split lock detection triggers in your dmesg log. That will cause issues for the VM, up to the point where it may not make any progress. I am not sure whether that is the root cause of your issue, but please try the recommendation from the README and see if it helps:

> Starting with Intel Tiger Lake (11th Gen Core processors) or newer, split lock detection must be turned off in the host system. This can be achieved using the Linux kernel command line parameter split_lock_detect=off or using the split_lock_mitigate sysctl.
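
For concreteness, two hedged ways to apply that on the host (exact steps depend on the distro and kernel version):

```sh
# a) Permanently: add split_lock_detect=off to GRUB_CMDLINE_LINUX in
#    /etc/default/grub, regenerate the GRUB config, and reboot.

# b) At runtime, via the sysctl the README mentions (available on recent
#    kernels; 0 turns the mitigation off)
sudo sysctl kernel.split_lock_mitigate=0
```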

AnErrupTion commented 3 months ago

> I see the split lock detection triggers in your dmesg log. That will cause issues for the VM, up to the point where it may not make any progress. I am not sure whether that is the root cause of your issue, but please try the recommendation from the README and see if it helps:
>
> > Starting with Intel Tiger Lake (11th Gen Core processors) or newer, split lock detection must be turned off in the host system. This can be achieved using the Linux kernel command line parameter split_lock_detect=off or using the split_lock_mitigate sysctl.

I was pretty sure I had already disabled it. But either way, adding the command line parameter didn't change anything, although I now see this in dmesg:

 Unknown kernel command line parameters "split_lock_detect=off", will be passed to user space.

But I also see x86/split lock detection: disabled earlier in the log, so I'm assuming it's actually disabled now.
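
For reference, two quick generic checks (not from the README) that should confirm this:

```sh
# The parameter should appear on the kernel command line...
grep -o 'split_lock_detect=off' /proc/cmdline

# ...and the kernel should report detection as disabled
sudo dmesg | grep -i 'split lock'
```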

tpressure commented 3 months ago

@snue is correct.

Here we have it

[ 2109.050169] x86/split lock detection: #AC: EMT-0/4675 took a split_lock trap at address: 0xfffff8021f251f4f

> Unknown kernel command line parameters "split_lock_detect=off", will be passed to user space.

Yes, this is expected.

> But I also see x86/split lock detection: disabled earlier in the log, so I'm assuming it's actually disabled now.

Sounds about right. Did it solve your issue?

AnErrupTion commented 3 months ago

> @snue is correct.
>
> Here we have it
>
> [ 2109.050169] x86/split lock detection: #AC: EMT-0/4675 took a split_lock trap at address: 0xfffff8021f251f4f
>
> > Unknown kernel command line parameters "split_lock_detect=off", will be passed to user space.
>
> Yes, this is expected.
>
> > But I also see x86/split lock detection: disabled earlier in the log, so I'm assuming it's actually disabled now.
>
> Sounds about right. Did it solve your issue?

Unfortunately, it didn't solve the issue.

tpressure commented 3 months ago

@AnErrupTion can you post new logs with split lock disabled?

AnErrupTion commented 3 months ago

Ah yes, my bad. Here they are:

dmesg.log
Windows 11-2024-08-14-17-15-07.log

tpressure commented 3 months ago

It looks a little bit better, and the guest is definitely trying to use the GPU:

00:00:07.099476 VFIO: RegisterBar 0xf0000000 
00:00:07.099500 VFIO: RegisterBar 0x800000000 
00:00:07.099501 VFIO: RegisterBar 0x900000000 
00:00:07.099503 VFIO: RegisterBar 0x6000 
00:00:07.099809 VFIO: Activate MSI count: 1

and

[   43.766761] vfio-pci 0000:01:00.0: enabling device (0000 -> 0003)

I assume this card needs some kind of quirk. I can maybe look into this in a couple of weeks.

Can you upload the output of lspci -vvvn please?
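
For anyone following along, a hedged example of capturing that, using the device address from the dmesg line above:

```sh
# Run as root so the full capability list is decoded; -s restricts the
# output to the GPU at 01:00.0
sudo lspci -vvvn -s 01:00.0 | tee lspci.log
```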

AnErrupTion commented 3 months ago

> I assume this card needs some kind of quirk. I can maybe look into this in a couple of weeks.

I'm not sure it does, since passing through the same GPU with QEMU works just fine (no additional quirks or shenanigans needed).

> Can you upload the output of lspci -vvvn please?

Alright, here's the output (when run as root): lspci.log

tpressure commented 3 months ago

> I'm not sure it does, since passing through the same GPU with QEMU works just fine (no additional quirks or shenanigans needed).

QEMU automatically applies the necessary quirks when it detects a card that needs them.

AnErrupTion commented 3 months ago

> > I'm not sure it does, since passing through the same GPU with QEMU works just fine (no additional quirks or shenanigans needed).
>
> QEMU automatically applies the necessary quirks when it detects a card that needs them.

Is there a way of knowing which ones it applies? I can fire up a QEMU VM if needed.
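
(Aside: QEMU's device-specific VFIO quirks live in hw/vfio/pci-quirks.c in its source tree, so one hedged way to inspect them, assuming a QEMU checkout, is:)

```sh
# List NVIDIA-related quirk code in the QEMU sources (file path is from
# upstream QEMU; exact matches vary by version)
grep -in 'nvidia' hw/vfio/pci-quirks.c
```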

AnErrupTion commented 3 months ago

Also, I forgot to mention one interesting bit: when I checked for updates in the VM, Windows Update did not download the NVIDIA driver, and I had to download it manually (it then installed fine). And when I went to Device Manager, it said that the driver in use is not the same one as the POSTed graphics driver, or something along those lines. None of this happened with QEMU either.

snue commented 3 months ago

There are quite a few NVIDIA quirks in QEMU. The quirky MSI handling is an obvious suspect, but so is the mirrored config space access in general. See this background discussion: https://patchwork.kernel.org/project/qemu-devel/patch/20180129202326.9417.71344.stgit@gimli.home/

Just maybe, you can force the GPU into legacy interrupt mode instead of MSI in the Windows VM to try and work around that?

AnErrupTion commented 3 months ago

> There are quite a few NVIDIA quirks in QEMU. The quirky MSI handling is an obvious suspect, but so is the mirrored config space access in general. See this background discussion: https://patchwork.kernel.org/project/qemu-devel/patch/20180129202326.9417.71344.stgit@gimli.home/
>
> Just maybe, you can force the GPU into legacy interrupt mode instead of MSI in the Windows VM to try and work around that?

I have tried to disable MSI by setting MSISupported to 0 instead of 1 under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\PCI\VEN_10DE&DEV_25A2&SUBSYS_13FC1043&REV_A1\3&267a616a&0&80\Device Parameters\Interrupt Management\MessageSignaledInterruptProperties, but unfortunately the problem still persists. One interesting thing, though: in the utility I was using (MSI mode utility v3.1), my GPU doesn't actually appear in the list of devices, even though it's present in the registry and supports MSI (that last part shouldn't matter, because devices that don't support MSI also appear in the program's list):

[Screenshot: MSI mode utility v3.1 device list, with the GPU missing]
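
For reference, the registry change described above, expressed as a single command from an elevated prompt (path copied verbatim from this comment; purely illustrative):

```bat
:: Set MSISupported to 0 to force line-based (legacy) interrupts
reg add "HKLM\SYSTEM\CurrentControlSet\Enum\PCI\VEN_10DE&DEV_25A2&SUBSYS_13FC1043&REV_A1\3&267a616a&0&80\Device Parameters\Interrupt Management\MessageSignaledInterruptProperties" /v MSISupported /t REG_DWORD /d 0 /f
```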