GPU passthrough not working on 4.2-rc4 while it worked on 4.2-rc3

matheusd commented 11 months ago

Qubes OS release

4.2-rc4

Brief summary

It seems I found a regression between 4.2-rc3 and 4.2-rc4 related to passthrough GPUs.

My test rig is a desktop computer with both an onboard and offboard GPU (H310M motherboard, NVIDIA Geforce RTX 4070 secondary GPU).

GPU passthrough works for me on rc3, while it fails on rc4 with "No device found" error.

Going through the steps in gaming HVM, #4321 or the associated forum post does not work on rc4, while they worked on rc3.

Steps to reproduce

Clean install of 4.2-rc4
- Enable non-free repo in debian 12 template
Update debian 12 template
Create standalone template based on debian 12 called "gpu-qube"
Adjust "gpu-qube" settings:
- Change memory to 2000
- Disable "Include in memory balance"
- Set kernel to "Provided by Qube"
- Pass GPU to the qube in the "deviices" tab
Install NVIDIA driver on "gpu-qube"
- sudo apt nvidia-driver
Restart "gpu-qube"
Verify GPU is detected
- nvidia-smi

Expected behavior

Display GPU details (name, power draw, etc).

Actual behavior

"No devices found"

Additional Info

To reiterate, this worked on rc3 on the first try but it fails on rc4. Uncertain what changed between them. This is a test rig where I can reinstall any version from scratch, so I can get any necessary dumps needed.

The Qubes install is done on the entire SSD, so there are no issues with it being a secondary OS or something.

lspci -vv shows:

on dom0: "Kernel driver in use: pciback"
on gpu-test: "Kernel driver in use: nvidia"

marmarek commented 11 months ago

Can you post kernel messages from gpu-qube ? And maybe also /var/log/xen/console/guest-gpu-qube-dm.log

matheusd commented 11 months ago

Sure thing. The two dirs are the logs with/without the max-ram-below-4g in /usr/share/qubes/templates/libvirt/xen.xml.

logs.tar.gz

I just had a system-wide hard freeze in rc4, not sure whether it's related to this or not. Never got such a freeze on this machine on 4.1 or on 4.2-rc3 (though I used rc3 only for a few hours for testing).

matheusd commented 11 months ago

Also, I can get the logs for rc3 too if you need them, but that will take some time as I'll have to go through the install process again.

marmarek commented 11 months ago

It's a shame the only error is just NVRM: Xid (PCI:0000:00:06): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus. ... But it does hint to use nvidia-bug-report.sh to collect more info. But also, based on the logs I have two ideas:

add permissive=true option via qvm-pci
add pci=nomsi to the qube's kernelopts

DemiMarie commented 11 months ago

Let me guess: proper fix is MSI-X?

marmarek commented 11 months ago

MSI-X should be supported at this point already, but maybe nvidia is not happy about that for some reason (I've seen drivers that behave differently in other areas depending on which interrupt delivery method is used).

halobarrlets commented 11 months ago

I have similar problem after last dom0 update. If I use kernel-latest before 6.5.6-2.qubes.fc37.x86_64 in dom0 then my WiFi adapter doesn't work without permissive=True even though it worked before without this option. I have this log in -dm with kernel 6.4.13-1.qubes.fc37.x86_64:

[2023-10-16 14:05:26] [00:06.0] xen_pt_msixctrl_reg_write: enable MSI-X
[2023-10-16 14:05:26] [00:06.0] msi_msix_update: Updating MSI-X with pirq 151 gvec 0xef gflags 0x0 (entry: 0x0)
[2023-10-16 14:05:26] [00:06.0] msi_msix_update: Updating MSI-X with pirq 150 gvec 0xef gflags 0x0 (entry: 0x1)
[2023-10-16 14:05:26] [00:06.0] msi_msix_update: Updating MSI-X with pirq 149 gvec 0xef gflags 0x0 (entry: 0x2)
[2023-10-16 14:05:26] [00:06.0] msi_msix_update: Updating MSI-X with pirq 148 gvec 0xef gflags 0x0 (entry: 0x3)
[2023-10-16 14:05:26] [00:06.0] msix_set_enable: disabling MSI-X.
[2023-10-16 14:05:26] [00:06.0] msi_msix_disable: Unbind MSI-X with pirq 151, gvec 0xef
[2023-10-16 14:05:26] [00:06.0] msi_msix_disable: Unmap MSI-X pirq 151
[2023-10-16 14:05:26] [00:06.0] msi_msix_disable: Unbind MSI-X with pirq 150, gvec 0xef
[2023-10-16 14:05:26] [00:06.0] msi_msix_disable: Unmap MSI-X pirq 150
[2023-10-16 14:05:26] [00:06.0] msi_msix_disable: Unbind MSI-X with pirq 149, gvec 0xef
[2023-10-16 14:05:26] [00:06.0] msi_msix_disable: Unmap MSI-X pirq 149
[2023-10-16 14:05:26] [00:06.0] msi_msix_disable: Unbind MSI-X with pirq 148, gvec 0xef
[2023-10-16 14:05:26] [00:06.0] msi_msix_disable: Unmap MSI-[2023-10-16 14:05:26] X pirq 148
[2023-10-16 14:05:26] [00:06.0] xen_pt_msixctrl_reg_write: disable MSI-X
[2023-10-16 14:05:26] [00:06.0] pci_msix_read: reading PBA, addr 0x804, offset 0x704
...

If I use 6.5.6-2.qubes.fc37.x86_64 or 6.1.57-1.qubes.fc37.x86_64 in dom0 then WiFi adapter works again without permissive=True with this log in -dm:

[2023-10-16 14:19:46] [00:06.0] xen_pt_msixctrl_reg_write: enable MSI-X
[2023-10-16 14:19:46] [00:06.0] msi_msix_update: Updating MSI-X with pirq 151 gvec 0xef gflags 0x0 (entry: 0x0)
[2023-10-16 14:19:46] [00:06.0] msi_msix_update: Updating MSI-X with pirq 150 gvec 0xef gflags 0x0 (entry: 0x1)
[2023-10-16 14:19:46] [00:06.0] msi_msix_update: Updating MSI-X with pirq 149 gvec 0xef gflags 0x0 (entry: 0x2)
[2023-10-16 14:19:46] [00:06.0] msi_msix_update: Updating MSI-X with pirq 148 gvec 0xef gflags 0x0 (entry: 0x3)
[2023-10-16 14:19:46] [00:06.0] msi_msix_update: Updating MSI-X with pirq 151 gvec 0x24 gflags 0x0 (entry: 0x0)
[2023-10-16 14:19:46] [00:06.0] msi_msix_update: Updating MSI-X with pirq 150 gvec 0x25 gflags 0x2 (entry: 0x1)
[2023-10-16 14:19:46] [00:06.0] msi_msix_update: Updating MSI-X with pirq 149 gvec 0x25 gflags 0x0 (entry: 0x2)
[2023-10-16 14:19:46] [00:06.0] msi_msix_update: Updating MSI-X with pirq 148 gvec 0x26 gflags 0x2 (entry: 0x3)
[00:06.0] msi_msix_update: Updating MSI-X with pirq 149 gvec 0x27 gflags 0x2 (entry: 0x2)
[2023-10-16 14:20:16] [00:06.0] msi_msix_update: Updating MSI-X with pirq 150 gvec 0x25 gflags 0x0 (entry: 0x1)

matheusd commented 11 months ago

Logs for rc3. Will try the suggested options next.

gpu-qube-rc3.tar.gz

matheusd commented 11 months ago

* add `permissive=true` option via qvm-pci
* add `pci=nomsi` to the qube's kernelopts

These did not help.

matheusd commented 11 months ago

NVIDIA bug report from rc4:

nvidia-bug-report.log.gz

neowutran commented 11 months ago

I am having the same issue.

I did a backup of my working archlinux HVM ( I was using R4.2 already )
Installed R4.2RC4
Restored the archlinux HVM (and havn't done anything out of the ordinary before trying to activate the GPU)

With nvidia-open drivers: journalctl:

oct. 20 20:18:15 gpu_gaming kernel: nvidia: loading out-of-tree module taints kernel.
oct. 20 20:18:15 gpu_gaming kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
oct. 20 20:18:15 gpu_gaming kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 238
oct. 20 20:18:15 gpu_gaming kernel: NVRM cpuidInfoAMD: Unrecognized AMD processor in cpuidInfoAMD
oct. 20 20:18:15 gpu_gaming kernel: xen: --> pirq=24 -> irq=40 (gsi=40)
oct. 20 20:18:15 gpu_gaming kernel: nvidia 0000:00:06.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
oct. 20 20:18:15 gpu_gaming kernel: NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.113.01  Release Build  (archlinux-builder@)  
....
oct. 20 20:19:31 gpu_gaming kernel: NVRM unixCallVideoBIOS: int10h(4f02, 0003) vesa call failed! (01d0, 0003)
oct. 20 20:19:31 gpu_gaming kernel: NVRM nvCheckOkFailedNoLog: Check failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from pRmApi->Control(pRmApi, nv->rmapi.hClient, nv->rmapi.hSubDevice, NV2080_CTRL_CMD_INTERNAL_DI>
oct. 20 20:19:31 gpu_gaming kernel: NVRM unixCallVideoBIOS: int10h(4f03, 0000) vesa call failed! (01d0, 0000)

dm log:

[00:06.0] xen_pt_msgctrl_reg_write: setup MSI (register: 81).
[2023-10-20 20:18:17] [00:06.0] msi_msix_setup: requested pirq 87 for MSI (vec: 0x0, entry: 0x0)
[2023-10-20 20:18:17] [00:06.0] xen_pt_msi_setup: MSI mapped with pirq 87.
[2023-10-20 20:18:17] [00:06.0] msi_msix_update: Updating MSI with pirq 87 gvec 0x0 gflags 0x7057 (entry: 0x0)

With nvidia drivers: dmesg

[    6.106133] pipewire[684]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[    6.767264] NVRM: Xid (PCI:0000:00:06): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[    6.767272] NVRM: GPU 0000:00:06.0: GPU has fallen off the bus.
[    6.767341] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
[    6.767954] NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x23:0xf:1426)
[    6.767995] NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 0

dm log

[2023-10-20 20:32:58] [00:06.0] xen_pt_msgctrl_reg_write: setup MSI (register: 81).
[2023-10-20 20:32:58] [00:06.0] msi_msix_setup: requested pirq 87 for MSI (vec: 0x0, entry: 0x0)
[2023-10-20 20:32:58] [00:06.0] xen_pt_msi_setup: MSI mapped with pirq 87.
[2023-10-20 20:32:58] [00:06.0] msi_msix_update: Updating MSI with pirq 87 gvec 0x0 gflags 0x7057 (entry: 0x0)
[00:06.0] xen_pt_msi_set_enable: disabling MSI.
[2023-10-20 20:32:58] [00:06.0] msi_msix_disable: Unbind MSI with pirq 87, gvec 0x0
[2023-10-20 20:32:58] [00:06.0] msi_msix_disable: Unmap MSI pirq 87
[2023-10-20 20:32:58] [00:06.0] xen_pt_msgctrl_reg_write: setup MSI (register: 81).
[2023-10-20 20:32:58] [00:06.0] msi_msix_setup: requested pirq 87 for MSI (vec: 0x0, entry: 0x0)
[2023-10-20 20:32:58] [00:06.0] xen_pt_msi_setup: MSI mapped with pirq 87.
[2023-10-20 20:32:58] [00:06.0] msi_msix_update: Updating MSI with pirq 87 gvec 0x0 gflags 0x7057 (entry: 0x0)
[2023-10-20 20:32:58] [00:06.0] xen_pt_msi_set_enable: disabling MSI.
[2023-10-20 20:32:58] [00:06.0] msi_msix_disable: Unbind MSI with pirq 87, gvec 0x0
[2023-10-20 20:32:58] [00:06.0] msi_msix_disable: Unmap MSI pirq 87
[2023-10-20 20:32:58] [00:06.0] xen_pt_msgctrl_reg_write: setup MSI (register: 81).
[2023-10-20 20:32:58] [00:06.0] msi_msix_setup: requested pirq 87 for MSI (vec: 0x0, entry: 0x0)
[2023-10-20 20:32:58] [00:06.0] xen_pt_msi_setup: MSI mapped with pirq 87.
[2023-10-20 20:32:58] [00:06.0] msi_msix_update: Updating MSI with pirq 87 gvec 0x0 gflags 0x7057 (entry: 0x0)
[2023-10-20 20:32:59] [00:06.0] xen_pt_msi_set_enable: disabling MSI.
[2023-10-20 20:32:59] [00:06.0] msi_msix_disable: Unbind MSI with pirq 87, gvec 0x0
[2023-10-20 20:32:59] [00:06.0] msi_msix_disable: Unmap MSI pirq 87
[2023-10-20 20:32:59] [00:06.0] xen_pt_msgctrl_reg_write: setup MSI (register: 81).
[2023-10-20 20:32:59] [00:06.0] msi_msix_setup: requested pirq 87 for MSI (vec: 0x0, entry: 0x0)
[2023-10-20 20:32:59] [00:06.0] xen_pt_msi_setup: MSI mapped with pirq 87.
[2023-10-20 20:32:59] [00:06.0] msi_msix_update: Updating MSI with pirq 87 gvec 0x0 gflags 0x7057 (entry: 0x0)
[2023-10-20 20:32:59] [00:06.0] xen_pt_msi_set_enable: disabling MSI.
[2023-10-20 20:32:59] [00:06.0] msi_msix_disable: Unbind MSI with pirq 87, gvec 0x0
[2023-10-20 20:32:59] [00:06.0] msi_msix_disable: Unmap MSI pirq 87

Edit: nvidia driver seems to be borked nvidia-open driver seems to work correctly even if there are errors in dmesg like I pasted in this comment. Yesterday I was unable to make Xorg work due to probably some archlinux weirdness ("Authorization required, but no authorization protocol specified")

matheusd commented 11 months ago

FWIW, downgrading the kernel in dom0 to the rc3 one (6.1.43) does not make a difference.

SaswatPadhi commented 9 months ago

I ran into this issue today. I'm running the 6.6.2 kernel on R4.2

matheusd commented 8 months ago

Just tested on 4.2.0 and it seems (still testing) that using the open nvidia driver does indeed work, including CUDA support. This has been my latest test setup:

Create a standalone qube based on the debian12 template
sudo apt install nvidia-open-kernel-dkms
Reboot
sudo apt install nvidia-cuda-toolkit
Reboot
echo "options nvidia NVreg_OpenRmEnableUnsupportedGpus=1" | sudo tee /etc/modprobe.d/nvidia-gsp.conf
- This comes from the Nvidia CUDA Toolkit docs
Reboot

So far, it seems to be running my test CUDA loads.

ztmzzz commented 8 months ago

does nvidia-smi work?

matheusd commented 8 months ago

does nvidia-smi work?

It does. I can also run a sample CUDA workload successfully (which was my ultimate goal). One thing that hasn't worked so far is actual output to a display connected to the GPU (so no actual gaming, only headless workloads).

matheusd commented 8 months ago

Unfortunately, it looks as if I'm hitting https://github.com/QubesOS/qubes-issues/issues/4321 again. Even though I have the v4.17.2-8 update, and tried the stubdom and xen.xml fixes, it doesn't seem like I can boot the gpu VM with greater than ~2GiB RAM.

renehoj commented 8 months ago

I'm having the same issue, seems like the v4.17.2-8 update broke GPU passthrough for me.

I'm also only able to use around 2GB memory in HVM with GPU attached, before the update I was able to use 24GB without any issues, and the HVM works fine without the GPU attached.

renehoj commented 8 months ago

I have this error in the log

[2024-01-20 18:46:32] wrote 2048 bytes to vchan
[2024-01-20 18:46:32] wrote 1251 bytes to vchan
[2024-01-20 18:46:32] vchan client disconnected
[2024-01-20 18:46:32] processing error - resetting ehci HC
[2024-01-20 18:46:32] random: crng init done
[2024-01-20 18:47:32] pcifront pci-0: Rescanning PCI Frontend Bus 0000:00
[2024-01-20 18:47:32] pci_bus 0000:00: busn_res: [bus 00-ff] is released
[2024-01-20 18:47:32] ------------[ cut here ]------------
[2024-01-20 18:47:32] sysfs group 'power' not found for kobject '0000:00'
[2024-01-20 18:47:32] WARNING: CPU: 0 PID: 10 at 0xffffffff810e9710
[2024-01-20 18:47:32] CPU: 0 PID: 10 Comm: xenwatch Tainted: G                T 5.10.200-xen-stubdom #1
[2024-01-20 18:47:32] RIP: e030:0xffffffff810e9710
[2024-01-20 18:47:32] Code: f6 74 2e 49 89 fc 48 89 df e8 ab fc ff ff 48 89 c3 48 85 c0 75 23 49 8b 14 24 48 8b 75 00 48 c7 c7 ff dd 84 81 e8 87 82 16 00 <0f> 0b 5b 5d 41 5c c3 48 89 df e8 30 d0 ff ff 48 89 ee 48 89 df e8
[2024-01-20 18:47:32] RSP: e02b:ffffc90000253d20 EFLAGS: 00010282
[2024-01-20 18:47:32] RAX: 0000000000000033 RBX: 0000000000000000 RCX: 0000000000000003
[2024-01-20 18:47:32] RDX: 000000000000009f RSI: 00000000ffffefff RDI: 0000000000000200
[2024-01-20 18:47:32] RBP: ffffffff81822800 R08: 0000000000000000 R09: ffffc90000253b68
[2024-01-20 18:47:32] R10: ffffffff81a3efe0 R11: ffffc90000253b60 R12: ffff8880045d2d20
[2024-01-20 18:47:32] R13: ffff8880045d2c00 R14: dead000000000100 R15: ffff8880045d2c28
[2024-01-20 18:47:32] FS:  0000000000000000(0000) GS:ffffffff81a37000(0000) knlGS:0000000000000000
[2024-01-20 18:47:32] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[2024-01-20 18:47:32] CR2: 00007d34fd978230 CR3: 0000000002177000 CR4: 0000000000050660
[2024-01-20 18:47:32] Call Trace:
[2024-01-20 18:47:32]  ? 0xffffffff81251965
[2024-01-20 18:47:32]  ? 0xffffffff810e9710
[2024-01-20 18:47:32]  ? 0xffffffff812422cf
[2024-01-20 18:47:32]  ? 0xffffffff8125471a
[2024-01-20 18:47:32]  ? 0xffffffff812547ca
[2024-01-20 18:47:32]  ? 0xffffffff814009bf
[2024-01-20 18:47:32]  ? 0xffffffff810e9710
[2024-01-20 18:47:32]  ? 0xffffffff81189355
[2024-01-20 18:47:32]  ? 0xffffffff8118926b
[2024-01-20 18:47:32]  ? 0xffffffff81167d3c
[2024-01-20 18:47:32]  ? 0xffffffff81167d68
[2024-01-20 18:47:32]  ? 0xffffffff81167dd2
[2024-01-20 18:47:32]  ? 0xffffffff81175eff
[2024-01-20 18:47:32]  ? 0xffffffff8118cdf0
[2024-01-20 18:47:32]  ? 0xffffffff8118b54e
[2024-01-20 18:47:32]  ? 0xffffffff81189485
[2024-01-20 18:47:32]  ? 0xffffffff812435c6
[2024-01-20 18:47:32]  ? 0xffffffff811748a2
[2024-01-20 18:47:32]  ? 0xffffffff8118926b
[2024-01-20 18:47:32]  ? 0xffffffff8117622e
[2024-01-20 18:47:32]  ? 0xffffffff8117496c
[2024-01-20 18:47:32]  ? 0xffffffff8104896a
[2024-01-20 18:47:32]  ? 0xffffffff810408eb
[2024-01-20 18:47:32]  ? 0xffffffff81040815
[2024-01-20 18:47:32]  ? 0xffffffff8100294f
[2024-01-20 18:47:32] ---[ end trace cec492390cec0693 ]---
[2024-01-20 18:47:32] pcifront pci-0: 22 freeing event channel 4

neowutran commented 8 months ago

I have this error in the log

[2024-01-20 18:46:32] wrote 2048 bytes to vchan
[2024-01-20 18:46:32] wrote 1251 bytes to vchan
[2024-01-20 18:46:32] vchan client disconnected
[2024-01-20 18:46:32] processing error - resetting ehci HC
[2024-01-20 18:46:32] random: crng init done
[2024-01-20 18:47:32] pcifront pci-0: Rescanning PCI Frontend Bus 0000:00
[2024-01-20 18:47:32] pci_bus 0000:00: busn_res: [bus 00-ff] is released
[2024-01-20 18:47:32] ------------[ cut here ]------------
[2024-01-20 18:47:32] sysfs group 'power' not found for kobject '0000:00'
[2024-01-20 18:47:32] WARNING: CPU: 0 PID: 10 at 0xffffffff810e9710
[2024-01-20 18:47:32] CPU: 0 PID: 10 Comm: xenwatch Tainted: G                T 5.10.200-xen-stubdom #1
[2024-01-20 18:47:32] RIP: e030:0xffffffff810e9710
[2024-01-20 18:47:32] Code: f6 74 2e 49 89 fc 48 89 df e8 ab fc ff ff 48 89 c3 48 85 c0 75 23 49 8b 14 24 48 8b 75 00 48 c7 c7 ff dd 84 81 e8 87 82 16 00 <0f> 0b 5b 5d 41 5c c3 48 89 df e8 30 d0 ff ff 48 89 ee 48 89 df e8
[2024-01-20 18:47:32] RSP: e02b:ffffc90000253d20 EFLAGS: 00010282
[2024-01-20 18:47:32] RAX: 0000000000000033 RBX: 0000000000000000 RCX: 0000000000000003
[2024-01-20 18:47:32] RDX: 000000000000009f RSI: 00000000ffffefff RDI: 0000000000000200
[2024-01-20 18:47:32] RBP: ffffffff81822800 R08: 0000000000000000 R09: ffffc90000253b68
[2024-01-20 18:47:32] R10: ffffffff81a3efe0 R11: ffffc90000253b60 R12: ffff8880045d2d20
[2024-01-20 18:47:32] R13: ffff8880045d2c00 R14: dead000000000100 R15: ffff8880045d2c28
[2024-01-20 18:47:32] FS:  0000000000000000(0000) GS:ffffffff81a37000(0000) knlGS:0000000000000000
[2024-01-20 18:47:32] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[2024-01-20 18:47:32] CR2: 00007d34fd978230 CR3: 0000000002177000 CR4: 0000000000050660
[2024-01-20 18:47:32] Call Trace:
[2024-01-20 18:47:32]  ? 0xffffffff81251965
[2024-01-20 18:47:32]  ? 0xffffffff810e9710
[2024-01-20 18:47:32]  ? 0xffffffff812422cf
[2024-01-20 18:47:32]  ? 0xffffffff8125471a
[2024-01-20 18:47:32]  ? 0xffffffff812547ca
[2024-01-20 18:47:32]  ? 0xffffffff814009bf
[2024-01-20 18:47:32]  ? 0xffffffff810e9710
[2024-01-20 18:47:32]  ? 0xffffffff81189355
[2024-01-20 18:47:32]  ? 0xffffffff8118926b
[2024-01-20 18:47:32]  ? 0xffffffff81167d3c
[2024-01-20 18:47:32]  ? 0xffffffff81167d68
[2024-01-20 18:47:32]  ? 0xffffffff81167dd2
[2024-01-20 18:47:32]  ? 0xffffffff81175eff
[2024-01-20 18:47:32]  ? 0xffffffff8118cdf0
[2024-01-20 18:47:32]  ? 0xffffffff8118b54e
[2024-01-20 18:47:32]  ? 0xffffffff81189485
[2024-01-20 18:47:32]  ? 0xffffffff812435c6
[2024-01-20 18:47:32]  ? 0xffffffff811748a2
[2024-01-20 18:47:32]  ? 0xffffffff8118926b
[2024-01-20 18:47:32]  ? 0xffffffff8117622e
[2024-01-20 18:47:32]  ? 0xffffffff8117496c
[2024-01-20 18:47:32]  ? 0xffffffff8104896a
[2024-01-20 18:47:32]  ? 0xffffffff810408eb
[2024-01-20 18:47:32]  ? 0xffffffff81040815
[2024-01-20 18:47:32]  ? 0xffffffff8100294f
[2024-01-20 18:47:32] ---[ end trace cec492390cec0693 ]---
[2024-01-20 18:47:32] pcifront pci-0: 22 freeing event channel 4

( Hi, This is a different issue. I just debugged the setup of someone. It seems that on some hardware, the previous TOLUD patch need to be removed for the v4.17.2-8 update to work. https://github.com/QubesOS/qubes-vmm-xen-stubdom-linux/pull/61 https://github.com/QubesOS/qubes-issues/issues/4321#issuecomment-1902480115 )

neowutran commented 8 months ago

@matheusd did you do more tests / do you have some time to do some test ? you say the nvidia driver is working in RC3 ? I think some update broke the official proprietary nvidia driver. I would guess it is either xen update or stubdom update. If you have some time, can you try to take a up to date R4.2 system and downgrade xen and stubdom to the version available in RC3 ?

matheusd commented 8 months ago

did you do more tests / do you have some time to do some test ?

I can do it later today, sure thing. Anything to help out trying to trace the root cause for this.

you say the nvidia driver is working in RC3 ?

That's right.

If you have some time, can you try to take a up to date R4.2 system and downgrade xen and stubdom to the version available in RC3 ?

My test system currently has 4.2.0 installed. Do you know which packages and versions specifically I should donwgrade?

neowutran commented 8 months ago

Between "qubes-vmm-xen" (the repository that provide the hypervisor package) version 4.17.2-8 (current version) and version 4.17.1-4 (the last xen version where I can confirm the nvidia driver work) something broke the nvidia driver. So need to test to find which qubes-vmm-xen release broke the nvidia driver, then find which commit, then find which line in xen hypervisor. I just started the process, but it is going to take some time. ( I only see 2 commits that seems interesting regarding this issue, but since computer are magical-trolling-devices I won't rush it)

Edit: I was wrong. The commit that created this issue seems to be https://github.com/QubesOS/qubes-vmm-xen-stubdom-linux/commit/ca8a488cbad8a30432c457f4f5ea81369cb69fcf . @matheusd on your test setup you can try to run "sudo qubes-dom0-update xen-hvm-stubdom-linux-full-4.2.6 xen-hvm-stubdom-linux-4.2.6" to confirm if it solve your issue

matheusd commented 8 months ago

4.2.6 does indeed seem to fix it, while 4.2.7 gives the 'fallen off the bus' error.

qubesos-bot commented 6 months ago

Automated announcement from builder-github

The package vmm-xen-stubdom-linux has been pushed to the r4.2 stable repository for the Debian template. To install this update, please use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update

qubesos-bot commented 6 months ago

Automated announcement from builder-github

The package vmm-xen-stubdom-linux has been pushed to the r4.2 stable repository for the Debian template. To install this update, please use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update

QubesOS / qubes-issues