QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Qubes 4.1 not booting on ThinkPad E495 #7096

Open shaaati opened 2 years ago

shaaati commented 2 years ago

Qubes OS release

It seems that any Qubes 4.1 version is affected. In particular, I tried:

Brief summary

The in-place upgrade runs through to the last step, where you have to restart the system. When I do so, the screen stays black. The same happens when I try to boot from any of the above-mentioned ISO images. Using additional kernel parameters, I can see output that indicates it might be an issue with a PCI device (see below).

Steps to reproduce

Try to boot any Qubes 4.1 release on a Lenovo ThinkPad E495. To see the output of the Xen boot process, add loglvl=all vga=,keep console=vga to the kernel options.
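For anyone editing the boot entry by hand, the following is a minimal Python sketch of merging the options above into an existing command line string. The helper name `set_boot_options` and the starting command line in the usage note are assumptions for illustration; the actual line in your GRUB entry will differ.

```python
def set_boot_options(cmdline: str, options: list[str]) -> str:
    """Return cmdline with each option applied.

    An option whose key (the part before '=') already exists replaces the
    old value in place; options with new keys are appended at the end.
    """
    wanted = {opt.split("=", 1)[0]: opt for opt in options}
    seen = set()
    result = []
    for token in cmdline.split():
        key = token.split("=", 1)[0]
        if key in wanted:
            # Replace the first occurrence, drop any later duplicates.
            if key not in seen:
                result.append(wanted[key])
                seen.add(key)
        else:
            result.append(token)
    for key, opt in wanted.items():
        if key not in seen:
            result.append(opt)
    return " ".join(result)


# Options quoted in the steps above.
XEN_DEBUG_OPTIONS = ["loglvl=all", "vga=,keep", "console=vga"]
```

With a hypothetical starting line of `console=none dom0_mem=min:1024M`, `set_boot_options(..., XEN_DEBUG_OPTIONS)` replaces the existing `console=` value and appends the other two options. The same helper also handles bare flags such as `dom0_vcpus_pin`, since a token without `=` is treated as its own key.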

Expected behavior

The system should boot up into Qubes.

Actual behavior

During the boot process, GRUB is shown and I can choose one of the available options. No matter what I choose, the screen stays black and nothing happens, no matter how long I wait. I know that this device seems to not get along too well with Qubes/Xen (several minor issues with Qubes 4.0). During my research for potential causes I found the above-mentioned command line parameters in another issue regarding AMD systems and tried them out. The result can be seen in the following image.

The system gets to the step where PCI devices are added and hangs once it reaches device id 0000:03:00.0. When booting from another Linux OS (e.g., my Qubes 4.0 installation that I backed up before the in-place-upgrade and could therefore re-image) I can see that this seems to be the SD card reader (03:00.0 SD Host controller: O2 Micro, Inc. SD/MMC Card Reader Controller (rev 01)).
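The lookup described above (mapping the address Xen prints to a device name) can be sketched in Python. This assumes you have saved plain `lspci` output from a working Linux system as text; the function name `describe_pci_device` is hypothetical, and the second sample line in the usage note is made up for illustration.

```python
from typing import Optional


def describe_pci_device(lspci_output: str, xen_address: str) -> Optional[str]:
    """Return the lspci description for a PCI address, or None if not found.

    Xen prints addresses as domain:bus:device.function (0000:03:00.0),
    while plain lspci usually omits the domain (03:00.0), so both forms
    are accepted.
    """
    if xen_address.count(":") == 2:
        short_address = xen_address.split(":", 1)[1]
    else:
        short_address = xen_address
    for line in lspci_output.splitlines():
        address, _, description = line.strip().partition(" ")
        if address in (short_address, xen_address):
            return description.strip()
    return None
```

For example, given text containing the line `03:00.0 SD Host controller: O2 Micro, Inc. SD/MMC Card Reader Controller (rev 01)`, calling `describe_pci_device(text, "0000:03:00.0")` returns the part after the address.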

I am a layperson when it comes to Xen (I am glad I got this far), but am I correct that Xen might not like my internal SD card reader? Interestingly, the older Xen version in Qubes 4.0 did not show similar problems. Is there anything I could do to get around this (for my setup) blocking issue for Qubes 4.1? I tried searching for kernel parameters to disable PCI devices on boot but didn't find anything useful. In the long term, I would of course prefer if I could use my SD card reader in Qubes as well.

DemiMarie commented 2 years ago

Can you disable the SD card reader in your firmware (BIOS) menu?

shaaati commented 2 years ago

Can you disable the SD card reader in your firmware (BIOS) menu?

Thank you @DemiMarie for bringing that up. I had thought about it during troubleshooting, but then it slipped my mind. I can indeed deactivate the SD card reader in the UEFI menu and this helps in advancing the boot process, but it does not complete successfully. In EFI boot mode, it stops at the line Linux agpgart interface v0.103. When searching for this on the Internet I find multiple threads on different Linux forums talking about graphics bugs and kernel panics. I can't see a kernel panic itself but maybe that is because the system hangs before it is able to print something? I just read about the possibility to try out agp=off as a kernel parameter. Haven't tried that out yet but I will report once I reboot to try it. In legacy boot mode, there is just a black screen without any debug output available.

Interestingly, my Qubes 4.0 installation now also fails to boot. Similar to what is described in #5416, the boot process hangs after entering the LUKS passphrase at the step Starting Switch Root.... After a few seconds, the system reboots. Even more interestingly, that issue is also related to graphics on an AMD-based system. I fail to understand why this would affect me now, because just two days ago Qubes 4.0 was working fine, as I stated in my first post (and as far as I know, it didn't install any updates on its last boot).

This issue might however also be caused by the fact that I recently mounted my Qubes system from within a live Linux to access the VM contents. I imagine that somehow this could cause a dirty filesystem which impedes the boot process. For now, I wouldn't give this too much attention (unless it helps to diagnose the Qubes 4.1 issue, that is).

Edit: tested with agp=off but that didn't help. On the contrary, I am not sure the boot process shows a consistent behavior. One time, it stopped much earlier in the boot sequence at around 450 seconds (see screenshot). With VGA console enabled, the agpgart message would show up at around 760 seconds.

Another time, I got stuck in a never-ending loop of BUG: soft lockup - CPU#3 stuck for 23s! call traces. I didn't experiment further due to the high amount of time it takes for one single boot attempt with VGA console and verbose output. If you have suggestions what I could do to further identify the issue I will gladly try to gather more output.
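If the console output can be captured as text (for example over a serial console, which is an assumption here and not something tried in this thread), a small Python sketch like the following could summarise which CPUs report soft lockups, so that boot attempts can be compared without reading every call trace. The function name is hypothetical.

```python
import re
from collections import Counter

# Matches messages of the form quoted above:
#   BUG: soft lockup - CPU#3 stuck for 23s!
SOFT_LOCKUP = re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s!")


def count_soft_lockups(log_text: str) -> Counter:
    """Count soft lockup messages per CPU number in captured boot output."""
    counts: Counter = Counter()
    for match in SOFT_LOCKUP.finditer(log_text):
        counts[int(match.group(1))] += 1
    return counts
```

The result maps each CPU number to the number of lockup messages seen for it; an empty result means no such message appeared in the captured text.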

DemiMarie commented 2 years ago

This issue might however also be caused by the fact that I recently mounted my Qubes system from within a live Linux to access the VM contents. I imagine that somehow this could cause a dirty filesystem which impedes the boot process. For now, I wouldn't give this too much attention (unless it helps to diagnose the Qubes 4.1 issue, that is).

What storage pool are you using? LVM2 (the default) is quite fragile, sadly.

shaaati commented 2 years ago

Yup, a default LVM2 installation. No worries though, I could always revert back to the backup I took before trying the upgrade. However, a setup without any suspend capability doesn't really work for my use case which is why I decided to give Qubes 4.1 a shot.

Obviously, a non-booting OS is even less useful. Sadly, it seems the E495 was a bad choice for a system I wanted to run Qubes on :/.

If there is anything I could do to diagnose hardware issues I'll be glad to help, but I think for the time being I will revert to some traditional Linux distribution as my main OS on this device because I really need a working laptop.

DemiMarie commented 2 years ago

My first suggestion would be to try R4.1 with BTRFS, and disable various stuff in the firmware.

Vultucs commented 2 years ago

I think this is a common issue with newer Ryzen processors on Xen. Try dom0_max_vcpus=1 dom0_vcpus_pin.

If that works but updating (sudo qubes-dom0-update) doesn't let you remove them afterwards, install with the latest ISO here: https://forum.qubes-os.org/t/qubesos-4-1-alpha-signed-weekly-builds/3601

11/23 build = xen-hypervisor-4.14.3-1
11/28, 12/4 build = xen-hypervisor-4.14.3-5

If that doesn't work, try dom0_max_vcpus=1 dom0_vcpus_pin clocksource=tsc tsc_mode=2 cpufreq=xen:performance max_cstate=0. You could also try upgrading or downgrading your motherboard BIOS, if possible.

Had the same issue on my 5900x until I installed Qubes-20211128-kernel-latest-x86_64.iso. clocksource=tsc tsc_mode=2 cpufreq=xen:performance max_cstate=0 lets my 5900x boost properly; this will drain the battery faster, though.

shaaati commented 2 years ago

I think this is a common issue with newer Ryzen processors on Xen. Try dom0_max_vcpus=1 dom0_vcpus_pin.

A big thumbs up to you @Vultucs! At least this brought me to a working installer on the rc-2 ISO. I'm a bit short on time and therefore haven't continued from there.

But this looks promising 🙂.

shaaati commented 2 years ago

Okay, Xen-4.14.3-5 seems to have fixed "something". When I re-tried the in-place upgrade, it installed this version by itself, so no need to download the weekly ISO.

This Xen version no longer hangs when I activate the SD card reader in UEFI. (On a side note, I found that deactivating it was the reason why 4.0 no longer booted 🤔).

dom0_max_vcpus=1 dom0_vcpus_pin is required for a successful boot.

After booting into Qubes 4.1, however, (almost) no VM is functional. I managed to get sys-net up and running after playing around with the PCI devices and the kernel used by the VM. It is now running with kernel 5.15.5-1 and only Ethernet attached to it. At least once during my tests, I have seen the WiFi symbol pop up in the tray bar. Using sys-net as updateVM, I was able to download the Fedora 34 template. I have no idea why this worked. When I now try to install packages in dom0 (I tried kernel and kernel-latest), sometimes the shell window opens and shows a dnf command hanging early in the repo sync phase, and sometimes there is no reaction at all. sys-net is not dead though, because "Open console in qube" gives me a working text console. The network seems to be somewhat working, but definitely not okay (e.g., ping gives me a response almost immediately for the first packet to an internal IP address, but will then hang and not send further packets).

After adding clocksource=tsc tsc_mode=2 cpufreq=xen:performance max_cstate=0 to the boot parameters, my "vault" VM started up for the first time and seems to be usable. However, any VM that is attached to another VM refuses to boot. /var/log/libvirt/libxl/libxl-driver.log shows libxl_device.c:1146:device_backend_callback: Domain <NUMBER>:unable to add device with path /local/domain/4/backend/vif/<NUMBER>/0 and three more messages related to adding and removing of this vif device.
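To see which backend domain the failing devices belong to, the libxl message quoted above can be picked apart with a short Python sketch. It relies on the usual xenstore backend path layout, /local/domain/<backend domid>/backend/<type>/<frontend domid>/<device id>. The function name is hypothetical, and the domain number 7 in the usage note is a made-up stand-in for the <NUMBER> placeholders.

```python
import re

# Matches the libxl error quoted above and captures the IDs in the path.
BACKEND_ERROR = re.compile(
    r"Domain (\d+):unable to add device with path "
    r"/local/domain/(\d+)/backend/(\w+)/(\d+)/(\d+)"
)


def parse_backend_errors(log_text: str) -> list[dict]:
    """Extract backend/frontend domain IDs from 'unable to add device' lines."""
    errors = []
    for match in BACKEND_ERROR.finditer(log_text):
        errors.append(
            {
                "domain": int(match.group(1)),
                "backend_domain": int(match.group(2)),
                "device_type": match.group(3),
                "frontend_domain": int(match.group(4)),
                "device_id": int(match.group(5)),
            }
        )
    return errors
```

For a line ending in `Domain 7:unable to add device with path /local/domain/4/backend/vif/7/0`, this reports backend domain 4 and device type `vif`. If every failure points at the same backend domain, that suggests the problem sits with the VM providing the network backend instead of with each VM attached to it.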

There is more that I could report but I think it all can be summarised as: the system is acting weird and not all behavior is reproducible. I will try to download the most recent weekly release and do a clean install with both regular and latest kernel.

Update: did a clean installation using the weekly ISO from December 18 (regular kernel, not kernel latest), and this one seems to work fine if I supply the dom0_max_vcpus parameters.

While waiting for the installation to finish, I found #6055, which sounded an awful lot like the problems I described above. The main difference is that this thread is about the Ryzen 4000 series, while my laptop runs on a Ryzen 3000. I tried booting with clocksource=tsc tsc=unstable hpetbroadcast=0 (instead of limiting the CPUs on dom0) as described in https://github.com/QubesOS/qubes-issues/issues/6055#issuecomment-799823559 and apparently, this works too. So I guess this issue can be closed as the problem has been found(?). While there seems to be a UEFI update for the T14, I can't find a related update for the E495 :/. As stated in #6055, this issue could be fixed by Xen 4.15.

(Suspend, which was my initial reason for upgrading, still does not work in 4.1. sigh)

DemiMarie commented 2 years ago

Reopening as Qubes OS should work out of the box, without any need to fiddle with the Xen or kernel command lines. @marmarek should I try backporting the relevant patches?

(Suspend, which was my initial reason for upgrading, still does not work in 4.1. sigh)

If the problem is s0ix this could take a while to fix, sorry.

0spinboson commented 2 years ago

@shaaati as an aside: actually, your CPU has a zen+ (ryzen 2k) architecture, but they renumbered the laptop parts for sales reasons, and my own ryzen 2700x (also zen+) also had some issues with this. Ryzen 3000 and 4000 series (desktop and laptop SKUs resp.) also both have / had issues, though.

isodude commented 2 years ago

Regarding suspend: make sure you select S3 (Linux) as suspend mode in BIOS instead of S2Idle (Windows). Also, I'm still struggling with Ryzen 7 4000, even tried out Xen 4.15, latest Mesa, latest kernel.. Hopefully my efforts there will solve some issues on other AMD platforms.

You are correct that it's a BIOS bug to solve regarding hpet broadcast etc. Have you written something about it on the Lenovo Forums? Maybe this is related? https://forums.lenovo.com/t5/Lenovo-C-E-K-M-N-and-V-Series-Notebooks/Lenovo-E945-Performance-are-poor/m-p/5082864?page=1#5347410 Xen can only do so much with the TSC, but if the BIOS is broken.. well, Lenovo needs to fix it. As they did on the 4000 platform! Contact your Lenovo support!

But wait, was the laptop sluggish prior to Q4.1? Is it slow booting without Xen?