GalliumOS / galliumos-distro

Docs, issues, and artwork sources for GalliumOS
https://galliumos.org/
GNU General Public License v2.0

Suspend/Resume on Kaby Lake : \o/ #596

Open elthariel opened 3 years ago

elthariel commented 3 years ago

TL;DR: I propose to use this issue to track available information and progress about suspend/resume on the Kaby Lake platform.

Hi,

I recently acquired an HP Chromebook 15 (aka Syndra). When resuming from suspend under Linux, I get the 'Corrupt OS' message and have to reboot into CrOS to fix the issue.

I'm trying to investigate the issue but I wasn't able to find any open issue tracking it or any reference to someone working on this. If you read this and have a tiny lead about where to start looking to solve this, please share ! (wink @MrChromeBox)

elthariel commented 3 years ago

If you happen to run into the 'Chrome OS missing or corrupted' screen, please press TAB and share the recovery_reason and active firmware id :) You can then reboot, go into Chrome OS, start a shell and run sudo crossystem dev_boot_legacy=1 to be able to boot into GNU/Linux again.

I've tried it with the default galliumos kernel (4.16.18-galliumos) and the stock ubuntu kernel (5.4.0-52-lowlatency) and I get:

recovery_reason: 0x2b / 0x2b Secure NVRAM (TPM) initialization error
active firmware id: Google_Nami.10775.101.0

elthariel commented 3 years ago

afaict, the failure path in the chromiumos code is the following:

Here, either the CRC check fails, or the version isn't right and it triggers the recovery mode

elthariel commented 3 years ago

I was able to suspend/resume using a kernel built from the chromium os tree using a mix of their nami board kernel version/config and the stock gallium kernel config.

The kernel tree is here: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/chromeos-4.4

I've been suspending using echo mem | sudo tee /sys/power/state, which is what is used by the powerd_suspend script.
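Before writing to /sys/power/state it can help to confirm that the running kernel actually advertises the mem state. Here is a minimal sketch of that check; the function name supports_sleep_state and its optional file argument are my own additions for illustration, not something from powerd_suspend.

```shell
#!/bin/sh
# supports_sleep_state STATE [FILE]
# Succeeds if STATE (e.g. "mem") is listed in FILE.
# FILE defaults to /sys/power/state; the parameter exists only so the
# check can be tried against a sample file (an assumption of this sketch).
supports_sleep_state() {
    state="$1"
    file="${2:-/sys/power/state}"
    [ -r "$file" ] || return 1
    # The file holds one space-separated line, e.g. "freeze mem disk".
    for s in $(cat "$file"); do
        [ "$s" = "$state" ] && return 0
    done
    return 1
}
```

With that defined, something like `supports_sleep_state mem && echo mem | sudo tee /sys/power/state` only attempts the suspend when the state is listed.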

ATM, closing the lid triggers a different kind of suspend which can never be resumed: the keyboard backlight comes back but the screen stays black (but I don't get the corrupt OS error).

A semi-educated guess, based on my reading of the kernel source code while it was building, is that the chromium OS kernel uses the firmware to suspend instead of ACPI signals/messages/interrupts/whatever when compiled with the right options.

elthariel commented 3 years ago

The lid thing can be fixed (mostly?) by creating or updating /etc/systemd/sleep.conf with the following content:

[Sleep]
SuspendState=mem

MrChromebox commented 3 years ago

this is a known issue with CR50 devices running stock firmware.

On resume from S3/suspend, Google's verified boot code is looking to the TPM to confirm the previous boot was successful, which requires the OS to set a flag in the TPM. If the flag isn't set, vboot assumes the previous boot failed, and bails to recovery mode and clears the crossystem flags.

Up until recently (kernel 5.6?), there has been no driver for the CR50 TPM in the mainline kernel, and even now it's not selected by default IIRC. One needs a recent kernel with the CR50 TPM driver enabled to mitigate this, or to run UEFI firmware which doesn't implement Google's verified boot idiocy for non-ChromeOS booting.
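For anyone wanting to check whether a given kernel was built with a CR50 TPM driver, here is a rough sketch that greps a kernel config file. It assumes the mainline Kconfig symbol names CONFIG_TCG_TIS_SPI_CR50 and CONFIG_TCG_TIS_I2C_CR50; vendor trees may use other names, and the function name has_cr50_tpm_driver is made up for this example.

```shell
#!/bin/sh
# has_cr50_tpm_driver CONFIG_FILE
# Succeeds if CONFIG_FILE enables a CR50 TPM driver, built-in (=y) or
# as a module (=m). Symbol names are assumed from the mainline Kconfig.
has_cr50_tpm_driver() {
    grep -Eq '^CONFIG_TCG_TIS_(SPI|I2C)_CR50=(y|m)$' "$1"
}
```

On Ubuntu-style installs the config of the running kernel is usually at /boot/config-$(uname -r), so `has_cr50_tpm_driver "/boot/config-$(uname -r)" && echo "cr50 driver enabled"` would be a typical call.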

elthariel commented 3 years ago

@MrChromebox Thanks for answering so quickly 💌 .

Does your answer imply that with your coreboot build there is no such issue?

Also, I don't understand how this TPM previous-boot confirmation fits with the CRC check failure / version mismatch I've seen in the code? See the comment here.

I'll give the latest mainline kernel a try, but in the meantime, my current solution of using the chromium os 4.4 kernel tree looks very promising (it doesn't work using the lid trigger, maybe the fw is interfering here or it's just configuration, and wifi isn't working when I resume)

MrChromebox commented 3 years ago

@elthariel not implying, definitively stating. CR50 devices running my UEFI firmware do not have this issue.

the TPM boot status is part of the vb2 security data, IIRC. I've not looked at this recently, but had discussed the issue with Google engineers back when it first surfaced. Either the legacy-booted OS needs to set the boot state in the TPM, or the firmware needs to not check it when resuming from suspend on the legacy boot path. The latter was originally how Google was going to handle dual booting Windows but then that got scrapped.

elthariel commented 3 years ago

@MrChromebox Very nice to know. I'll update my firmware to your build once the famous SuzyQ cable becomes available.

In the meantime I'll check for the cr50 support in the mainline kernel, or keep using the chrome platform kernel where the cr50 tpm is supported.

That being said, I'm still a tiny bit skeptical about the explanation given by the Google engineer. I'll try to dig a bit deeper to see if I can get a clearer picture, assuming the code in platform/vboot_reference of their repo is the one actually used on my machine :)

elthariel commented 3 years ago

So, suspend/resume (and wifi) are working nicely on my Syndra board. I'll update this issue with the details:

I've been using the chromium os kernel tree:

The kernel config can be found here: https://gist.github.com/elthariel/d9f8dd2528cf36627c63555c4b7a3275#file-config-5-4-73-lta-6

To make this kernel work properly, you might need to fetch a few firmware files from the Chrome OS partition, or from the upstream repos.

elthariel commented 3 years ago

Resume sometimes doesn't work correctly when the suspend was triggered by closing the lid. When you reopen it, the screen is sometimes treated as disconnected. This small systemd-sleep hack fixes it:

https://gist.github.com/elthariel/d9f8dd2528cf36627c63555c4b7a3275#file-00_wake_up_screen-sh
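For readers unfamiliar with systemd-sleep hooks, here is a generic sketch of how such a hook is structured. It is not the content of the gist above. systemd-sleep runs every executable in /usr/lib/systemd/system-sleep/ with $1 set to pre or post and $2 set to the sleep type. The wake_screen body is a placeholder and the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical hook, e.g. /usr/lib/systemd/system-sleep/00_wake_up_screen.sh
# systemd-sleep calls each hook with:
#   $1 = "pre" or "post"
#   $2 = "suspend", "hibernate", "hybrid-sleep" or "suspend-then-hibernate"

# Placeholder: replace with whatever re-enables the panel on your setup.
wake_screen() {
    echo "waking screen"
}

on_sleep_event() {
    case "$1/$2" in
        post/suspend) wake_screen ;;
        *) : ;;  # nothing to do before suspend or for other sleep types
    esac
}

# In a real hook the last line would be: on_sleep_event "$1" "$2"
```

The hook file must be executable, and anything it prints ends up in the journal for systemd-suspend.service, which makes it easy to confirm the hook ran.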

me11203sci commented 3 years ago

@elthariel Did you end up installing MrChromebox's firmware, or were you able to find a workaround with the other kernel? I have a Kaby Lake machine (Lenovo C340-15) with similar hang-ups to the ones you described and was wondering if you could give more detail as to what you did to fix it on your machine. I am pretty new to computing in general and would like to better understand how exactly the kernel fix works.