Dasharo / dasharo-issues

The Dasharo issue tracker
https://dasharo.com/
25 stars 0 forks source link

Qubes OS doesn't reboot after S3 suspend (NV41PZ) #1097

Open wessel-novacustom opened 1 month ago

wessel-novacustom commented 1 month ago

Component

Dasharo firmware, EC firmware

Device

NovaCustom NV4x 12th Gen

Dasharo version

v1.7.2

Dasharo Tools Suite version

No response

Test case ID

No response

Brief summary

Qubes OS doesn't reboot after S3 suspend-to-RAM was triggered and needs a forced reboot.

How reproducible

100% reproducible.

How to reproduce

  1. Make sure that Qubes OS is installed. This happens on a new clean installation, kernel version 6.6. The Dasharo version is the latest release: v1.7.2.
  2. Set S3 suspend-to-RAM in the UEFI firmware settings. Save the settings with F10 --> Y --> Enter --> Ctrl + Alt + Del.
  3. Boot to Qubes OS.
  4. Trigger S3 suspending in whatever way (closing lid or via GUI).
  5. Reboot the laptop.

Expected behavior

The laptop reboots.

Actual behavior

The screen remains black and the laptop doesn't restart.

The laptop needs a forced restart by holding the power button until it switches off, then start again by pressing the power button normally.

Screenshots

https://github.com/user-attachments/assets/126a6d3a-3763-40ad-b8bb-2c94a594a067

Additional context

The issue was submitted to Qubes OS as well: https://github.com/QubesOS/qubes-issues/issues/9511

The issue doesn't seem to happen on Ubuntu and Fedora.

Solutions you've tried

Marek tried the reboot=acpi GRUB-option without luck.

miczyg1 commented 1 month ago

I think I know what the problem is...

It might have been observing it on my MSI desktop too. So, after the suspend, the CPU features seem not to be programmed equally on all cores. Then after a reboot, the register with the mismatched features (IA32_FEATURE_CONTROL IIRC), when programmed, causes an exception in coreboot, which halts the boot process.

Not sure if this is still relevant, but worth to pay attention to if somebody will start tackling this issue.

wessel-novacustom commented 1 month ago

CC: @marmarek

marmarek commented 1 month ago

For IA32_FEATURE_CONTROL I do see a difference: before suspend 0x5, after suspend 0x7. Can this cause #GP? I'd expect enabling extra feature shouldn't... But also I see a difference: during first boot, VMX is already enabled and locked (by firmware I assume). But on resume, it's Xen who need to enable and lock it.

That said, my guess is the above difference is not the cause of the crash on reboot.

marmarek commented 1 month ago

Copying from the other issue message collected on MSI:

(XEN) Hardware Dom0 shutdown: rebooting machine
CPU Index 0 - APIC 0 Unexpected Exception:13 @ 10:46f2b5e1 - Halting
Code: 0 eflags: 00010002 cr2: 00000000
eax: 00000005 ebx: 46f210a0 ecx: 0000003a edx: 00000000
edi: 00000005 esi: 00000000 ebp: 469a19f8 esp: 469a19dc

0x46f2b5a0: 46 ff 50 14 59 59 c3 55 
0x46f2b5a8: 8b ec 83 ec 14 56 57 8b 
0x46f2b5b0: f2 89 4d fc 8b 4d fc 0f 
0x46f2b5b8: 32 ff 75 10 8b ce ff 75 
0x46f2b5c0: 0c 52 8b 55 08 50 e8 9e 
0x46f2b5c8: 00 00 00 8b f8 8b f2 83 
0x46f2b5d0: c4 10 89 7d f0 89 75 f4 
0x46f2b5d8: 8b 55 f4 8b 45 f0 8b 4d 
0x46f2b5e0: fc 0f 30 8b c7 8b d6 5f 
0x46f2b5e8: 5e c9 c3 53 56 57 6a fe 
0x46f2b5f0: 8b da 8b f1 8b 4c 24 14 
0x46f2b5f8: 83 c8 ff 8b 54 24 1c 5f 
0x46f2b600: d3 e7 8b cb d3 e0 f7 d7 
0x46f2b608: 23 c7 d3 e2 f7 d0 23 d7 
0x46f2b610: 5f 23 c6 5e 0b c2 5b c3 
0x46f2b618: 55 8b ec 83 ec 10 83 4d 
0x469a1a58: 0x46ba00b0
0x469a1a54: 0x01000000
0x469a1a50: 0x00000006
0x469a1a4c: 0x46f24000
0x469a1a48: 0x00000007
0x469a1a44: 0x46f210a0
0x469a1a40: 0x0000000b
0x469a1a3c: 0x46f2ba25
0x469a1a38: 0x469a1a60
0x469a1a34: 0x00000000
0x469a1a30: 0x00000000
0x469a1a2c: 0x46f25000
0x469a1a28: 0x46f2b93b
0x469a1a24: 0x46f2920b
0x469a1a20: 0x00000100
0x469a1a1c: 0x469a1a38
0x469a1a18: 0x46f2920b
0x469a1a14: 0x46f25000
0x469a1a10: 0x46f25000
0x469a1a0c: 0x46f26000
0x469a1a08: 0x00000000
0x469a1a04: 0x00000000
0x469a1a00: 0x00000001
0x469a19fc: 0x46f291f8
0x469a19f8: 0x469a1a58 <-ebp
0x469a19f4: 0x0000003a
0x469a19f0: 0xffffffff
0x469a19ec: 0x00000000
0x469a19e8: 0x00000005
0x469a19e4: 0x00000003
0x469a19e0: 0x00000006
0x469a19dc: 0x46f24000 <-esp

That code: 0 doesn't look right... Does the stack give any hints what it tried to execute?

krystian-hebel commented 1 month ago
0:  55                      push   ebp
1:  8b ec                   mov    ebp,esp
3:  83 ec 14                sub    esp,0x14
6:  56                      push   esi
7:  57                      push   edi
8:  8b f2                   mov    esi,edx
a:  89 4d fc                mov    DWORD PTR [ebp-0x4],ecx
d:  8b 4d fc                mov    ecx,DWORD PTR [ebp-0x4]
10: 0f 32                   rdmsr
12: ff 75 10                push   DWORD PTR [ebp+0x10]
15: 8b ce                   mov    ecx,esi
17: ff 75 0c                push   DWORD PTR [ebp+0xc]
1a: 52                      push   edx
1b: 8b 55 08                mov    edx,DWORD PTR [ebp+0x8]
1e: 50                      push   eax
1f: e8 9e 00 00 00          call   0xc2
24: 8b f8                   mov    edi,eax
26: 8b f2                   mov    esi,edx
28: 83 c4 10                add    esp,0x10
2b: 89 7d f0                mov    DWORD PTR [ebp-0x10],edi
2e: 89 75 f4                mov    DWORD PTR [ebp-0xc],esi
31: 8b 55 f4                mov    edx,DWORD PTR [ebp-0xc]
34: 8b 45 f0                mov    eax,DWORD PTR [ebp-0x10]
37: 8b 4d fc                mov    ecx,DWORD PTR [ebp-0x4]
3a: 0f 30                   wrmsr                           <-- rip
3c: 8b c7                   mov    eax,edi
3e: 8b d6                   mov    edx,esi
40: 5f                      pop    edi
41: 5e                      pop    esi
42: c9                      leave
43: c3                      ret
44: 53                      push   ebx
45: 56                      push   esi
46: 57                      push   edi
47: 6a fe                   push   0xfffffffe
49: 8b da                   mov    ebx,edx
4b: 8b f1                   mov    esi,ecx
4d: 8b 4c 24 14             mov    ecx,DWORD PTR [esp+0x14]
51: 83 c8 ff                or     eax,0xffffffff
54: 8b 54 24 1c             mov    edx,DWORD PTR [esp+0x1c]
58: 5f                      pop    edi
59: d3 e7                   shl    edi,cl
5b: 8b cb                   mov    ecx,ebx
5d: d3 e0                   shl    eax,cl
5f: f7 d7                   not    edi
61: 23 c7                   and    eax,edi
63: d3 e2                   shl    edx,cl
65: f7 d0                   not    eax
67: 23 d7                   and    edx,edi
69: 5f                      pop    edi
6a: 23 c6                   and    eax,esi
6c: 5e                      pop    esi
6d: 0b c2                   or     eax,edx
6f: 5b                      pop    ebx
70: c3                      ret

wrmsr to 0x3A (IA32_FEATURE_CONTROL) with lock bit and "Enable VMX outside SMX operation" set. SDM has a note about the latter:

BIOS must set this bit only when the CPUID function 1 returns the VMX feature flag set (ECX bit 5).

marmarek commented 1 month ago

Does it maybe try to disable "Enable VMX in SMX operation." (0x2) when lock bit is already set?

krystian-hebel commented 1 month ago

Values read from that MSR would be on stack just below esp (esp-0xC and esp-0x10, if I'm counting it properly), but that isn't visible. But IIUC any write to this register when lock bit is set will result in #GP, even if it doesn't flip any bits.

marmarek commented 1 month ago

My point is - maybe firmware see 0x2 enabled and try to disable it (even when its already locked) - which can't work.

krystian-hebel commented 1 month ago

Which MSI and which release is that output from?

marmarek commented 1 month ago

Dasharo (coreboot+UEFI) v0.9.1 on PRO Z790-P WIFI (MS-7E06)

krystian-hebel commented 1 month ago

I can't see a code like than in coreboot. UEFI and Xen would most likely work in 64b, which leaves FSP. I wasn't able to find a byte string exactly like that, but there are some similar ones in FSP-M, unfortunately license doesn't allow disassembling so I can't check more.

Does anyone know if this can be reproduced outside of Qubes OS/Xen?

marmarek commented 1 month ago

Native Linux seems to set this MSR to 0x5 on resume

marmarek commented 1 month ago

BTW, native Linux on resume complains:

[  731.245103] x86/tme: configuration is inconsistent between CPUs
[  731.245105] x86/tme: MKTME is not usable
marmarek commented 1 month ago

Is there an option in Dasharo to enable "Enable VMX in SMX operation" bit too? I can't find it in the menu... SMX should be available on NV41, no? And CPUID on MSI claims it's there too.

krystian-hebel commented 1 month ago

FSP has only one VmxEnable parameter, I have no idea whether this applies to SMX as well. It is possible that it depends on one of the other settings, or some state that differs between cold boot and resume from S3.

marmarek commented 1 month ago

Anyway, firmware writing to the feature control MSR when it's already locked it clearly a bug. And also, enabling (and locking) it during boot but not resume also sounds suspicious. If it's done during boot, it should be also done during resume.

miczyg1 commented 1 month ago

Everything is just as I wrote in my first comment.

Unfortunately FSP and coreboot cannot agree who should initialize the IA32_FEATURE_CONTROL. I had lots of trouble with this register when trying to get Intel TXT to work properly on NV4x. The last release (v0.9.1) was in January, but the right settings for CPU feature programming has been determined by me near Xen Summit in June (when I was showcasing TXT on 10th gen HW, and later on Qubes Summit on NV4x 12th gen). So it obviously won't work until a new release is out at least.

VMX in SMX should not be enabled by coreboot nor FSP on MSI (there is no need to, because chipset does not support TXT).

marmarek commented 1 month ago

The last release (v0.9.1) was in January, but the right settings for CPU feature programming has been determined by me near Xen Summit in June (when I was showcasing TXT on 10th gen HW, and later on Qubes Summit on NV4x 12th gen). So it obviously won't work until a new release is out at least.

Does it mean this issue will can be solved on NV4x specifically by "simply" doing new Dasharo release?

VMX in SMX should not be enabled by coreboot nor FSP on MSI (there is no need to, because chipset does not support TXT).

Yeah, but also, it shouldn't try to disable it when the lock bit is set... If IA32_FEATURE_CONTROL is left for Xen to set, it will enable VMX in SMX based on CPUID only (see xen/arch/x86/hvm/vmx/vmcs.c:_vmx_cpu_up())

miczyg1 commented 1 month ago

Does it mean this issue will can be solved on NV4x specifically by "simply" doing new Dasharo release?

Maybe, maybe not. Regular boot works well. The S3 resume path is problematic here and should be investigated because, clearly, something is not happening as should, compared to normal boot.

If IA32_FEATURE_CONTROL is left for Xen to set

No, it definitely isn't.

marmarek commented 1 month ago

If IA32_FEATURE_CONTROL is left for Xen to set

No, it definitely isn't.

On resume from S3 it is...

miczyg1 commented 1 month ago

If IA32_FEATURE_CONTROL is left for Xen to set

No, it definitely isn't.

On resume from S3 it is...

coreboot or FSP won't let you leave IA32_FEATURE_CONTROL unlocked neither on normal boot nor S3 resume, sorry. So if Xen required IA32_FEATURE_CONTROL to be unlocked on S3 resume, but not on normal boot path, then something is wrong. Locked IA32_FEATURE_CONTROL is also a prerequisite for TXT initialization.

Yeah, but also, it shouldn't try to disable it when the lock bit is set.

Yeah, coreboot is well aware of that and won't attempt to do so: https://github.com/coreboot/coreboot/blob/main/src/cpu/intel/common/common_init.c#L40 https://github.com/coreboot/coreboot/blob/main/src/cpu/intel/common/common_init.c#L91

However, FSP is not that smart... It always blindly initializes the MSR as if it is the first entity touching it. Here is the code used by FSP to program CPU features: https://github.com/tianocore/edk2/blob/master/UefiCpuPkg/Library/CpuCommonFeaturesLib/CpuCommonFeaturesLib.c#L97

Despite there is a condition to write the feature bit before the lock is set, it is only used to sort the operations in proper order, not to prevent any writes if the lock is already set. So each feature registered calls the appropriate support function and initialize function. Then the macro CPU_REGISTER_TABLE_TEST_THEN_WRITE_FIELD write the bit fields if they are not set in the initialize function. But it doesn't look at the lock bit, until the actual lock is set in LockFeatureControlRegisterInitialize, which happens as last step, after all functions that write to IA32_FEATURE_CONTROL.

It is huge PITA. Lost weeks worth of time figuring it out... Then there is also the TME stuff which isn't properly programmed at S3 resume as you already noticed. Another PITA I haven't looked into yet.

marmarek commented 1 month ago

coreboot or FSP won't let you leave IA32_FEATURE_CONTROL unlocked neither on normal boot nor S3 resume, sorry.

I understand what you say. You talk about theory. I added debug print on resume path in Xen and clearly seen lock bit not set there. I talk about practice.

miczyg1 commented 1 month ago

coreboot or FSP won't let you leave IA32_FEATURE_CONTROL unlocked neither on normal boot nor S3 resume, sorry.

I understand what you say. You talk about theory. I added debug print on resume path in Xen and clearly seen lock bit not set there. I talk about practice.

Ohh so that's what happen... Interesting... That would mean FSP is not programming it on S3 resume...

miczyg1 commented 1 month ago

@marmarek CPU feature programming on S3 resume was disabled by default until 4 months ago... https://github.com/tianocore/edk2/commit/b7db4d895a8c06d124bec185f824fddfaa556450

That explains what you see (and why FSP didn't program it)... :face_exhaling: