Open wessel-novacustom opened 1 month ago
I think I know what the problem is...
I might have been observing it on my MSI desktop too. After the suspend, the CPU features seem not to be programmed equally on all cores. Then, after a reboot, programming the register with the mismatched features (IA32_FEATURE_CONTROL, IIRC) causes an exception in coreboot, which halts the boot process.
Not sure if this is still relevant, but it is worth paying attention to if somebody starts tackling this issue.
CC: @marmarek
For IA32_FEATURE_CONTROL I do see a difference: before suspend 0x5, after suspend 0x7. Can this cause #GP? I'd expect that enabling an extra feature shouldn't...
But I also see another difference: during first boot, VMX is already enabled and locked (by firmware, I assume). But on resume, it's Xen that needs to enable and lock it.
That said, my guess is the above difference is not the cause of the crash on reboot.
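For reference, the 0x5 and 0x7 values above can be decoded with a minimal Python sketch. The bit names follow the SDM wording quoted in this thread (lock in bit 0, "Enable VMX in SMX operation" in bit 1, "Enable VMX outside SMX operation" in bit 2). The helper names `decode_feature_control` and `changed_bits` are made up for illustration.

```python
# Bit layout of IA32_FEATURE_CONTROL (MSR 0x3A) as discussed in this thread.
IA32_FEATURE_CONTROL = 0x3A
LOCK = 1 << 0             # lock bit
VMX_IN_SMX = 1 << 1       # "Enable VMX in SMX operation" (0x2)
VMX_OUTSIDE_SMX = 1 << 2  # "Enable VMX outside SMX operation" (0x4)


def decode_feature_control(value):
    """Return the three low bits as named booleans (hypothetical helper)."""
    return {
        "lock": bool(value & LOCK),
        "vmx_in_smx": bool(value & VMX_IN_SMX),
        "vmx_outside_smx": bool(value & VMX_OUTSIDE_SMX),
    }


def changed_bits(before, after):
    """Names of the fields that differ between two register values."""
    old, new = decode_feature_control(before), decode_feature_control(after)
    return [name for name in old if old[name] != new[name]]


if __name__ == "__main__":
    print(decode_feature_control(0x5))  # value seen before suspend
    print(decode_feature_control(0x7))  # value seen after suspend
    print(changed_bits(0x5, 0x7))
```

Both values have the lock bit and VMX outside SMX set; the only field that differs is VMX in SMX.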
Copying from the other issue message collected on MSI:
(XEN) Hardware Dom0 shutdown: rebooting machine
CPU Index 0 - APIC 0 Unexpected Exception:13 @ 10:46f2b5e1 - Halting
Code: 0 eflags: 00010002 cr2: 00000000
eax: 00000005 ebx: 46f210a0 ecx: 0000003a edx: 00000000
edi: 00000005 esi: 00000000 ebp: 469a19f8 esp: 469a19dc
0x46f2b5a0: 46 ff 50 14 59 59 c3 55
0x46f2b5a8: 8b ec 83 ec 14 56 57 8b
0x46f2b5b0: f2 89 4d fc 8b 4d fc 0f
0x46f2b5b8: 32 ff 75 10 8b ce ff 75
0x46f2b5c0: 0c 52 8b 55 08 50 e8 9e
0x46f2b5c8: 00 00 00 8b f8 8b f2 83
0x46f2b5d0: c4 10 89 7d f0 89 75 f4
0x46f2b5d8: 8b 55 f4 8b 45 f0 8b 4d
0x46f2b5e0: fc 0f 30 8b c7 8b d6 5f
0x46f2b5e8: 5e c9 c3 53 56 57 6a fe
0x46f2b5f0: 8b da 8b f1 8b 4c 24 14
0x46f2b5f8: 83 c8 ff 8b 54 24 1c 5f
0x46f2b600: d3 e7 8b cb d3 e0 f7 d7
0x46f2b608: 23 c7 d3 e2 f7 d0 23 d7
0x46f2b610: 5f 23 c6 5e 0b c2 5b c3
0x46f2b618: 55 8b ec 83 ec 10 83 4d
0x469a1a58: 0x46ba00b0
0x469a1a54: 0x01000000
0x469a1a50: 0x00000006
0x469a1a4c: 0x46f24000
0x469a1a48: 0x00000007
0x469a1a44: 0x46f210a0
0x469a1a40: 0x0000000b
0x469a1a3c: 0x46f2ba25
0x469a1a38: 0x469a1a60
0x469a1a34: 0x00000000
0x469a1a30: 0x00000000
0x469a1a2c: 0x46f25000
0x469a1a28: 0x46f2b93b
0x469a1a24: 0x46f2920b
0x469a1a20: 0x00000100
0x469a1a1c: 0x469a1a38
0x469a1a18: 0x46f2920b
0x469a1a14: 0x46f25000
0x469a1a10: 0x46f25000
0x469a1a0c: 0x46f26000
0x469a1a08: 0x00000000
0x469a1a04: 0x00000000
0x469a1a00: 0x00000001
0x469a19fc: 0x46f291f8
0x469a19f8: 0x469a1a58 <-ebp
0x469a19f4: 0x0000003a
0x469a19f0: 0xffffffff
0x469a19ec: 0x00000000
0x469a19e8: 0x00000005
0x469a19e4: 0x00000003
0x469a19e0: 0x00000006
0x469a19dc: 0x46f24000 <-esp
That code: 0 doesn't look right... Does the stack give any hints what it tried to execute?
0: 55 push ebp
1: 8b ec mov ebp,esp
3: 83 ec 14 sub esp,0x14
6: 56 push esi
7: 57 push edi
8: 8b f2 mov esi,edx
a: 89 4d fc mov DWORD PTR [ebp-0x4],ecx
d: 8b 4d fc mov ecx,DWORD PTR [ebp-0x4]
10: 0f 32 rdmsr
12: ff 75 10 push DWORD PTR [ebp+0x10]
15: 8b ce mov ecx,esi
17: ff 75 0c push DWORD PTR [ebp+0xc]
1a: 52 push edx
1b: 8b 55 08 mov edx,DWORD PTR [ebp+0x8]
1e: 50 push eax
1f: e8 9e 00 00 00 call 0xc2
24: 8b f8 mov edi,eax
26: 8b f2 mov esi,edx
28: 83 c4 10 add esp,0x10
2b: 89 7d f0 mov DWORD PTR [ebp-0x10],edi
2e: 89 75 f4 mov DWORD PTR [ebp-0xc],esi
31: 8b 55 f4 mov edx,DWORD PTR [ebp-0xc]
34: 8b 45 f0 mov eax,DWORD PTR [ebp-0x10]
37: 8b 4d fc mov ecx,DWORD PTR [ebp-0x4]
3a: 0f 30 wrmsr <-- rip
3c: 8b c7 mov eax,edi
3e: 8b d6 mov edx,esi
40: 5f pop edi
41: 5e pop esi
42: c9 leave
43: c3 ret
44: 53 push ebx
45: 56 push esi
46: 57 push edi
47: 6a fe push 0xfffffffe
49: 8b da mov ebx,edx
4b: 8b f1 mov esi,ecx
4d: 8b 4c 24 14 mov ecx,DWORD PTR [esp+0x14]
51: 83 c8 ff or eax,0xffffffff
54: 8b 54 24 1c mov edx,DWORD PTR [esp+0x1c]
58: 5f pop edi
59: d3 e7 shl edi,cl
5b: 8b cb mov ecx,ebx
5d: d3 e0 shl eax,cl
5f: f7 d7 not edi
61: 23 c7 and eax,edi
63: d3 e2 shl edx,cl
65: f7 d0 not eax
67: 23 d7 and edx,edi
69: 5f pop edi
6a: 23 c6 and eax,esi
6c: 5e pop esi
6d: 0b c2 or eax,edx
6f: 5b pop ebx
70: c3 ret
wrmsr to 0x3A (IA32_FEATURE_CONTROL) with lock bit and "Enable VMX outside SMX operation" set. SDM has a note about the latter:
BIOS must set this bit only when the CPUID function 1 returns the VMX feature flag set (ECX bit 5).
Does it maybe try to disable "Enable VMX in SMX operation." (0x2) when lock bit is already set?
Values read from that MSR would be on stack just below esp (esp-0xC and esp-0x10, if I'm counting it properly), but that isn't visible. But IIUC any write to this register when lock bit is set will result in #GP, even if it doesn't flip any bits.
My point is: maybe the firmware sees 0x2 enabled and tries to disable it (even when it's already locked), which can't work.
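A toy model of that hypothesis in Python, under several assumptions: the MSR still holds the locked 0x7 seen after resume, a locked IA32_FEATURE_CONTROL raises #GP on any WRMSR (per the IIUC above), and the firmware uses a bit-field insert like the routine visible at offsets 0x44-0x70 of the listing (the routine actually called at 0xc2 is outside the dump, so treating it as similar is a guess). `FeatureControlModel`, `GeneralProtectionFault`, `bitfield_write32` and `clear_vmx_in_smx` are hypothetical names, not firmware or Xen APIs.

```python
MASK32 = 0xFFFFFFFF
LOCK = 0x1          # bit 0
VMX_IN_SMX_BIT = 1  # bit position of "Enable VMX in SMX operation" (0x2)


class GeneralProtectionFault(Exception):
    """Stands in for #GP (exception 13) in this toy model."""


class FeatureControlModel:
    """Toy IA32_FEATURE_CONTROL: once the lock bit is set, every write faults."""

    def __init__(self, value=0):
        self.value = value

    def rdmsr(self):
        return self.value

    def wrmsr(self, new_value):
        if self.value & LOCK:
            # Assumed rule: locked means no writes at all, even identical ones.
            raise GeneralProtectionFault(f"wrmsr {new_value:#x} while locked")
        self.value = new_value


def bitfield_write32(operand, start_bit, end_bit, value):
    """Replace bits start_bit..end_bit of operand with value (bits in 0..31)."""
    low_mask = ~(0xFFFFFFFE << end_bit) & MASK32  # bits 0..end_bit
    field = (MASK32 << start_bit) & low_mask      # bits start_bit..end_bit
    return (operand & ~field & MASK32) | ((value << start_bit) & low_mask)


def clear_vmx_in_smx(msr):
    """Read-modify-write that clears bit 1, mirroring the rdmsr ... wrmsr pair."""
    new_value = bitfield_write32(msr.rdmsr(), VMX_IN_SMX_BIT, VMX_IN_SMX_BIT, 0)
    msr.wrmsr(new_value)
    return new_value
```

Starting from a locked 0x7, `clear_vmx_in_smx` computes 0x5 (the value in eax and edi in the dump) and then faults at the write. Starting from an unlocked 0x2 it simply stores 0x0.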
Which MSI and which release is that output from?
Dasharo (coreboot+UEFI) v0.9.1 on PRO Z790-P WIFI (MS-7E06)
I can't see code like that in coreboot. UEFI and Xen would most likely run in 64-bit mode, which leaves FSP. I wasn't able to find a byte string exactly like that, but there are some similar ones in FSP-M. Unfortunately the license doesn't allow disassembling, so I can't check more.
Does anyone know if this can be reproduced outside of Qubes OS/Xen?
Native Linux seems to set this MSR to 0x5 on resume
BTW, native Linux on resume complains:
[ 731.245103] x86/tme: configuration is inconsistent between CPUs
[ 731.245105] x86/tme: MKTME is not usable
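To compare the per-CPU value of IA32_FEATURE_CONTROL on native Linux before and after resume, something like the Python sketch below could be used. It assumes the `msr` kernel module is loaded (so that `/dev/cpu/N/msr` exists) and that it runs as root. The function names are made up for illustration, and it is not meant for use under Xen.

```python
import glob
import os
import struct

IA32_FEATURE_CONTROL = 0x3A


def read_msr(path, index):
    """Read one 64-bit MSR; the msr device uses the MSR index as file offset."""
    fd = os.open(path, os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, index)
    finally:
        os.close(fd)
    return struct.unpack("<Q", raw)[0]


def feature_control_per_cpu(pattern="/dev/cpu/[0-9]*/msr"):
    """Map CPU number to its IA32_FEATURE_CONTROL value."""
    values = {}
    for path in sorted(glob.glob(pattern)):
        cpu = int(os.path.basename(os.path.dirname(path)))
        values[cpu] = read_msr(path, IA32_FEATURE_CONTROL)
    return values


def is_consistent(values):
    """True when every CPU reports the same value."""
    return len(set(values.values())) <= 1


# Example use (as root, after `modprobe msr`):
#   per_cpu = feature_control_per_cpu()
#   for cpu, value in sorted(per_cpu.items()):
#       print(f"cpu{cpu}: {value:#x}")
#   print("consistent" if is_consistent(per_cpu) else "mismatch between CPUs")
```

Running it once after a cold boot and once after S3 resume would show whether the 0x5 value is programmed on every core or only on some of them.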
Is there an option in Dasharo to enable "Enable VMX in SMX operation" bit too? I can't find it in the menu... SMX should be available on NV41, no? And CPUID on MSI claims it's there too.
FSP has only one VmxEnable parameter, I have no idea whether this applies to SMX as well. It is possible that it depends on one of the other settings, or some state that differs between cold boot and resume from S3.
Anyway, firmware writing to the feature control MSR when it's already locked is clearly a bug. Also, enabling (and locking) it during boot but not during resume sounds suspicious. If it's done during boot, it should also be done during resume.
Everything is just as I wrote in my first comment.
Unfortunately FSP and coreboot cannot agree on who should initialize IA32_FEATURE_CONTROL. I had lots of trouble with this register when trying to get Intel TXT to work properly on NV4x. The last release (v0.9.1) was in January, but I determined the right settings for CPU feature programming around Xen Summit in June (when I was showcasing TXT on 10th gen HW, and later at Qubes Summit on NV4x 12th gen). So it obviously won't work until a new release is out at least.
VMX in SMX should not be enabled by either coreboot or FSP on MSI (there is no need to, because the chipset does not support TXT).
The last release (v0.9.1) was in January, but I determined the right settings for CPU feature programming around Xen Summit in June (when I was showcasing TXT on 10th gen HW, and later at Qubes Summit on NV4x 12th gen). So it obviously won't work until a new release is out at least.
Does it mean this issue can be solved on NV4x specifically by "simply" doing a new Dasharo release?
VMX in SMX should not be enabled by either coreboot or FSP on MSI (there is no need to, because the chipset does not support TXT).
Yeah, but also, it shouldn't try to disable it when the lock bit is set... If IA32_FEATURE_CONTROL is left for Xen to set, it will enable VMX in SMX based on CPUID only (see xen/arch/x86/hvm/vmx/vmcs.c:_vmx_cpu_up()).
Does it mean this issue can be solved on NV4x specifically by "simply" doing a new Dasharo release?
Maybe, maybe not. Regular boot works well. The S3 resume path is problematic here and should be investigated because, clearly, something is not happening as it should, compared to normal boot.
If IA32_FEATURE_CONTROL is left for Xen to set
No, it definitely isn't.
If IA32_FEATURE_CONTROL is left for Xen to set
No, it definitely isn't.
On resume from S3 it is...
If IA32_FEATURE_CONTROL is left for Xen to set
No, it definitely isn't.
On resume from S3 it is...
coreboot or FSP won't let you leave IA32_FEATURE_CONTROL unlocked on either normal boot or S3 resume, sorry. So if Xen required IA32_FEATURE_CONTROL to be unlocked on S3 resume, but not on the normal boot path, then something is wrong. A locked IA32_FEATURE_CONTROL is also a prerequisite for TXT initialization.
Yeah, but also, it shouldn't try to disable it when the lock bit is set.
Yeah, coreboot is well aware of that and won't attempt to do so: https://github.com/coreboot/coreboot/blob/main/src/cpu/intel/common/common_init.c#L40 https://github.com/coreboot/coreboot/blob/main/src/cpu/intel/common/common_init.c#L91
However, FSP is not that smart... It always blindly initializes the MSR as if it is the first entity touching it. Here is the code used by FSP to program CPU features: https://github.com/tianocore/edk2/blob/master/UefiCpuPkg/Library/CpuCommonFeaturesLib/CpuCommonFeaturesLib.c#L97
Although there is a condition to write the feature bit before the lock is set, it is only used to sort the operations into the proper order, not to prevent any writes if the lock is already set. So each registered feature calls the appropriate support function and initialize function. Then the macro CPU_REGISTER_TABLE_TEST_THEN_WRITE_FIELD writes the bit fields if they are not set in the initialize function. But it doesn't look at the lock bit until the actual lock is set in LockFeatureControlRegisterInitialize, which happens as the last step, after all functions that write to IA32_FEATURE_CONTROL.
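The difference between the two approaches can be sketched in Python. This is a loose model of the behaviour described above, not the coreboot or edk2 code: `FakeMsr`, `write_field_if_different`, `program_test_then_write` and `program_lock_aware` are invented names, and the rule that a locked register faults on any write is the assumption made earlier in this thread.

```python
LOCK = 0x1
VMX_IN_SMX = 0x2
VMX_OUTSIDE_SMX = 0x4


class LockedMsrWrite(Exception):
    """Stands in for the #GP raised by writing a locked MSR."""


class FakeMsr:
    """Toy register: any write after the lock bit is set raises."""

    def __init__(self, value=0):
        self.value = value

    def read(self):
        return self.value

    def write(self, new_value):
        if self.value & LOCK:
            raise LockedMsrWrite(hex(new_value))
        self.value = new_value


def write_field_if_different(msr, mask, desired):
    """Write the masked field only when it differs; the lock is not consulted."""
    current = msr.read()
    if (current & mask) != (desired & mask):
        msr.write((current & ~mask) | (desired & mask))


def program_test_then_write(msr, desired):
    """FSP-like ordering: feature bits first, lock bit last."""
    for mask in (VMX_IN_SMX, VMX_OUTSIDE_SMX, LOCK):
        write_field_if_different(msr, mask, desired)


def program_lock_aware(msr, desired):
    """coreboot-like guard: leave an already locked register alone."""
    if msr.read() & LOCK:
        return False
    msr.write(desired)
    return True
```

In this model, `program_test_then_write` with a desired value of 0x5 works from 0x0 (cold boot) and is a no-op on a locked 0x5, but on a locked 0x7 the VMX-in-SMX field differs, so it attempts a write and faults. `program_lock_aware` returns `False` in that case and touches nothing.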
It is a huge PITA. I lost weeks' worth of time figuring it out... Then there is also the TME stuff, which isn't properly programmed at S3 resume, as you already noticed. Another PITA I haven't looked into yet.
coreboot or FSP won't let you leave IA32_FEATURE_CONTROL unlocked on either normal boot or S3 resume, sorry.
I understand what you say. You talk about theory. I added a debug print on the resume path in Xen and clearly saw the lock bit not set there. I talk about practice.
coreboot or FSP won't let you leave IA32_FEATURE_CONTROL unlocked on either normal boot or S3 resume, sorry.
I understand what you say. You talk about theory. I added a debug print on the resume path in Xen and clearly saw the lock bit not set there. I talk about practice.
Ohh, so that's what happens... Interesting... That would mean FSP is not programming it on S3 resume...
@marmarek CPU feature programming on S3 resume was disabled by default until 4 months ago... https://github.com/tianocore/edk2/commit/b7db4d895a8c06d124bec185f824fddfaa556450
That explains what you see (and why FSP didn't program it)... :face_exhaling:
Component
Dasharo firmware, EC firmware
Device
NovaCustom NV4x 12th Gen
Dasharo version
v1.7.2
Dasharo Tools Suite version
No response
Test case ID
No response
Brief summary
Qubes OS doesn't reboot after S3 suspend-to-RAM was triggered and needs a forced reboot.
How reproducible
100% reproducible.
How to reproduce
Expected behavior
The laptop reboots.
Actual behavior
The screen remains black and the laptop doesn't restart.
The laptop needs a forced restart by holding the power button until it switches off, then start again by pressing the power button normally.
Screenshots
https://github.com/user-attachments/assets/126a6d3a-3763-40ad-b8bb-2c94a594a067
Additional context
The issue was submitted to Qubes OS as well: https://github.com/QubesOS/qubes-issues/issues/9511
The issue doesn't seem to happen on Ubuntu and Fedora.
Solutions you've tried
Marek tried the reboot=acpi GRUB option without luck.