Closed seungsoo-lee closed 6 months ago
Error 0xe refers to WBINVD_REQUIRED, meaning WBINVD has not been executed.
It's a bit weird, because from the code, every SEV_CMD_SNP_DF_FLUSH is followed by wbinvd_on_all_cpus.
My suggestion: update the SEV firmware to the latest version. The firmware can be downloaded from https://www.amd.com/en/developer/sev.html. The EPYC 9224 belongs to the Genoa series.
Reference:
Please provide the output of: cpuid -1 -r -l 0x8000001f
Hi @tlendacky
the output is as follows;
CPU: 0x8000001f 0x00: eax=0x030ffffb ebx=0x000041b3 ecx=0x000003ee edx=0x00000064
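For readers following along, here is a minimal sketch of how those four register values can be decoded. The bit layout is my reading of the CPUID Fn8000_001F description in the AMD programmer's manual, and the function and key names are made up for illustration, so please check them against the manual before relying on them.

```python
def decode_cpuid_8000001f(eax, ebx, ecx, edx):
    """Decode CPUID leaf 0x8000001F (AMD memory-encryption capabilities).

    Bit positions are an assumption based on the AMD APM description of
    this leaf; the dictionary keys are illustrative names only.
    """
    return {
        "sme": bool(eax & (1 << 0)),             # Secure Memory Encryption
        "sev": bool(eax & (1 << 1)),             # Secure Encrypted Virtualization
        "page_flush_msr": bool(eax & (1 << 2)),  # Page flush MSR available
        "sev_es": bool(eax & (1 << 3)),          # Encrypted State
        "sev_snp": bool(eax & (1 << 4)),         # Secure Nested Paging
        "vmpl": bool(eax & (1 << 5)),            # VM Permission Levels
        "c_bit_position": ebx & 0x3F,            # EBX[5:0]
        "phys_addr_reduction": (ebx >> 6) & 0x3F,  # EBX[11:6]
        "num_vmpls": (ebx >> 12) & 0xF,          # EBX[15:12]
        "max_encrypted_guests": ecx,
        "min_sev_no_es_asid": edx,
    }


# Register values taken from the cpuid output posted in this thread.
caps = decode_cpuid_8000001f(0x030ffffb, 0x000041b3, 0x000003ee, 0x00000064)
for name, value in caps.items():
    print(f"{name}: {value}")
```

Under that layout, the hardware itself advertises SEV-SNP, so a later "SEV-SNP enabled in KVM" failure would point at the kernel or BIOS configuration rather than at a missing CPU feature.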
Hi @Tan-YiFan,
When I tried to update the firmware,
it reported: Hardware error: fatal ...
By the way, how can I check the firmware version?
@seungsoo-lee To show the firmware version, you can refer to https://github.com/virtee/snphost
@Tan-YiFan
it says
Running `target/debug/snphost ok`
[ PASS ] - AMD CPU
[ PASS ] - Microcode support
[ PASS ] - Secure Memory Encryption (SME)
[ PASS ] - Secure Encrypted Virtualization (SEV)
[ PASS ] - Encrypted State (SEV-ES)
[ PASS ] - Secure Nested Paging (SEV-SNP)
[ PASS ] - VM Permission Levels
[ PASS ] - Number of VMPLs: 4
[ PASS ] - Physical address bit reduction: 6
[ PASS ] - C-bit location: 51
[ PASS ] - Number of encrypted guests supported simultaneously: 1006
[ PASS ] - Minimum ASID value for SEV-enabled, SEV-ES disabled guest: 100
[ PASS ] - Reading /dev/sev: /dev/sev readable
[ PASS ] - Writing /dev/sev: /dev/sev writable
[ PASS ] - Page flush MSR: DISABLED
[ PASS ] - KVM supported: API version: 12
[ PASS ] - SEV enabled in KVM: enabled
[ PASS ] - SEV-ES enabled in KVM: enabled
[ FAIL ] - SEV-SNP enabled in KVM: Error - /sys/module/kvm_amd/parameters/sev_snp does not exist
[ PASS ] - Memlock resource limit: Soft: 8351006720 | Hard: 8351006720
ERROR: One or more tests in sevctl-ok reported a failure
Error: One or more tests in sevctl-ok reported a failure
Could you advise me on what the problem is? (By the way, where is the AMD EPYC firmware version in that output?)
[ FAIL ] - SEV-SNP enabled in KVM: Error - /sys/module/kvm_amd/parameters/sev_snp does not exist
Do you have a kernel with SNP hypervisor support built and booted?
@seungsoo-lee It should be snphost show version. Reference: https://github.com/virtee/snphost/blob/main/docs/snphost.1.adoc
But might the problem be that you are not booting the newly installed kernel?
@Tan-YiFan
I cannot see the Genoa firmware version..
$ sudo ./snphost show version
ERROR: unable to retrieve SNP platform status
Error: unable to retrieve SNP platform status
Caused by: Known(IoError(Os { code: 22, kind: InvalidInput, message: "Invalid argument" }))
Yes. My ultimate goal is to run confidential computing (H100 with AMD EPYC 9224) following NVIDIA development documents. I conducted some tests to figure out the problem.
However,
ccp 0000:09:00.5: SEV: failed to INIT error 0xe or ccp 0000:09:00.5: SEV-SNP: failed to INIT error 0x3
Hi @tlendacky
As mentioned above, when I use the kernel built with SNP hypervisor support (5.19, SNP-aware), booting gets stuck.
It's very hard to say what is happening. The return code 0xe indicates that a WBINVD is required. Are you limiting the number of CPUs via the command line? Or disabling SMT via the command line?
Please capture the full dmesg output and attach as a file here or put in a pastebin site to view.
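While collecting the log, it can help to pull the INIT failure codes out of it mechanically. The sketch below is only an illustration: the helper name is hypothetical, 0xe = WBINVD_REQUIRED is the only mapping confirmed in this thread, and the other table entries are my recollection of the SEV return codes in the kernel's psp-sev.h header. SNP firmware status codes may be numbered differently, so treat names shown for SEV-SNP lines as tentative.

```python
import re

# Partial table. Only 0xe is confirmed in this thread; the other names are
# an assumption recalled from the kernel's SEV return-code enum and should
# be verified against the header or the firmware specification.
SEV_RET_CODES = {
    0x0: "SUCCESS",
    0x3: "INVALID_CONFIG",
    0xE: "WBINVD_REQUIRED",
    0xF: "DF_FLUSH_REQUIRED",
}

# Matches the ccp driver messages quoted in this thread.
INIT_FAIL_RE = re.compile(r"(SEV(?:-SNP)?): failed to INIT error (0x[0-9a-fA-F]+)")


def decode_init_failures(log_text):
    """Return (subsystem, numeric code, symbolic name) for each INIT failure."""
    results = []
    for match in INIT_FAIL_RE.finditer(log_text):
        code = int(match.group(2), 16)
        results.append((match.group(1), code, SEV_RET_CODES.get(code, "UNKNOWN")))
    return results


# The two messages reported earlier in this thread.
sample = (
    "ccp 0000:09:00.5: SEV: failed to INIT error 0xe\n"
    "ccp 0000:09:00.5: SEV-SNP: failed to INIT error 0x3\n"
)
failures = decode_init_failures(sample)
for subsystem, code, name in failures:
    print(f"{subsystem}: {code:#x} ({name})")
```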
Thanks for the reply, @tlendacky
How can I check whether SMT is being disabled via the command line?
For the CPUs,
$ sudo dmidecode -t processor | grep 'Socket Designation'
Socket Designation: P0
Socket Designation: P1
It seems that both CPUs work fine.
"Please capture the full dmesg output and attach as a file here or put in a pastebin site to view." --> OK, I will.
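One way to answer the command-line question is to look at the kernel command line (/proc/cmdline) for options that limit CPUs or SMT. The following is a rough sketch, not an exhaustive check: the function names are hypothetical, the option list (nosmt, nosmp, maxcpus, nr_cpus, possible_cpus, mitigations=...,nosmt) is my own selection of documented kernel parameters, and the sample command line is invented for the example. On a running system, /sys/devices/system/cpu/smt/control also reports the current SMT state.

```python
# Kernel parameters that can keep CPUs or SMT siblings offline.
# This list is an assumption for illustration, not a complete inventory.
CPU_LIMITING_KEYS = {"nosmt", "nosmp", "maxcpus", "nr_cpus", "possible_cpus"}


def cpu_limiting_params(cmdline):
    """Return the command-line tokens that restrict CPUs or SMT."""
    hits = []
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key in CPU_LIMITING_KEYS:
            hits.append(token)
        elif key == "mitigations" and "nosmt" in value.split(","):
            # e.g. mitigations=auto,nosmt may also disable SMT
            hits.append(token)
    return hits


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read().strip()


# Hypothetical command lines, not taken from the reporter's machine.
restricted = cpu_limiting_params("BOOT_IMAGE=/vmlinuz root=/dev/sda2 ro nosmt maxcpus=48")
unrestricted = cpu_limiting_params("BOOT_IMAGE=/vmlinuz root=/dev/sda2 ro quiet")
print(restricted)
print(unrestricted)
```

On the affected server, `cpu_limiting_params(read_cmdline())` returning an empty list would suggest the command line is not what keeps a CPU offline.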
@tlendacky
I have attached 2 dmesg files, both taken with the default BIOS settings, which are as follows:
BIOS setting:
SMEE --> Auto
SEV-ES ASID Space Limit --> 1
SNP Memory Coverage --> Auto
IOMMU --> Enabled
SEV-SNP support --> Auto
First one (init.txt) --> right after installing Ubuntu 22.04.
Second one (5.19.0-rc6-snp) --> on Ubuntu 22.04, right after installing the new kernel 5.19.0-rc6-snp-host-c4daeffce56e.
In particular, when I filter the second one for 'error', the results are as follows:
$ sudo dmesg | grep -i error
[ 11.854445] ERST: Error Record Serialization Table (ERST) support is initialized.
[ 12.304912] BERT: Error records from previous boot:
[ 12.310623] [Hardware Error]: event severity: fatal
[ 12.316415] [Hardware Error]: Error 0, type: fatal
[ 12.322230] [Hardware Error]: fru_text: SmnError
[ 12.328098] [Hardware Error]: section type: unknown, a2860cc1-8987-4b7c-b86a-d508b176ba70
[ 12.334233] [Hardware Error]: section length: 0x8
[ 12.340393] [Hardware Error]: 00000000: 0000000f 010102e0 ........
[ 12.394923] RAS: Correctable Errors collector initialized.
@seungsoo-lee The dmesg you provided implies that SMT is enabled. One AMD EPYC 9224 has 24 cores and 48 threads. The log shows 96 CPUs on your 2-socket server, so SMT is enabled.
Could you provide the dmesg file with SEV-SNP enabled in BIOS?
I need the dmesg output for when BIOS is configured to enable SEV/SNP. Please set
@Tan-YiFan
The attached file is the output with SEV-SNP support set to 'Enabled'; all other settings are at their defaults.
@tlendacky
Once those 4 values are set, booting gets stuck, as shown in the following screenshot.
@seungsoo-lee Could you provide the dmesg with all 4 values set? In this setting:
BIOS setting:
SMEE --> Enabled
SEV-ES ASID Space Limit --> 100
SNP Memory Coverage --> Enabled
IOMMU --> Enabled
SEV-SNP support --> Enabled
After booting, dmesg shows as follows
root@ubuntu-h100:~# dmesg | grep -i -e rmp -e sev -e snp
[ 21.301847] ccp 0000:09:00.5: sev enabled
[ 21.543946] ccp 0000:09:00.5: SEV: failed to INIT error 0xe
[ 21.920354] SEV supported: 907 ASIDs
[ 21.920355] SEV-ES supported: 99 ASIDs
If the server is equipped with ipmi/idrac, you could use ipmitool to get the dmesg even if the server gets stuck.
@Tan-YiFan
The above output is from the default 5.15 kernel.
But if I install the SNP-aware kernel (https://github.com/AMDESE/AMDSEV/tree/sev-snp-devel), booting gets stuck.
In summary,
SEV-SNP/SMEE enabled + 5.15 --> booting is okay but we have 'ccp 0000:09:00.5: SEV: failed to INIT error 0xe'.
SEV-SNP/SMEE enabled + 5.19 --> booting gets stuck.
It seems that the server supports ipmitool, but when booting gets stuck, how can I get the dmesg?
@seungsoo-lee Run ipmitool on a remote machine:
ipmitool -I lanplus -H <ipmi_host_ip or hostname> -U <user> -P <password> sol activate | tee -a boot.log
Run this command on a remote machine that can connect to
@Tan-YiFan @tlendacky
I'm really thankful for your advice.
Actually, I have tried many times to install the NVIDIA drivers and the SNP-aware kernel. I suspect there is something wrong with the NVIDIA H100, so I contacted the vendor, and they will retrieve the workstation and examine it.
After that, I will retry and see whether the same problem appears.
[ 0.799716] smp: Bringing up secondary CPUs ...
[ 0.799738] x86: Booting SMP configuration:
[ 0.799740] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23
[ 0.858249] .... node #1, CPUs: #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35 #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47
[ 0.967833] .... node #0, CPUs: #48
[ 10.967719] smpboot: do_boot_cpu failed(-1) to wakeup CPU#48
[ 10.967962] #49
[ 10.970077] Spectre V2 : Update user space SMT mitigation: STIBP always-on
[ 10.970077] #50 #51 #52 #53 #54 #55 #56 #57 #58 #59 #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71
[ 11.021917] .... node #1, CPUs: #72 #73 #74 #75 #76 #77 #78 #79 #80 #81 #82 #83 #84 #85 #86 #87 #88 #89 #90 #91 #92 #93 #94 #95
[ 11.079894] smp: Brought up 2 nodes, 95 CPUs
This would explain the issue with a WBINVD not being performed. For some reason CPU #48 fails to come up. Since it doesn't come up, the WBINVD is not performed on it and the SEV firmware sees that and fails the INIT.
I'm not sure why that CPU is failing to boot, but that needs to be corrected first.
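To spot this kind of failure quickly in a saved boot log, something like the sketch below could be used. It is an illustration only: the function name is hypothetical, the regular expressions are written against the two message formats quoted above, and the expected CPU count (2 sockets x 24 cores x 2 threads = 96) comes from the hardware described in this thread.

```python
import re

# Patterns written against the smpboot/smp messages quoted in this thread.
FAILED_RE = re.compile(r"do_boot_cpu failed\((-?\d+)\) to wakeup CPU#(\d+)")
BROUGHT_UP_RE = re.compile(r"smp: Brought up (\d+) nodes?, (\d+) CPUs?")


def summarize_smp_bringup(log_text, expected_cpus):
    """Report CPUs that failed to start and how many are missing overall."""
    failed = [int(m.group(2)) for m in FAILED_RE.finditer(log_text)]
    match = BROUGHT_UP_RE.search(log_text)
    online = int(match.group(2)) if match else None
    missing = expected_cpus - online if online is not None else None
    return {"failed_cpus": failed, "online": online, "missing": missing}


# Two lines copied from the dmesg excerpt above.
boot_log = (
    "[ 10.967719] smpboot: do_boot_cpu failed(-1) to wakeup CPU#48\n"
    "[ 11.079894] smp: Brought up 2 nodes, 95 CPUs\n"
)
expected = 2 * 24 * 2  # sockets x cores per socket x threads per core
summary = summarize_smp_bringup(boot_log, expected)
print(summary)
```

A non-zero "missing" count means at least one CPU never ran the WBINVD that the SEV firmware checks for during INIT.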
System/Motherboard: gigabyte G493-ZB2 (G493-ZB2 (rev. AAP1) | GPU Servers - GIGABYTE)
OS: Ubuntu 22.04 server
CPU: AMD EPYC 9224 x 2
GPU: NVIDIA H100
RAM: 64G / SSD: 2T
BIOS setting:
SMEE --> Enabled
SEV-ES ASID Space Limit --> 100
SNP Memory Coverage --> Enabled
IOMMU --> Enabled
SEV-SNP support --> Enabled
After booting, dmesg shows as follows
root@ubuntu-h100:~# dmesg | grep -i -e rmp -e sev -e snp
[ 21.301847] ccp 0000:09:00.5: sev enabled
[ 21.543946] ccp 0000:09:00.5: SEV: failed to INIT error 0xe
[ 21.920354] SEV supported: 907 ASIDs
[ 21.920355] SEV-ES supported: 99 ASIDs
How to fix it?