AMDESE / AMDSEV

AMD Secure Encrypted Virtualization

SEV: failed to INIT error 0xe #202

Closed seungsoo-lee closed 6 months ago

seungsoo-lee commented 6 months ago

System/Motherboard: Gigabyte G493-ZB2 (rev. AAP1)
OS: Ubuntu 22.04 server
CPU: AMD EPYC 9224 x 2
GPU: NVIDIA H100
RAM: 64G / SSD: 2T

BIOS settings:
SMEE --> Enabled
SEV-ES ASID Space Limit --> 100
SNP Memory Coverage --> Enabled
IOMMU --> Enabled
SEV-SNP support --> Enabled

After booting, dmesg shows the following:

root@ubuntu-h100:~# dmesg | grep -i -e rmp -e sev -e snp
[ 21.301847] ccp 0000:09:00.5: sev enabled
[ 21.543946] ccp 0000:09:00.5: SEV: failed to INIT error 0xe
[ 21.920354] SEV supported: 907 ASIDs
[ 21.920355] SEV-ES supported: 99 ASIDs

How to fix it?

Tan-YiFan commented 6 months ago

Error 0xe refers to WBINVD_REQUIRED, meaning WBINVD has not been executed. It's a bit weird, because from the code, every SEV_CMD_SNP_DF_FLUSH is followed by wbinvd_on_all_cpus.
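For readers who want to map such codes to names, here is a minimal lookup sketch. It assumes the code values follow the order of the `sev_ret_code` enum in the Linux header `include/uapi/linux/psp-sev.h`; only the first sixteen values are listed, and the helper name `describe_sev_error` is made up for illustration.

```python
# Subset of SEV firmware status codes. Assumption: values follow the
# order of the sev_ret_code enum in include/uapi/linux/psp-sev.h.
SEV_RET_CODES = {
    0x0: "SUCCESS",
    0x1: "INVALID_PLATFORM_STATE",
    0x2: "INVALID_GUEST_STATE",
    0x3: "INVALID_CONFIG",
    0x4: "INVALID_LEN",
    0x5: "ALREADY_OWNED",
    0x6: "INVALID_CERTIFICATE",
    0x7: "POLICY_FAILURE",
    0x8: "INACTIVE",
    0x9: "INVALID_ADDRESS",
    0xA: "BAD_SIGNATURE",
    0xB: "BAD_MEASUREMENT",
    0xC: "ASID_OWNED",
    0xD: "INVALID_ASID",
    0xE: "WBINVD_REQUIRED",
    0xF: "DFFLUSH_REQUIRED",
}


def describe_sev_error(code):
    """Return the symbolic name for a status code, or a marker if it is not in the table."""
    return SEV_RET_CODES.get(code, f"unknown (0x{code:x})")


# The code reported in the dmesg line "SEV: failed to INIT error 0xe".
print(describe_sev_error(0xE))
```

Codes above 0xf exist in the header but are left out here to keep the table short.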

My suggestion: update the SEV firmware to the latest version. The firmware can be downloaded at https://www.amd.com/en/developer/sev.html. EPYC 9224 is in the Genoa series.


tlendacky commented 6 months ago

Please provide the output of: cpuid -1 -r -l 0x8000001f

seungsoo-lee commented 6 months ago

Hi @tlendacky

the output is as follows;

CPU:
   0x8000001f 0x00: eax=0x030ffffb ebx=0x000041b3 ecx=0x000003ee edx=0x00000064
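The raw registers above can be decoded by hand. The sketch below assumes the field layout of CPUID leaf 0x8000001F as described in the AMD Architecture Programmer's Manual (feature bits in EAX, C-bit position and VMPL count in EBX, ASID counts in ECX/EDX); the function and key names are invented for this example.

```python
def decode_cpuid_8000001f(eax, ebx, ecx, edx):
    """Decode the memory-encryption fields of CPUID leaf 0x8000001F.

    Assumed layout (AMD APM): EAX bits 0..5 are feature flags,
    EBX[5:0] is the C-bit position, EBX[11:6] the physical address
    bit reduction, EBX[15:12] the number of VMPLs.
    """
    return {
        "sme": bool(eax & (1 << 0)),
        "sev": bool(eax & (1 << 1)),
        "page_flush_msr": bool(eax & (1 << 2)),
        "sev_es": bool(eax & (1 << 3)),
        "sev_snp": bool(eax & (1 << 4)),
        "vmpl": bool(eax & (1 << 5)),
        "c_bit_position": ebx & 0x3F,
        "phys_addr_bit_reduction": (ebx >> 6) & 0x3F,
        "num_vmpls": (ebx >> 12) & 0xF,
        "max_encrypted_guests": ecx,
        "min_sev_only_asid": edx,
    }


# Register values taken from the cpuid output posted in this thread.
info = decode_cpuid_8000001f(0x030FFFFB, 0x000041B3, 0x000003EE, 0x00000064)
for key, value in info.items():
    print(f"{key}: {value}")
```

For the posted values this gives C-bit position 51, a physical address reduction of 6 bits, 4 VMPLs, 1006 encrypted guests and a minimum SEV-only ASID of 100, with SME, SEV, SEV-ES and SEV-SNP all reported by the CPU.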

seungsoo-lee commented 6 months ago

Hi @Tan-YiFan,

When I tried to update its firmware,

it says: Hardware error: fatal ...

btw, How can I check the firmware version?

Tan-YiFan commented 6 months ago

@seungsoo-lee To show the firmware version, you can refer to https://github.com/virtee/snphost

seungsoo-lee commented 6 months ago

@Tan-YiFan

it says

Running `target/debug/snphost ok`

[ PASS ] - AMD CPU
[ PASS ] - Microcode support
[ PASS ] - Secure Memory Encryption (SME)
[ PASS ] - Secure Encrypted Virtualization (SEV)
[ PASS ] - Encrypted State (SEV-ES)
[ PASS ] - Secure Nested Paging (SEV-SNP)
[ PASS ] - VM Permission Levels
[ PASS ] - Number of VMPLs: 4
[ PASS ] - Physical address bit reduction: 6
[ PASS ] - C-bit location: 51
[ PASS ] - Number of encrypted guests supported simultaneously: 1006
[ PASS ] - Minimum ASID value for SEV-enabled, SEV-ES disabled guest: 100
[ PASS ] - Reading /dev/sev: /dev/sev readable
[ PASS ] - Writing /dev/sev: /dev/sev writable
[ PASS ] - Page flush MSR: DISABLED
[ PASS ] - KVM supported: API version: 12
[ PASS ] - SEV enabled in KVM: enabled
[ PASS ] - SEV-ES enabled in KVM: enabled
[ FAIL ] - SEV-SNP enabled in KVM: Error - /sys/module/kvm_amd/parameters/sev_snp does not exist
[ PASS ] - Memlock resource limit: Soft: 8351006720 | Hard: 8351006720
ERROR: One or more tests in sevctl-ok reported a failure
Error: One or more tests in sevctl-ok reported a failure

Could you advise me on what the problem is? (By the way, where is the AMD EPYC firmware version in that output?)

tlendacky commented 6 months ago

[ FAIL ] - SEV-SNP enabled in KVM: Error - /sys/module/kvm_amd/parameters/sev_snp does not exist

Do you have a kernel with SNP hypervisor support built and booted?
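One quick way to answer that question is to look at the kvm_amd module parameters directly. The sketch below reads the `sev`, `sev_es` and `sev_snp` files under `/sys/module/kvm_amd/parameters`; the `sev_snp` path is the one named in the failing check, and the helper name `read_kvm_amd_param` is hypothetical.

```python
from pathlib import Path


def read_kvm_amd_param(name, base="/sys/module/kvm_amd/parameters"):
    """Return the stripped value of a kvm_amd module parameter, or None if it is absent."""
    try:
        return (Path(base) / name).read_text().strip()
    except OSError:
        # Missing file or directory: the module is not loaded, or the
        # running kernel does not expose this parameter.
        return None


for name in ("sev", "sev_es", "sev_snp"):
    value = read_kvm_amd_param(name)
    print(f"{name}: {value if value is not None else 'not present'}")
```

If `sev_snp` is reported as not present, the running kernel most likely lacks SNP hypervisor support; comparing against `uname -r` shows which kernel was actually booted.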

Tan-YiFan commented 6 months ago

@seungsoo-lee The command should be snphost show version. Reference: https://github.com/virtee/snphost/blob/main/docs/snphost.1.adoc But could the real problem be that you have not booted the newly installed kernel?

seungsoo-lee commented 6 months ago

@Tan-YiFan

I cannot see the Genoa firmware version..

$ sudo ./snphost show version
ERROR: unable to retrieve SNP platform status
Error: unable to retrieve SNP platform status

Caused by: Known(IoError(Os { code: 22, kind: InvalidInput, message: "Invalid argument" }))


Yes. My ultimate goal is to run confidential computing (H100 with AMD EPYC 9224) following NVIDIA development documents. I conducted some tests to figure out the problem.

However, I get one of the following errors:

ccp 0000:09:00.5: SEV: failed to INIT error 0xe
or
ccp 0000:09:00.5: SEV-SNP: failed to INIT error 0x3

[Attachments: error1, error2]

seungsoo-lee commented 6 months ago

Hi @tlendacky

As mentioned above, when I use the kernel built with SNP hypervisor support (5.19, snp-aware), booting gets stuck.

tlendacky commented 6 months ago

It's very hard to say what is happening. The return code 0xe indicates that a WBINVD is required. Are you limiting the number of CPUs via the command line? Or disabling SMT via the command line?

Please capture the full dmesg output and attach as a file here or put in a pastebin site to view.
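The two command-line questions above can be checked from a running system. The following is a rough sketch: it scans `/proc/cmdline` for the kernel parameters `nosmt`, `maxcpus=` and `nr_cpus=` and reads the SMT state from sysfs. The list is not exhaustive, and the function name `cpu_limiting_params` is made up for this example.

```python
def cpu_limiting_params(cmdline):
    """Return kernel command-line tokens that limit CPUs or disable SMT."""
    watched = {"nosmt", "maxcpus", "nr_cpus"}
    found = []
    for token in cmdline.split():
        key = token.split("=", 1)[0]
        if key in watched:
            found.append(token)
    return found


def read_text_or_none(path):
    """Read a small text file, returning None if it cannot be opened."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


cmdline = read_text_or_none("/proc/cmdline")
if cmdline is not None:
    print("limiting parameters:", cpu_limiting_params(cmdline) or "none")
# Typical values of this file are "on", "off", "forceoff" or "notsupported".
print("SMT control:", read_text_or_none("/sys/devices/system/cpu/smt/control"))
```

An empty result together with an SMT control value of `on` suggests that neither the CPU count nor SMT is being restricted from the command line.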

seungsoo-lee commented 6 months ago

Thanks for the reply, @tlendacky

How can I check whether SMT is disabled via the command line?

For the CPUs:

$ sudo dmidecode -t processor | grep 'Socket Designation'
Socket Designation: P0
Socket Designation: P1

It seems that both CPUs work fine.

Please capture the full dmesg output and attach as a file here or put in a pastebin site to view. --> ok i will.

seungsoo-lee commented 6 months ago

@tlendacky

I have attached 2 dmesg files, which are based on the default BIOS settings, as follows:

BIOS settings:
SMEE --> Auto
SEV-ES ASID Space Limit --> 1
SNP Memory Coverage --> Auto
IOMMU --> Enabled
SEV-SNP support --> Auto

In particular, when I filter for 'error' in the second one, the results are as follows:

$ sudo dmesg | grep -i error
[   11.854445] ERST: Error Record Serialization Table (ERST) support is initialized.
[   12.304912] BERT: Error records from previous boot:
[   12.310623] [Hardware Error]: event severity: fatal
[   12.316415] [Hardware Error]:  Error 0, type: fatal
[   12.322230] [Hardware Error]:  fru_text: SmnError
[   12.328098] [Hardware Error]:   section type: unknown, a2860cc1-8987-4b7c-b86a-d508b176ba70
[   12.334233] [Hardware Error]:   section length: 0x8
[   12.340393] [Hardware Error]:   00000000: 0000000f 010102e0                    ........
[   12.394923] RAS: Correctable Errors collector initialized.

5.19.0-rc6-snp-host.txt init.txt

Tan-YiFan commented 6 months ago

@seungsoo-lee The dmesg you provided implies that SMT is enabled. One AMD EPYC 9224 has 24 cores and 48 threads. The log shows 96 CPUs on your 2-socket server, so SMT is enabled.

Could you provide the dmesg file with SEV-SNP enabled in BIOS?

tlendacky commented 6 months ago


I need the dmesg output for when BIOS is configured to enable SEV/SNP. Please set

seungsoo-lee commented 6 months ago

@Tan-YiFan

The attached file is the output with SEV-SNP support set to 'Enabled' and all other settings left at their defaults.

sev-snp.txt

@tlendacky

Once those 4 values are set, the boot gets stuck, as shown in the following screenshot.

[Screenshot: stuck]

Tan-YiFan commented 6 months ago

@seungsoo-lee Could you provide the dmesg with all 4 values set? That is, with this setting:

BIOS settings:
SMEE --> Enabled
SEV-ES ASID Space Limit --> 100
SNP Memory Coverage --> Enabled
IOMMU --> Enabled
SEV-SNP support --> Enabled

After booting, dmesg shows the following:

root@ubuntu-h100:~# dmesg | grep -i -e rmp -e sev -e snp
[ 21.301847] ccp 0000:09:00.5: sev enabled
[ 21.543946] ccp 0000:09:00.5: SEV: failed to INIT error 0xe
[ 21.920354] SEV supported: 907 ASIDs
[ 21.920355] SEV-ES supported: 99 ASIDs

If the server is equipped with ipmi/idrac, you could use ipmitool to get the dmesg even if the server gets stuck.

seungsoo-lee commented 6 months ago

@Tan-YiFan


The dmesg output you quoted was captured on the default 5.15 kernel.

But if I install the snp-aware kernel (https://github.com/AMDESE/AMDSEV/tree/sev-snp-devel), booting gets stuck.

In summary,

It seems that the server supports ipmitool, but when booting gets stuck, how can I get the dmesg?

Tan-YiFan commented 6 months ago

@seungsoo-lee Run ipmitool on a remote machine that can reach the server's IPMI interface, then boot the server:

ipmitool -I lanplus -H <ipmi_host_ip or hostname> -U <user> -P <password> sol activate | tee -a boot.log

You will get the boot log on the remote machine.
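If the capture needs to be scripted, the same ipmitool invocation can be wrapped as below. This is only a sketch: it uses exactly the flags from the command above, assumes `ipmitool` is installed on the remote machine, and the function names and the host, user and password values in the usage comment are placeholders.

```python
import subprocess
import sys


def build_sol_command(host, user, password):
    """Build the ipmitool serial-over-LAN command as an argument list."""
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-P", password, "sol", "activate"]


def capture_boot_log(host, user, password, log_path="boot.log"):
    """Stream the SOL console to the terminal and append it to log_path."""
    proc = subprocess.Popen(build_sol_command(host, user, password),
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    with open(log_path, "ab") as log:
        for line in proc.stdout:
            log.write(line)
            log.flush()  # keep the file current in case the host hangs
            sys.stdout.buffer.write(line)
            sys.stdout.buffer.flush()
    return proc.wait()


# Hypothetical usage: start the capture, then power on or reboot the server.
# capture_boot_log("192.0.2.10", "admin", "secret")
```

Flushing after every line means the log on disk still holds the last messages printed before the boot stalls.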

seungsoo-lee commented 6 months ago

@Tan-YiFan @tlendacky

I'm really thankful for your advice.

Actually, I have tried many times to install the NVIDIA drivers and the snp-aware kernel. I suspect there is something wrong with the NVIDIA H100, so I told the vendor, and they will retrieve the workstation and examine it.

After that, I will retry and report back if the same problem comes up.

tlendacky commented 6 months ago

[    0.799716] smp: Bringing up secondary CPUs ...
[    0.799738] x86: Booting SMP configuration:
[    0.799740] .... node  #0, CPUs:          #1   #2   #3   #4   #5   #6   #7   #8   #9  #10  #11  #12  #13  #14  #15  #16  #17  #18  #19  #20  #21  #22  #23
[    0.858249] .... node  #1, CPUs:    #24  #25  #26  #27  #28  #29  #30  #31  #32  #33  #34  #35  #36  #37  #38  #39  #40  #41  #42  #43  #44  #45  #46  #47
[    0.967833] .... node  #0, CPUs:    #48
[   10.967719] smpboot: do_boot_cpu failed(-1) to wakeup CPU#48
[   10.967962]   #49
[   10.970077] Spectre V2 : Update user space SMT mitigation: STIBP always-on
[   10.970077]   #50  #51  #52  #53  #54  #55  #56  #57  #58  #59  #60  #61  #62  #63  #64  #65  #66  #67  #68  #69  #70  #71
[   11.021917] .... node  #1, CPUs:    #72  #73  #74  #75  #76  #77  #78  #79  #80  #81  #82  #83  #84  #85  #86  #87  #88  #89  #90  #91  #92  #93  #94  #95
[   11.079894] smp: Brought up 2 nodes, 95 CPUs

This would explain the issue with a WBINVD not being performed. For some reason CPU #48 fails to come up. Since it doesn't come up, the WBINVD is not performed on it and the SEV firmware sees that and fails the INIT.

I'm not sure why that CPU is failing to boot, but that needs to be corrected first.
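A failure like this is easy to miss in a long boot log. As a rough aid, the sketch below scans dmesg text for the two message formats seen in the excerpt above and compares the online CPU count with the expected one. The expected value of 96 is taken from the earlier estimate in this thread (2 sockets x 24 cores x 2 threads); the function name is made up.

```python
import re

# Patterns modelled on the two dmesg lines quoted in this thread.
FAILED_RE = re.compile(r"do_boot_cpu failed\((-?\d+)\) to wakeup CPU#(\d+)")
BROUGHT_UP_RE = re.compile(r"smp: Brought up (\d+) nodes?, (\d+) CPUs?")


def summarize_smp_bringup(dmesg_text, expected_cpus):
    """Report CPUs that failed to start and how many are missing overall."""
    failed = [int(m.group(2)) for m in FAILED_RE.finditer(dmesg_text)]
    match = BROUGHT_UP_RE.search(dmesg_text)
    online = int(match.group(2)) if match else None
    missing = expected_cpus - online if online is not None else None
    return {"failed_cpus": failed, "online": online, "missing": missing}


sample = """\
[   10.967719] smpboot: do_boot_cpu failed(-1) to wakeup CPU#48
[   11.079894] smp: Brought up 2 nodes, 95 CPUs
"""

print(summarize_smp_bringup(sample, expected_cpus=96))
```

On the two sample lines this reports CPU 48 as failed and one CPU missing out of 96, which matches the diagnosis above: any CPU that stays offline cannot run WBINVD, so the firmware rejects INIT.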