AMDESE / AMDSEV

AMD Secure Encrypted Virtualization

ccp error on SEV initialization #184

Closed: ya0guang closed this issue 7 months ago

ya0guang commented 9 months ago

Hi,

I'm trying to follow the Deployment Guide for Confidential Computing to set up the environment for GPU TEE with AMD SEV-SNP. However, I experienced an issue when setting up the host environment for SEV-SNP.

I followed the document's host kernel build guide, using the sev-snp-devel branch. The error output is:

[  116.117976] ccp 0000:01:00.5: sev command 0x85 timed out, disabling PSP
[  116.118400] ccp 0000:01:00.5: SEV-SNP: failed to INIT error 0x0
[  116.122736] ccp 0000:01:00.5: SEV: failed to INIT error 0xffffffff, rc -16
[  116.129866] ccp 0000:01:00.5: SEV API:1.55 build:21
[  116.146857] SVM: TSC scaling supported
[  116.146862] kvm: Nested Virtualization enabled
[  116.146863] SVM: kvm: Nested Paging enabled
[  116.146865] SEV supported: 907 ASIDs
[  116.146866] SEV-ES and SEV-SNP supported: 99 ASIDs
[  116.147248] SVM: Virtual VMLOAD VMSAVE supported
[  116.147248] SVM: Virtual GIF supported
[  116.147249] SVM: LBR virtualization supported

It looks like ccp timed out on a specific SEV command. I also tried to rmmod and modprobe ccp; the error output is different now:

[12348.199727] ccp 0000:01:00.5: sev enabled
[12348.199734] ccp 0000:01:00.5: psp enabled
[12348.200201] ccp 0000:83:00.5: psp enabled
[12353.200983] ccp 0000:01:00.5: sev command 0x4 timed out, disabling PSP
[12353.201018] ccp 0000:01:00.5: SEV: failed to get status. Error: 0x0
[12358.969529] SVM: TSC scaling supported
[12358.969533] kvm: Nested Virtualization enabled
[12358.969534] SVM: kvm: Nested Paging enabled
[12358.969536] SEV supported: 907 ASIDs
[12358.969537] SEV-ES and SEV-SNP supported: 99 ASIDs
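For anyone comparing similar logs, the timed-out command and the return code can be pulled out of the two dmesg excerpts above with a short script. This is only a sketch: the command table is limited to the two IDs that appear in this thread (0x85 is named SNP_INIT_EX later in the thread; reading 0x4 as PLATFORM_STATUS is an assumption based on the "failed to get status" message), and the errno table covers only the Linux values seen here. The function name `summarize` is hypothetical.

```python
import re

# Command IDs seen in this thread only. 0x85 is identified as SNP_INIT_EX
# later in the thread; 0x4 as PLATFORM_STATUS is an assumption inferred from
# the "failed to get status" message that follows the timeout.
SEV_COMMANDS = {0x4: "PLATFORM_STATUS (assumed)", 0x85: "SNP_INIT_EX"}

# Linux errno names for the negative return codes that appear in the logs.
LINUX_ERRNO = {16: "EBUSY", 110: "ETIMEDOUT"}

TIMEOUT_RE = re.compile(r"ccp (\S+): sev command (0x[0-9a-fA-F]+) timed out")
RC_RE = re.compile(r"\brc (-\d+)")


def summarize(log_text):
    """Return ([(device, command name)], [errno name]) found in log_text."""
    timeouts = []
    for device, cmd in TIMEOUT_RE.findall(log_text):
        name = SEV_COMMANDS.get(int(cmd, 16), f"unknown ({cmd})")
        timeouts.append((device, name))
    codes = []
    for rc in RC_RE.findall(log_text):
        value = -int(rc)  # the log prints a negative errno, e.g. "rc -16"
        codes.append(LINUX_ERRNO.get(value, f"errno {value}"))
    return timeouts, codes


if __name__ == "__main__":
    sample = (
        "[  116.117976] ccp 0000:01:00.5: sev command 0x85 timed out, disabling PSP\n"
        "[  116.122736] ccp 0000:01:00.5: SEV: failed to INIT error 0xffffffff, rc -16\n"
    )
    print(summarize(sample))
```

Piping `dmesg` output into `summarize` would give the same kind of summary for a live system.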

More information about the server:

kernel version: 5.19.0-rc6-snp-host-c4daeffce56e

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-31,33-63
  Off-line CPU(s) list:  32
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 9124 16-Core Processor
    CPU family:          25
    Model:               17
    Thread(s) per core:  2
    Core(s) per socket:  16
    Socket(s):           2
    Stepping:            1
    Frequency boost:     enabled
    CPU max MHz:         3713.0000
    CPU min MHz:         0.0000
    BogoMIPS:            5991.23
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d sev sev_es sev_snp
Virtualization features: 
  Virtualization:        AMD-V
Caches (sum of all):     
  L1d:                   1 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    32 MiB (32 instances)
  L3:                    128 MiB (8 instances)
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-15,33-47
  NUMA node1 CPU(s):     16-31,48-63
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling
  Srbds:                 Not affected
  Tsx async abort:       Not affected

GPU: NVIDIA H100 PCIe version

Could you please help me with the ccp issue? Thanks!

tlendacky commented 9 months ago

That version/branch of the SNP tree is old and your SEV firmware is relatively new. There were changes to the firmware that required certain support in the CCP driver to be modified. Also, how much memory is installed on this system?

Can you try building an SNP host kernel based on the snp-host-latest branch (https://github.com/AMDESE/linux/tree/snp-host-latest)? This will tell us if the SNP_INIT_EX command works or not on your system. I'm not sure if that will help here, since this branch uses the new gmem support and I don't know if the Nvidia support is compatible or not.

Note, we recommend updating all components (kernels, OVMF and Qemu) to the levels specified in the stable-commits file of the AMDSEV package, snp-latest branch (https://github.com/AMDESE/AMDSEV/tree/snp-latest) and using the launch-qemu.sh script to run the guest or a Qemu command based on that generated by the launch-qemu.sh script. Otherwise, you can have other compatibility issues.

ya0guang commented 9 months ago

Thanks a lot for your support, @tlendacky! The total memory is 376G on our system.

Yes, I found that the SNP tree is old. However, currently, NVIDIA's document specifies that:

● The AMD SEV-SNP tree is continually evolving in sync with the kernel version.
● The only supported AMD SEV SNP branch for HCC use with KVM is the sev-snp-devel branch.
● For GPU support, you will need to apply two patches to the 5.19-rc6 kernel

Because the patches are applied to the sev-snp-devel branch to support GPU TEE, I guess switching to the latest branch may cause some problems.

Nevertheless, I can try to build and boot the latest branch. Hopefully, it will provide more information about the CCP error and help resolve it.

tlendacky commented 9 months ago

> For GPU support, you will need to apply two patches to the 5.19-rc6 kernel

These patches are related to IOMMU page faults when doing device passthrough on that level of the kernel, which I believe are eliminated/unnecessary under the gmem version of the kernel. Worth a shot to try out the new versions and see how it goes.

zvonkok commented 9 months ago

@tlendacky AFAIK those patches are available in the 6.3.x version of the kernel; 6.2 is still missing those two. The snp-latest branch is at 6.2.x something, right?

@ya0guang You should make sure that you're using all the SW components from the sev-snp-devel branch and firstly make sure your host is working.

zvonkok commented 9 months ago

@ya0guang Did you check your BIOS settings and have the latest and greatest flashed?

ya0guang commented 9 months ago

@zvonkok TLDR: Yes, it's the latest, and SEV BIOS options are set as specified in the README.

More details

I'm using an ASUS ESC8000A-E12 server, and it has already been updated to the latest BIOS (0803) and firmware (1.1.38).

However, it's unclear to me if the BIOS has support for the SNP/ccp functionalities, as it was built on 05/26/2023. Please kindly let me know if there is any further information I can provide.

[Screenshot of the server's control panel, 2023-09-19]

[Screenshot of the server's support website, 2023-09-19]

menonsamir commented 9 months ago

> @tlendacky AFAIK those patches are available in the 6.3.x version of the kernel; 6.2 is still missing those two. The snp-latest branch is at 6.2.x something, right?

FWIW, the kernel being used in snp-latest seems to be 6.5-rc2, and it has the changes to the kernel that were included as patches provided in the deployment guide (see here). I haven't actually tested with the NVIDIA driver yet, though.

hiroki-chen commented 7 months ago

> That version/branch of the SNP tree is old and your SEV firmware is relatively new. There were changes to the firmware that required certain support in the CCP driver to be modified. Also, how much memory is installed on this system?
>
> Can you try building an SNP host kernel based on the snp-host-latest branch (https://github.com/AMDESE/linux/tree/snp-host-latest)? This will tell us if the SNP_INIT_EX command works or not on your system. I'm not sure if that will help here, since this branch uses the new gmem support and I don't know if the Nvidia support is compatible or not.
>
> Note, we recommend updating all components (kernels, OVMF and Qemu) to the levels specified in the stable-commits file of the AMDSEV package, snp-latest branch (https://github.com/AMDESE/AMDSEV/tree/snp-latest) and using the launch-qemu.sh script to run the guest or a Qemu command based on that generated by the launch-qemu.sh script. Otherwise, you can have other compatibility issues.

Unfortunately, we tested the latest kernel (from the snp-latest branch), but the ccp driver still timed out when sending 0x85 (the SNP_INIT_EX command).

[  126.228774] kernel: ccp 0000:01:00.5: sev command 0x85 timed out, disabling PSP
[  126.229894] kernel: ccp 0000:01:00.5: SEV-SNP: failed to INIT rc -110, error 0x0
[  126.230962] kernel: ccp 0000:01:00.5: SEV: failed to INIT error 0xffffffff, rc -16
[  126.231923] kernel: ccp 0000:01:00.5: SEV API:1.55 build:5

$ uname -a
Linux gputee 6.6.0-rc1-snp-host-5a170ce1a082 #2 SMP Tue Dec  5 16:02:36 EST 2023 x86_64 x86_64 x86_64 GNU/Linux

ya0guang commented 7 months ago

The problem was solved by changing IOMMU from enabled to auto. FYI, we're using an ASUS server, and the problem originates from this line in the kernel.

It seems like IOMMU must be configured after SNP on some mainboards.

Thanks a lot for your help!
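
For readers who hit the same timeout and try the IOMMU change described above, one way to check the result after a reboot is to read the kvm_amd module parameters. This is a minimal sketch, assuming the running kernel exposes `sev`, `sev_es` and `sev_snp` under `/sys/module/kvm_amd/parameters`; that path and the helper name `read_sev_params` are assumptions, not something confirmed in this thread.

```python
from pathlib import Path

# Assumed location of the kvm_amd module parameters; adjust if your kernel
# exposes them elsewhere.
DEFAULT_BASE = "/sys/module/kvm_amd/parameters"


def read_sev_params(base=DEFAULT_BASE):
    """Return a dict of parameter name to its value, or None if the file is absent."""
    result = {}
    for name in ("sev", "sev_es", "sev_snp"):
        path = Path(base) / name
        # A missing file means the module is not loaded or the parameter does
        # not exist on this kernel; report None rather than raising.
        result[name] = path.read_text().strip() if path.is_file() else None
    return result


if __name__ == "__main__":
    for name, value in read_sev_params().items():
        print(f"{name}: {value if value is not None else 'not available'}")
```

A value of `None` for every entry usually just means kvm_amd is not loaded, so it is worth checking the ccp lines in dmesg as well.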