Closed: ya0guang closed this issue 7 months ago
That version/branch of the SNP tree is old and your SEV firmware is relatively new. There were changes to the firmware that required certain support in the CCP driver to be modified. Also, how much memory is installed on this system?
Can you try building an SNP host kernel based on the snp-host-latest branch (https://github.com/AMDESE/linux/tree/snp-host-latest)? This will tell us whether the SNP_INIT_EX command works on your system. I'm not sure if that will help here, since this branch uses the new gmem support and I don't know if the NVIDIA support is compatible with it.
Note, we recommend updating all components (kernels, OVMF and Qemu) to the levels specified in the stable-commits file of the AMDSEV package, snp-latest branch (https://github.com/AMDESE/AMDSEV/tree/snp-latest) and using the launch-qemu.sh script to run the guest or a Qemu command based on that generated by the launch-qemu.sh script. Otherwise, you can have other compatibility issues.
Thanks a lot for your support, @tlendacky! The total memory is 376G on our system.
Yes, I found that the SNP tree is old. However, currently, NVIDIA's document specifies that:
● The AMD SEV-SNP tree is continually evolving in sync with the kernel version.
● The only supported AMD SEV SNP branch for HCC use with KVM is the sev-snp-devel branch.
● For GPU support, you will need to apply two patches to the 5.19-rc6 kernel
Because the patches are applied to the sev-snp-devel branch to support GPU TEE, I guess switching to the latest branch may cause some problems.
Nevertheless, I can try to build and boot the latest branch. Hopefully, it will provide more information that helps resolve the current CCP error.
For GPU support, you will need to apply two patches to the 5.19-rc6 kernel
These patches are related to IOMMU page faults when doing device passthrough on that level of the kernel, which I believe are eliminated/unnecessary under the gmem version of the kernel. Worth a shot to try out the new versions and see how it goes.
@tlendacky AFAIK those patches are available in the 6.3.x kernels; 6.2 is still missing those two. The snp-latest branch is at 6.2.x or so, right?
@ya0guang You should make sure that you're using all the SW components from the sev-snp-devel branch, and first make sure your host is working.
@ya0guang Did you check your BIOS settings and have the latest and greatest flashed?
@zvonkok TLDR: Yes, it's the latest, and SEV BIOS options are set as specified in the README.
I'm using an ASUS ESC8000A-E12 server, which has already been updated to the latest BIOS (0803) and firmware (1.1.38).
However, it's unclear to me if the BIOS has support for the SNP/ccp functionalities, as it was built on 05/26/2023. Please kindly let me know if there is any further information I can provide.
@tlendacky AFAIK those patches are available in the 6.3.x kernels; 6.2 is still missing those two. The snp-latest branch is at 6.2.x or so, right?
FWIW, the kernel being used in snp-latest seems to be 6.5-rc2, and it has the changes to the kernel that were included as patches provided in the deployment guide (see here). I haven't actually tested with the NVIDIA driver yet, though.
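As a rough illustration of the version reasoning above, here is a minimal Python sketch that extracts the major and minor version from a kernel release string and compares it against 6.3. The function names and the `PATCHES_MERGED_IN` constant are hypothetical, and the 6.3 threshold is only the claim made in this thread, not something verified against the upstream history. Vendor trees may also backport or carry patches separately, so treat the result as a hint at most.

```python
import re

# Assumption taken from this thread: the two device-passthrough patches
# are present from the 6.3.x kernels onward. Not verified upstream.
PATCHES_MERGED_IN = (6, 3)


def kernel_major_minor(release):
    """Extract (major, minor) from a release string such as '6.5-rc2'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def likely_has_patches(release):
    """Return True if the release is at or above the assumed threshold."""
    return kernel_major_minor(release) >= PATCHES_MERGED_IN


if __name__ == "__main__":
    # Release strings that appear in this thread.
    for release in ("5.19.0-rc6-snp-host-c4daeffce56e", "6.5-rc2"):
        print(release, likely_has_patches(release))
```

On a live host, the output of `uname -r` could be passed to `likely_has_patches` instead of the hard-coded strings.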
Unfortunately, we tested the latest kernel (from the snp-latest branch), but the ccp driver still timed out when sending 0x85 (the SNP_INIT_EX command).
[ 126.228774] kernel: ccp 0000:01:00.5: sev command 0x85 timed out, disabling PSP
[ 126.229894] kernel: ccp 0000:01:00.5: SEV-SNP: failed to INIT rc -110, error 0x0
[ 126.230962] kernel: ccp 0000:01:00.5: SEV: failed to INIT error 0xffffffff, rc -16
[ 126.231923] kernel: ccp 0000:01:00.5: SEV API:1.55 build:5
$ uname -a
Linux gputee 6.6.0-rc1-snp-host-5a170ce1a082 #2 SMP Tue Dec 5 16:02:36 EST 2023 x86_64 x86_64 x86_64 GNU/Linux
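For anyone triaging similar output, the following is a minimal Python sketch that pulls the timed-out SEV command and the INIT return codes out of ccp log lines like the ones above. The regular expressions are written only against the message formats shown in this thread, and the helper name `summarize_ccp_log` is hypothetical. The errno names are the standard Linux values (110 is ETIMEDOUT, 16 is EBUSY), hard-coded so the sketch does not depend on the platform it runs on.

```python
import re

# Standard Linux errno names for the negated rc values seen in the log.
ERRNO_NAMES = {16: "EBUSY", 110: "ETIMEDOUT"}

# Patterns derived only from the log lines quoted in this thread.
TIMEOUT_RE = re.compile(
    r"ccp (?P<dev>\S+): sev command (?P<cmd>0x[0-9a-fA-F]+) timed out"
)
INIT_RE = re.compile(
    r"ccp (?P<dev>\S+): (?P<what>SEV(?:-SNP)?): failed to INIT .*?rc (?P<rc>-?\d+)"
)


def summarize_ccp_log(lines):
    """Collect timed-out SEV commands and failed INIT return codes."""
    summary = {"timed_out": [], "init_failures": []}
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            command = int(match.group("cmd"), 16)
            summary["timed_out"].append((match.group("dev"), command))
            continue
        match = INIT_RE.search(line)
        if match:
            rc = int(match.group("rc"))
            name = ERRNO_NAMES.get(-rc, "unknown")
            summary["init_failures"].append((match.group("what"), rc, name))
    return summary


if __name__ == "__main__":
    sample = [
        "[ 126.228774] kernel: ccp 0000:01:00.5: sev command 0x85 timed out, disabling PSP",
        "[ 126.229894] kernel: ccp 0000:01:00.5: SEV-SNP: failed to INIT rc -110, error 0x0",
        "[ 126.230962] kernel: ccp 0000:01:00.5: SEV: failed to INIT error 0xffffffff, rc -16",
    ]
    print(summarize_ccp_log(sample))
```

Read this way, the log says that command 0x85 timed out, SNP initialization then failed with a timeout error, and the subsequent SEV initialization failed with a busy error, which is consistent with the PSP having been disabled after the timeout.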
The problem is solved by changing IOMMU from enabled to auto.
FYI, we're using an ASUS server, and the problem originates from this line in the kernel.
It seems like IOMMU must be configured after SNP on some mainboards.
Thanks a lot for your help!
Hi,
I'm trying to follow the Deployment Guide for Confidential Computing to set up the environment for GPU TEE with AMD SEV-SNP. However, I experienced an issue when setting up the host environment for SEV-SNP.
I followed the document's host kernel building guide, using the sev-snp-devel branch. The error output is:
It looks like ccp timed out at a specific SEV command. I also tried to rmmod and modprobe ccp; the error output is different now:
More information about the server:
kernel version: 5.19.0-rc6-snp-host-c4daeffce56e
GPU: NVIDIA H100 PCIe version
Could you please help me with the ccp issue? Thanks!