hiroki-chen opened this issue 11 months ago
I started having the same problem with the recent commits.
Commit 4383b822ca00f80734904d23e0c9c046722274c1 still seems to work.
Number of GPUs available : 1
-----------------------------------
Fetching GPU 0 information from GPU driver.
Using the Nonce generated by Local GPU Verifier
VERIFYING GPU : 0
Driver version fetched : 535.104.05
VBIOS version fetched : 96.00.74.00.1c
Validating GPU certificate chains.
GPU attestation report certificate chain validation successful.
The certificate chain revocation status verification successful.
Authenticating attestation report
The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
Driver version fetched from the attestation report : 535.104.05
VBIOS version fetched from the attestation report : 96.00.74.00.1c
Attestation report signature verification successful.
Attestation report verification successful.
Authenticating the RIMs.
Authenticating Driver RIM
Fetching the driver RIM from the RIM service.
RIM Schema validation passed.
driver RIM certificate chain verification successful.
The certificate chain revocation status verification successful.
driver RIM signature verification successful.
Driver RIM verification successful
Authenticating VBIOS RIM.
Fetching the VBIOS RIM from the RIM service.
RIM Schema validation passed.
vbios RIM certificate chain verification successful.
The certificate chain revocation status verification successful.
vbios RIM signature verification successful.
VBIOS RIM verification successful
Comparing measurements (runtime vs golden)
The runtime measurements are not matching with the
golden measurements at the following indexes(starting from 0) :
[
9
]
GPU Ready state is already NOT READY
The verification of GPU 0 resulted in failure.
GPU Attestation failed
@thisiskarthikj It seems your commit broke attestation:
https://github.com/NVIDIA/nvtrust/commit/9ad90fd7fe32b0d6783e60cd76a3c48f4c6eabfc
Thank you for your advice @YurkoWasHere, but I tried the old commit and it didn't work due to RIM cert revocation.
@hiroki-chen
The certs are "revoked" because this tech is still in preview and not meant for production.
Use the --allow_hold_cert parameter to bypass this specific revocation-type check, e.g.:
python3 -m verifier.cc_admin --allow_hold_cert
Thanks! The problem was solved, except that I had to run python3 as sudo.
@hiroki-chen
You can try adding --user_mode for a non-sudo version of the command.
I'm not sure what the difference in attestation is.
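Putting the two flags together, the full command would look like this (just combining the flags already mentioned above; check the verifier's own help output for the authoritative list):
python3 -m verifier.cc_admin --allow_hold_cert --user_mode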
@YurkoWasHere I see. Thank you very much for your help :)
Hi @hiroki-chen ,
I am also trying to do confidential computing with an H100. Similar to yours, my machine has dual AMD EPYC 9224 CPUs and an H100 (running on a GIGABYTE system).
Could you share your BIOS settings for SEV-SNP? I got stuck at the kernel installation phase.
@seungsoo-lee I followed the instructions in the deployment guide from NVIDIA.
The options are listed below.
Advanced -->
AMD CBS ->
CPU Common ->
SEV ASID Count -> 509 ASIDs
SEV-ES ASID space Limit Control -> Manual
SEV-ES ASID space limit -> 100
SNP Memory Coverage -> Enabled
SMEE -> Enabled
NBIO common ->
SEV-SNP Support -> Enabled
IOMMU -> auto
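If it helps, a quick sanity check on the host after setting these options (standard commands, not from the deployment guide; the sysfs parameter name may vary by kernel build):
# Look for SEV-SNP / RMP initialization messages from the kernel
sudo dmesg | grep -i -e sev -e snp -e rmp
# On SNP-enabled host kernels this should report Y/1 when SNP is active in KVM
cat /sys/module/kvm_amd/parameters/sev_snp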
I don't have a v4 AMD, but another project ran into issues with the AMD v4s not working with their stack.
This may be relevant: https://github.com/AMDESE/AMDSEV/tree/snp-latest?tab=readme-ov-file#upgrading-from-519-based-snp-hypervisorhost-kernels
But I don't think incorrect BIOS settings will block you from building and installing the kernel. In my experience, SEV just won't work :)
I would also contact the board manufacturer's support to confirm the BIOS settings for SEV-SNP. Sometimes they need a BIOS upgrade.
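To check the current BIOS version before contacting the vendor (standard dmidecode usage):
sudo dmidecode -s bios-version
sudo dmidecode -s bios-release-date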
@hiroki-chen
In the guide, IOMMU is enabled. Yours is set to auto and it still works? Also, can you let me know which system/BIOS vendor you are using (e.g., Supermicro or GIGABYTE)?
@YurkoWasHere
Do you mean that although the BIOS provides SEV options, SEV won't work?
The BIOS that I'm using provides some SEV-SNP options as follows:
Advanced -->
AMD CBS ->
CPU Common ->
(not provided) SEV ASID Count -> 509 ASIDs
(not provided) SEV-ES ASID space Limit Control -> Manual
SEV-ES ASID space limit -> 100
SNP Memory Coverage -> Enabled
SMEE -> Enabled
NBIO common ->
SEV-SNP Support -> Enabled
IOMMU -> auto
@seungsoo-lee We are using ASUS workstation: https://servers.asus.com/products/servers/gpu-servers/ESC8000A-E12
Interestingly, when we enabled IOMMU, SNP initialization would fail with "TOO LATE TO ENABLE SNP FOR IOMMU".
@hiroki-chen
Have you already tried adding the amd_iommu=on kernel argument to your grub linux line?
linux /vmlinuz-5.19.0-rc6-snp-host-c4daeffce56e root= [.....] amd_iommu=on
(see /boot/grub/grub.cfg)
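A more persistent way than editing grub.cfg directly is to append the argument in /etc/default/grub and regenerate the config (a sketch; adjust to whatever arguments you already have):
$ sudo vim /etc/default/grub
# append amd_iommu=on to the existing arguments, e.g.
GRUB_CMDLINE_LINUX_DEFAULT="... amd_iommu=on"
$ sudo update-grub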
@YurkoWasHere Yes, I tried this before but my system would enter emergency mode immediately after reboot. It was very weird though.
So strange. The full args I'm using are:
mem_encrypt=on kvm_amd.sev=1 kvm_amd.sev-snp=1 amd_kvm.sev-es=1 amd_iommu=on vfio-pci.disable_idle_d3=1
@hiroki-chen @YurkoWasHere According to kernel-parameters.txt, amd_iommu does not take "on" as a value. Maybe we should add iommu=pt instead?
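That would make the grub linux line look roughly like this (a sketch following the suggestion above):
linux /vmlinuz-5.19.0-rc6-snp-host-c4daeffce56e root= [.....] iommu=pt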
If the IOMMU is not enabled, can you still pass an H100 GPU through to the VM?
@Tan-YiFan Yes. For some reason I can pass an H100 through to QEMU if I set IOMMU to auto (perhaps it is BIOS-specific, I guess).
The output of /proc/cmdline:
BOOT_IMAGE=/boot/vmlinuz-5.19.0-rc6-snp-host-c4daeffce56e root=UUID=[...] ro vfio-pci.disable_idle_d3=1
hi @hiroki-chen ,
I also tried to follow the Confidential Computing Deployment Guide provided by NVIDIA.
Now I am stuck on installing the NVIDIA driver on the guest VM.
It says ERROR: Unable to load the kernel module 'nvidia.ko'.
My machine spec seems similar to yours. Based on the document, when we installed the guest VM, its kernel version was 6.2.0-39-generic.
But you say that your guest OS is: Ubuntu 22.04.2 with the 5.19.0-rc6-snp-guest-c4daeffce56e kernel.
How do I find this information in the document? How do I build the guest kernel?
@seungsoo-lee
Thanks for the reply. To build the guest kernel, you can clone this repo at branch sev-snp-devel
and then follow the Preparing to Build the Kernel section to build the kernels. You will find the guest kernel under snp-release-[build date]/linux/guest.
Then launch the CVM, scp the deb packages to it, and install the kernels. Be sure to modify grub so that the 5.19 kernel is selected (see the grub steps after the sketch below).
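A rough sketch of that flow, assuming the repo in question is the AMDESE/AMDSEV repository referenced earlier in this thread (script names and output paths may differ in your checkout; follow its README):
# on the host
git clone -b sev-snp-devel https://github.com/AMDESE/AMDSEV.git
cd AMDSEV
./build.sh --package          # builds the host/guest kernels into snp-release-<date>/
# copy the guest kernel debs into the running CVM and install them
scp snp-release-*/linux/guest/*.deb user@guest-vm:
ssh user@guest-vm 'sudo dpkg -i ./*.deb'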
$ sudo vim /etc/default/grub
GRUB_DEFAULT="1>?"
$ cat /boot/grub/grub.cfg | grep menuentry
menuentry 'Ubuntu' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
submenu 'Advanced options for Ubuntu' $menuentry_id_option 'gnulinux-advanced-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
menuentry 'Ubuntu, with Linux 6.6.0-rc1-snp-host-5a170ce1a082' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.6.0-rc1-snp-host-5a170ce1a082-advanced-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
menuentry 'Ubuntu, with Linux 6.6.0-rc1-snp-host-5a170ce1a082 (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.6.0-rc1-snp-host-5a170ce1a082-recovery-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
menuentry 'Ubuntu, with Linux 6.2.0-39-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.2.0-39-generic-advanced-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
menuentry 'Ubuntu, with Linux 6.2.0-39-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.2.0-39-generic-recovery-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
menuentry 'Ubuntu, with Linux 5.19.0-rc6-snp-host-c4daeffce56e' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.19.0-rc6-snp-host-c4daeffce56e-advanced-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
menuentry 'Ubuntu, with Linux 5.19.0-rc6-snp-host-c4daeffce56e (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.19.0-rc6-snp-host-c4daeffce56e-recovery-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
menuentry 'Ubuntu, with Linux 5.15.0-91-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.15.0-91-generic-advanced-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
menuentry 'Ubuntu, with Linux 5.15.0-91-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.15.0-91-generic-recovery-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
Replace the question mark with the index of the desired kernel entry in the submenu (starting from 0). Then run sudo update-grub and reboot.
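For example, if the listing above were the machine in question, selecting 'Ubuntu, with Linux 5.19.0-rc6-snp-host-c4daeffce56e' (index 4 inside the 'Advanced options' submenu, which itself is top-level entry 1) would be:
GRUB_DEFAULT="1>4"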
Hope this helps.
@hiroki-chen
Thanks for the reply!
Following your advice, I have installed the 5.19 guest kernel on the guest VM.
After that, when I tried the 'Enabling LKCA on the Guest VM' part of the document (p. 26),
sudo update-initramfs -u
says update-initramfs: Generating /boot/initrd.img-6.2.0-39-generic
not the 5.19-snp-guest one.
How about your case? Is that okay?
@seungsoo-lee
By default, this command selects the latest kernel; in your case, that is 6.2.0. You can select the kernel version manually via
sudo update-initramfs -u -k `uname -r`
or simply
sudo update-initramfs -u -k all
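To confirm the right image was regenerated, a quick check (standard commands):
ls -l /boot/initrd.img-*    # an initrd for 5.19.0-rc6-snp-guest should now be present
uname -r                    # shows which kernel you are currently booted into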
@hiroki-chen
Now I have changed the kernel to 5.19-snp-guest as you advised
and also ran sudo update-initramfs -u -k all.
But installing the NVIDIA driver and CUDA failed again. It says:
make[1]: Leaving directory '/usr/src/linux-headers-5.19.0-rc6-snp-guest-c4daeffce56e'
-> done.
-> Kernel module compilation complete.
-> Unable to determine if Secure Boot is enabled: No such file or directory
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
-> Kernel module load error: No such device
-> Kernel messages:
[ 9.059353] audit: type=1400 audit(1704159783.488:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=705 comm="apparmor_parser"
[ 9.059357] audit: type=1400 audit(1704159783.488:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=705 comm="apparmor_parser"
[ 9.060369] audit: type=1400 audit(1704159783.492:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=702 comm="apparmor_parser"
[ 9.060373] audit: type=1400 audit(1704159783.492:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=702 comm="apparmor_parser"
[ 12.607062] loop3: detected capacity change from 0 to 8
[ 12.607267] Dev loop3: unable to read RDB block 8
[ 12.608258] loop3: unable to read partition table
[ 12.608263] loop3: partition table beyond EOD, truncated
[ 13.245756] fbcon: Taking over console
[ 13.299647] Console: switching to colour frame buffer device 128x48
[ 132.090302] nvidia: loading out-of-tree module taints kernel.
[ 132.092232] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 132.124068] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 132.124076] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:2331)
NVRM: installed in this system is not supported by open
NVRM: nvidia.ko because it does not include the required GPU
NVRM: System Processor (GSP).
NVRM: Please see the 'Open Linux Kernel Modules' and 'GSP
NVRM: Firmware' sections in the driver README, available on
NVRM: the Linux graphics driver download page at
NVRM: www.nvidia.com.
[ 137.470645] nvidia: probe of 0000:01:00.0 failed with error -1
[ 137.470765] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 137.470768] NVRM: None of the NVIDIA devices were initialized.
[ 137.471172] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
Do you have any idea?
@seungsoo-lee Which CUDA version are you currently using? Only the 535 driver series is compatible with the H100. If you are already using the correct version, consider removing all CUDA drivers, kernel modules, and other packages, and then re-installing the driver. I once encountered this issue and managed to fix it by re-installing the driver.
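Before purging anything, it can help to see what is currently installed and loaded (standard Ubuntu commands, nothing specific to the deployment guide):
dpkg -l | grep -i nvidia            # driver/CUDA packages installed via apt
lsmod | grep -i nvidia              # NVIDIA kernel modules currently loaded
cat /proc/driver/nvidia/version     # loaded driver version, if the module is up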
@hiroki-chen
I'm a bit confused.
First, should cuda_12.2.1_535.86.10_linux.run be installed on the host before installing it on the guest VM?
Second, if so, which host kernel version should be the target?
By default, the host (Ubuntu 22.04.3 LTS server) kernel version is 5.15, and we also have the 5.19-snp-host kernel.
@seungsoo-lee No, installing the driver on the host is not required. The motivation for installing the driver on the host is to check whether the H100 works fine.
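Even without a host driver, you can at least confirm the GPU is visible from the host (standard lspci usage; 10de:2331 is the H100 PCI ID shown in the logs above):
lspci -nn -d 10de: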
@hiroki-chen
you said 'If you are already using the correct version, then consider removing all CUDA drivers, kernel modules, and other packages and re-install the driver again.'
Please let me know what commands you used.
I have been trying to install the NVIDIA driver on the guest VM all day (remove/reinstall host and guest, and repeat).
Finally, I got this
cclab@guest:~$ sudo sh cuda_12.2.1_535.86.10_linux.run -m=kernel-open
===========
= Summary =
===========
Driver: Installed
Toolkit: Installed in /usr/local/cuda-12.2/
Please make sure that
- PATH includes /usr/local/cuda-12.2/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-12.2/lib64, or, add /usr/local/cuda-12.2/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.2/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log
cclab@guest:~$ sudo nvidia-persistenced
cclab@guest:~$ ps -aux | grep nvidia-persistenced
root 10413 19.8 0.0 5320 1840 ? Ss 11:52 0:04 nvidia-persistenced
cclab 10440 0.0 0.0 6612 2404 pts/0 S+ 11:53 0:00 grep --color=auto nvidia-persistenced
cclab@guest:~$ nvidia-smi conf-compute -f
CC status: ON
cclab@guest:~$ nvidia-smi -q | grep VBIOS
VBIOS Version : 96.00.30.00.01
My procedure is as follows:
installing the host kernel is okay; then, after preparing and launching the guest VM (Ubuntu 22.04.2, as described in the deployment document),
installing the NVIDIA driver succeeds.
HOWEVER, after rebooting the guest VM:
cclab@guest:~$ sudo nvidia-persistenced
nvidia-persistenced failed to initialize. Check syslog for more details.
cclab@guest:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
@seungsoo-lee The VBIOS version should be at least 96.00.5E.00.00.
By the way, could you give the dmesg of the failure case (after rebooting)?
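For example, something like this pulls just the relevant lines (standard dmesg/grep usage):
sudo dmesg | grep -iE 'nvidia|nvrm'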
@seungsoo-lee Some suggestions:
@seungsoo-lee The VBIOS version should be at least 96.00.5E.00.00.
By the way, could you give the dmesg of the failure case (after rebooting)?
..
[ 7.840101] ACPI: \_SB_.PCI0.S20_.S00_: failed to evaluate _DSM
[ 7.840916] nouveau 0000:01:00.0: unknown chipset (badf0200)
[ 8.066947] raid6: avx2x4 gen() 29916 MB/s
...
[ 9.242845] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 9.242851] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:2331)
NVRM: installed in this system is not supported by the
NVRM: NVIDIA 535.86.10 driver release.
NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
NVRM: in this release's README, available on the operating system
NVRM: specific graphics driver download page at www.nvidia.com.
[ 9.248779] nvidia: probe of 0000:01:00.0 failed with error -1
[ 9.248799] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 9.248800] NVRM: None of the NVIDIA devices were initialized.
[ 9.249746] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
- Reset the status of the GPU using host_cc_tools.py on the host
Do you mean I should turn off the CC mode of the H100 on the host? How do I reset it?
@Tan-YiFan
How to upgrade VBIOS to version 96.00.5E.00.00?
@seungsoo-lee
Do you mean I should turn off the CC mode of the H100 on the host? How do I reset it?
python3 gpu_cc_tool.py --gpu-name=H100 --set-cc-mode off --reset-after-cc-mode-switch
and then
python3 gpu_cc_tool.py --gpu-name=H100 --set-cc-mode on --reset-after-cc-mode-switch
Resetting is done with the parameter --reset-after-cc-mode-switch.
How to upgrade VBIOS to version 96.00.5E.00.00?
Refer to https://forums.developer.nvidia.com/t/firmware-update-on-h100-gpu/263934 . This might take weeks. But I suggest first reproducing
cclab@guest:~$ nvidia-smi conf-compute -f => CC status: ON
I did it as you advised, but the result is the same:
cclab@guest:~$ nvidia-smi conf-compute -f
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
@hiroki-chen
you said 'If you are already using the correct version, then consider removing all CUDA drivers, kernel modules, and other packages and re-install the driver again.'
please let me know what commands you used
@seungsoo-lee
You should be able to find an uninstaller script if you installed the driver from the .run installer:
sudo /usr/local/cuda*/cuda-uninstaller
You may also want to remove drivers that come from apt:
sudo apt-get purge nvidia*
sudo apt-get autoremove
sudo apt-get autoclean
sudo rm -rf /usr/local/cuda-*
@Tan-YiFan
How to upgrade VBIOS to version 96.00.5E.00.00?
@seungsoo-lee You need to contact the vendor to do this :(
Hi Yifan,
How did you manage to upgrade the VBIOS? Just being curious about the procedure outside the U.S.
@hiroki-chen
In your case, after installing the driver/CUDA on the guest, can you reboot and still see the driver from the guest?
What is the initial kernel version of your guest VM (Ubuntu 22.04.2)?
How did you manage to upgrade the VBIOS? Just being curious about the procedure outside U.S.
@hiroki-chen I have not managed to upgrade the VBIOS of H100.
@hiroki-chen
- In your case, after installing the driver/CUDA on the guest, can you reboot and still see the driver from the guest?
- What is the initial kernel version of your guest VM (Ubuntu 22.04.2)?
@seungsoo-lee
How did you manage to upgrade the VBIOS? Just being curious about the procedure outside U.S.
@hiroki-chen I have not managed to upgrade the VBIOS of H100.
Thanks for the information!
I will update the VBIOS first and then retry.
@hiroki-chen
Could you let me know whether you modified prepare.sh or launch.sh?
No, I didn't modify either of them.
Really, thanks @hiroki-chen and @Tan-YiFan!
After I updated the VBIOS to 96.00.5E.00.03, everything works fine.
@hiroki-chen
Finally, I hit the problem that you experienced. I ran the attestation command in the virtual Python environment as guided by the doc:
(nvAttest) cclab@guest:/shared/nvtrust/guest_tools/attestation_sdk/tests$ spython ./LocalGPUTest.py
and it says:
Comparing measurements (runtime vs golden)
The runtime measurements are not matching with the
golden measurements at the following indexes(starting from 0) :
[
5,
9,
32,
36,
37
]
With your solution above (using sudo), it still shows the same error output.
Do you have any idea?
I have not tested with the latest commit, but I don't think it's been fixed.
So try using this commit instead of the latest:
https://github.com/NVIDIA/nvtrust/commit/4383b822ca00f80734904d23e0c9c046722274c1
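To pin your checkout to that commit (standard git; assumes a fresh clone):
git clone https://github.com/NVIDIA/nvtrust.git
cd nvtrust
git checkout 4383b822ca00f80734904d23e0c9c046722274c1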
Hi @YurkoWasHere
Do you mean that the gpu_cc_tool.py script should be used when turning CC mode on?
Plus, I wonder whether we should also turn dev-tools mode on when setting CC mode on.
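For reference, switching modes with gpu_cc_tool.py follows the same pattern shown earlier in this thread; a sketch for the devtools mode mentioned below (double-check the accepted mode names against the tool's own help output):
python3 gpu_cc_tool.py --gpu-name=H100 --set-cc-mode devtools --reset-after-cc-mode-switch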
Hi,
Thanks for supporting confidential computing on H100 GPUs! This work is wonderful.
I recently started configuring AMD SEV-SNP with an H100 GPU and tried to run some small demos on my machine. Everything went smoothly except that the attestation validation went awry.
My machine's specs:
CPU: Dual AMD EPYC 9124 16-Core Processor
GPU: H100 10de:2331 (VBIOS: 96.00.74.00.1A, CUDA: 12.2, NVIDIA driver: 535.86.10)
Host OS: Ubuntu 22.04 with 5.19.0-rc6-snp-host-c4daeffce56e kernel
Guest OS: Ubuntu 22.04.2 with 5.19.0-rc6-snp-guest-c4daeffce56e kernel
I tried to run
/attestation_sdk/tests/LocalGPUTest.py
but encountered the following error: x-nv-gpu-measurements-match. The output of the CC mode on the host machine looks like below.
I also tried to set the cc-mode to devtools, but it didn't help.
Do you have any ideas on the error? Any help is more than appreciated!