NVIDIA / nvtrust

Ancillary open source software to support confidential computing on NVIDIA GPUs
Apache License 2.0

H100+TDX: The runtime measurements are not matching with the golden measurements at the following indexes #60

Open hedj17 opened 3 months ago

hedj17 commented 3 months ago

I previously encountered issue #58 , but it was properly resolved by refreshing the GPU firmware and upgrading the VBIOS version. When I attempted to authenticate the GPU by running python3 -m verifier.cc_admin --allow_hold_cert, the following issue occurred.

Number of GPUs available : 1
-----------------------------------
Fetching GPU 0 information from GPU driver.
Using the Nonce generated by Local GPU Verifier
VERIFYING GPU : 0
        Driver version fetched : 550.90.07
        VBIOS version fetched : 96.00.74.00.1c
        Validating GPU certificate chains.
                The firmware ID in the device certificate chain is matching with the one in the attestation report.
                GPU attestation report certificate chain validation successful.
                        The certificate chain revocation status verification successful.
        Authenticating attestation report
                The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
                Driver version fetched from the attestation report : 550.90.07
                VBIOS version fetched from the attestation report : 96.00.74.00.1c
                Attestation report signature verification successful.
                Attestation report verification successful.
        Authenticating the RIMs.
                Authenticating Driver RIM
                        Fetching the driver RIM from the RIM service.
                        RIM Schema validation passed.
                        driver RIM certificate chain verification successful.
                        The certificate chain revocation status verification successful.
                        driver RIM signature verification successful.
                        Driver RIM verification successful
                Authenticating VBIOS RIM.
                        Fetching the VBIOS RIM from the RIM service.
                        RIM Schema validation passed.
                        vbios RIM certificate chain verification successful.
                        The certificate chain revocation status verification successful.
                        vbios RIM signature verification successful.
                        VBIOS RIM verification successful
        Comparing measurements (runtime vs golden)
                        The runtime measurements are not matching with the
                        golden measurements at the following indexes(starting from 0) :
                        [
                        1,
                        2,
                        3,
                        9,
                        21,
                        22,
                        31,
                        34
                        ]
        GPU Ready state is already NOT READY
The verification of GPU 0 resulted in failure.
        GPU Attestation failed
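For context, the index list in the output above is produced by comparing each runtime measurement from the attestation report against its golden counterpart from the RIMs, position by position. A minimal sketch of that comparison (not the nvtrust implementation, just an illustration of what the report means; the measurement values are hypothetical):

```python
def mismatched_indexes(runtime, golden):
    """Return the indexes (starting from 0) where measurements differ."""
    return [i for i, (r, g) in enumerate(zip(runtime, golden)) if r != g]

# Toy example with hypothetical hex-string "measurements":
runtime = ["aa", "bb", "cc", "dd"]
golden  = ["aa", "xx", "cc", "yy"]
print(mismatched_indexes(runtime, golden))  # -> [1, 3]
```

A mismatch at several indexes like this usually means the driver/VBIOS actually running does not match the versions the fetched RIMs describe.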

When I ran nvidia-smi conf-compute -srs 1 to set the GPU CC ready state, I got the following message: Failed to set Conf. Compute GPUs Ready State: Invalid Argument

I also checked nvidia-persistenced.service. When I ran the systemctl status nvidia-persistenced.service command, I received the following output.

nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static)
     Active: active (running) since Fri 2024-07-05 15:51:58 UTC; 16h ago
    Process: 876 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --uvm-persistence-mode --verbose (code=exited, status=0/SUCCESS)
   Main PID: 880 (nvidia-persiste)
      Tasks: 1 (limit: 37233)
     Memory: 36.8M
        CPU: 6.167s
     CGroup: /system.slice/nvidia-persistenced.service
             └─880 /usr/bin/nvidia-persistenced --user nvidia-persistenced --uvm-persistence-mode --verbose

Jul 05 15:51:51 ubuntu systemd[1]: Starting NVIDIA Persistence Daemon...
Jul 05 15:51:51 ubuntu nvidia-persistenced[880]: Verbose syslog connection opened
Jul 05 15:51:51 ubuntu nvidia-persistenced[880]: Now running with user ID 113 and group ID 121
Jul 05 15:51:51 ubuntu nvidia-persistenced[880]: Started (880)
Jul 05 15:51:51 ubuntu nvidia-persistenced[880]: device 0000:00:01.0 - registered
Jul 05 15:51:58 ubuntu nvidia-persistenced[880]: device 0000:00:01.0 - Failed to enable UVM Persistence mode: 0x40
Jul 05 15:51:58 ubuntu nvidia-persistenced[880]: device 0000:00:01.0 - persistence mode enabled.
Jul 05 15:51:58 ubuntu nvidia-persistenced[880]: device 0000:00:01.0 - NUMA memory onlined.

In addition, when I ran nvidia-persistenced manually, it failed with: nvidia-persistenced failed to initialize. Check syslog for more details

I checked the kernel messages; the relevant errors are:

[   14.687406] nvidia-uvm: Loaded the UVM driver, major device number 234.
[   15.071292] audit: type=1400 audit(1720194710.286:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=775 comm="apparmor_parser"
[   15.071520] audit: type=1400 audit(1720194710.286:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=780 comm="apparmor_parser"
[   15.071524] audit: type=1400 audit(1720194710.286:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=780 comm="apparmor_parser"
[   15.071528] audit: type=1400 audit(1720194710.286:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=780 comm="apparmor_parser"
[   15.071600] audit: type=1400 audit(1720194710.286:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=776 comm="apparmor_parser"
[   15.071622] audit: type=1400 audit(1720194710.286:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=776 comm="apparmor_parser"
[   15.090168] audit: type=1400 audit(1720194710.302:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=782 comm="apparmor_parser"
[   15.090174] audit: type=1400 audit(1720194710.302:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=782 comm="apparmor_parser"
[   15.127555] audit: type=1400 audit(1720194710.342:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="ubuntu_pro_apt_news" pid=778 comm="apparmor_parser"
[   15.128211] audit: type=1400 audit(1720194710.342:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="tcpdump" pid=781 comm="apparmor_parser"
[   17.664421] openvswitch: Open vSwitch switching datapath
[   20.239803] loop3: detected capacity change from 0 to 8
[   22.667424] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:1967
[   22.667433] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:1967
[   23.268539] NVRM: calculatePCIELinkRateMBps: Unknown PCIe speed
[   23.268545] NVRM: getPCIELinkRateMBps: getPCIELinkRateMBps:1778: Generic Error: Invalid state [NV_ERR_INVALID_STATE]
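When triaging a log like the one above, it can help to pull out only the lines emitted by the NVIDIA kernel driver (tagged NVRM:) and ignore the unrelated AppArmor/audit noise. A small generic filtering sketch (not an NVIDIA tool):

```python
def nvrm_lines(dmesg_lines):
    """Keep only kernel log lines emitted by the NVIDIA driver (NVRM)."""
    return [line for line in dmesg_lines if "NVRM:" in line]

# Sample lines taken from the log above:
log = [
    "[   17.664421] openvswitch: Open vSwitch switching datapath",
    "[   22.667424] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:1967",
    "[   23.268539] NVRM: calculatePCIELinkRateMBps: Unknown PCIe speed",
]
for line in nvrm_lines(log):
    print(line)
```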

Other potentially relevant information follows.

The script I use to bind the GPU to vfio-pci is:

#!/bin/bash
set -x

# Unload the NVIDIA kernel modules so the GPU can be released.
rmmod nvidia_drm
rmmod nvidia_modeset
rmmod nvidia_uvm
rmmod nvidia

# Show which driver currently claims the NVIDIA device.
lspci -d 10de: -k

# Detach the GPU from the nvidia driver and hand it to vfio-pci.
echo 0000:99:00.0 > /sys/bus/pci/drivers/nvidia/unbind
echo vfio-pci > /sys/bus/pci/devices/0000\:99\:00.0/driver_override
echo 0000:99:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
echo > /sys/bus/pci/devices/0000\:99\:00.0/driver_override

# Raise the VFIO IOMMU DMA mapping limit for large guest memory.
echo 1048576 > /sys/module/vfio_iommu_type1/parameters/dma_entry_limit

lspci -d 10de: -k
hiroki-chen commented 3 months ago

This is an old issue perhaps due to bugs in their implementation. See also #28.

hedj17 commented 1 month ago

The failure to boot was caused by not correctly adding OVMF to the virtual machine's XML file.

hedj17 commented 3 weeks ago

You can add the following <os> block to the guest's libvirt XML:

<os>
  <type arch='x86_64' machine='q35'>hvm</type>
  <loader>/usr/share/qemu/OVMF.fd</loader>
  <boot dev='hd'/>
</os>
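For distros that ship split OVMF images instead of a single OVMF.fd, the equivalent pflash-style stanza would look roughly like the fragment below. The paths and the per-guest nvram file are assumptions; check what your distro actually installs, and note that your TDX-enabled QEMU/OVMF build may require the single-file loader form instead:

```xml
<os>
  <type arch='x86_64' machine='q35'>hvm</type>
  <!-- Assumed paths: verify against your installed OVMF package. -->
  <loader readonly='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE.fd</loader>
  <nvram template='/usr/share/OVMF/OVMF_VARS.fd'>/var/lib/libvirt/qemu/nvram/guest_VARS.fd</nvram>
  <boot dev='hd'/>
</os>
```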