hiroki-chen commented 11 months ago

Hi,

Thanks for supporting confidential computing on H100 GPUs! This work is wonderful.

I recently started configuring AMD-SEV-SNP with H100 GPU and tried to do some small demos on my machine. Everything went on smoothly except that the attestation validation went awry.

My machine's specs:

CPU: Dual AMD EPYC 9124 16-Core Processor GPU: H100 10de:2331 (vbios: 96.00.74.00.1A cuda: 12.2 nvidia driver: 535.86.10) Host OS: Ubuntu 22.04 with 5.19.0-rc6-snp-host-c4daeffce56e kernel Guest OS: Ubuntu 22.04.2 with 5.19.0-rc6-snp-guest-c4daeffce56e kernel

I tried to run /attestation_sdk/tests/LocalGPUTest.py but encountered the following error:

h100@h100-cvm:/shared/nvtrust/guest_tools/attestation_sdk/tests$ python3 ./LocalGPUTest.py 
[LocalGPUTest] node name : thisNode1
[['LOCAL_GPU_CLAIMS', <Devices.GPU: 2>, <Environment.LOCAL: 2>, '', '', '']]
[LocalGPUTest] call attest() - expecting True
Number of GPUs available : 1
-----------------------------------
Fetching GPU 0 information from GPU driver.
Using the Nonce specified by user
VERIFYING GPU : 0
        Driver version fetched : 535.86.10
        VBIOS version fetched : 96.00.74.00.1a
        Validating GPU certificate chains.
                GPU attestation report certificate chain validation successful.
                        The certificate chain revocation status verification successful.
        Authenticating attestation report
                The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
                Driver version fetched from the attestation report : 535.86.10
                VBIOS version fetched from the attestation report : 96.00.74.00.1a
                Attestation report signature verification successful.
                Attestation report verification successful.
        Authenticating the RIMs.
                Authenticating Driver RIM
                        Fetching the driver RIM from the RIM service.
                        RIM Schema validation passed.
                        driver RIM certificate chain verification successful.
                        The certificate chain revocation status verification successful.
                        driver RIM signature verification successful.
                        Driver RIM verification successful
                Authenticating VBIOS RIM.
                        Fetching the VBIOS RIM from the RIM service.
                        RIM Schema validation passed.
                        vbios RIM certificate chain verification successful.
                        The certificate chain revocation status verification successful.
                        vbios RIM signature verification successful.
                        VBIOS RIM verification successful
        Comparing measurements (runtime vs golden)
                        The runtime measurements are not matching with the
                        golden measurements at the following indexes(starting from 0) :
                        [
                        9
                        ]
The verification of GPU 0 resulted in failure.
        GPU Attestation failed
False
[LocalGPUTest] token : [["JWT", "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJOVi1BdHRlc3RhdGlvbi1TREsiLCJpYXQiOjE3MDIwMDUxNTcsImV4cCI6bnVsbH0._J81r7wl6FiVF3uxZL5mKeKuOWPxBsb6-zgdpZ5TJdA"], {"LOCAL_GPU_CLAIMS": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ4LW52LWdwdS1hdmFpbGFibGUiOnRydWUsIngtbnYtZ3B1LWF0dGVzdGF0aW9uLXJlcG9ydC1hdmFpbGFibGUiOnRydWUsIngtbnYtZ3B1LWluZm8tZmV0Y2hlZCI6dHJ1ZSwieC1udi1ncHUtYXJjaC1jaGVjayI6dHJ1ZSwieC1udi1ncHUtcm9vdC1jZXJ0LWF2YWlsYWJsZSI6dHJ1ZSwieC1udi1ncHUtY2VydC1jaGFpbi12ZXJpZmllZCI6dHJ1ZSwieC1udi1ncHUtb2NzcC1jZXJ0LWNoYWluLXZlcmlmaWVkIjp0cnVlLCJ4LW52LWdwdS1vY3NwLXNpZ25hdHVyZS12ZXJpZmllZCI6dHJ1ZSwieC1udi1ncHUtY2VydC1vY3NwLW5vbmNlLW1hdGNoIjp0cnVlLCJ4LW52LWdwdS1jZXJ0LWNoZWNrLWNvbXBsZXRlIjp0cnVlLCJ4LW52LWdwdS1tZWFzdXJlbWVudC1hdmFpbGFibGUiOnRydWUsIngtbnYtZ3B1LWF0dGVzdGF0aW9uLXJlcG9ydC1wYXJzZWQiOnRydWUsIngtbnYtZ3B1LW5vbmNlLW1hdGNoIjp0cnVlLCJ4LW52LWdwdS1hdHRlc3RhdGlvbi1yZXBvcnQtZHJpdmVyLXZlcnNpb24tbWF0Y2giOnRydWUsIngtbnYtZ3B1LWF0dGVzdGF0aW9uLXJlcG9ydC12Ymlvcy12ZXJzaW9uLW1hdGNoIjp0cnVlLCJ4LW52LWdwdS1hdHRlc3RhdGlvbi1yZXBvcnQtdmVyaWZpZWQiOnRydWUsIngtbnYtZ3B1LWRyaXZlci1yaW0tc2NoZW1hLWZldGNoZWQiOnRydWUsIngtbnYtZ3B1LWRyaXZlci1yaW0tc2NoZW1hLXZhbGlkYXRlZCI6dHJ1ZSwieC1udi1ncHUtZHJpdmVyLXJpbS1jZXJ0LWV4dHJhY3RlZCI6dHJ1ZSwieC1udi1ncHUtZHJpdmVyLXJpbS1zaWduYXR1cmUtdmVyaWZpZWQiOnRydWUsIngtbnYtZ3B1LWRyaXZlci1yaW0tZHJpdmVyLW1lYXN1cmVtZW50cy1hdmFpbGFibGUiOnRydWUsIngtbnYtZ3B1LWRyaXZlci12Ymlvcy1yaW0tZmV0Y2hlZCI6dHJ1ZSwieC1udi1ncHUtdmJpb3MtcmltLXNjaGVtYS12YWxpZGF0ZWQiOnRydWUsIngtbnYtZ3B1LXZiaW9zLXJpbS1jZXJ0LWV4dHJhY3RlZCI6dHJ1ZSwieC1udi1ncHUtdmJpb3MtcmltLXNpZ25hdHVyZS12ZXJpZmllZCI6dHJ1ZSwieC1udi1ncHUtdmJpb3MtcmltLWRyaXZlci1tZWFzdXJlbWVudHMtYXZhaWxhYmxlIjp0cnVlLCJ4LW52LWdwdS12Ymlvcy1pbmRleC1uby1jb25mbGljdCI6dHJ1ZSwieC1udi1ncHUtbWVhc3VyZW1lbnRzLW1hdGNoIjpmYWxzZSwieC1udi1ncHUtdXVpZCI6IkdQVS1kNDNlYWM4Zi02MzExLTk1ZTgtYjI4ZS04ZGE1ZmQ1ZTE4MGIifQ.g0ktblAfvDGsfaMuFMxn8MJb3KZPK-7fyoZWrBIuSuY"}]
[LocalGPUTest] call validate_token() - expecting True
        [ERROR] Invalid token. Authorized claims does not match the appraisal policy:  x-nv-gpu-measurements-match
False

The error is x-nv-gpu-measurements-match with

        Comparing measurements (runtime vs golden)
                        The runtime measurements are not matching with the
                        golden measurements at the following indexes(starting from 0) :
                        [
                        9
                        ]

The output of the CC mode on the host machine looks like below.

$ sudo python3 gpu_cc_tool.py --gpu-name=H100 --query-cc-settings

NVIDIA GPU Tools version 535.86.06
Topo:
  PCI 0000:60:01.1 0x1022:0x14ab
   GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x8c042000000
2023-12-07,22:13:09.865 INFO     Selected GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x8c042000000
2023-12-07,22:13:09.865 WARNING  GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x8c042000000 has CC mode on, some functionality may not work
2023-12-07,22:13:09.936 INFO     GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x8c042000000 CC settings:
2023-12-07,22:13:09.937 INFO       enable = 1
2023-12-07,22:13:09.937 INFO       enable-devtools = 0
2023-12-07,22:13:09.937 INFO       enable-allow-inband-control = 1
2023-12-07,22:13:09.937 INFO       enable-devtools-allow-inband-control = 1

I also tried to set the cc-mode to devtools but it didn't help.

Do you have any ideas on the error? Any help is more than appreciated!

YurkoWasHere commented 10 months ago

I found that when using this command line with the latest commit to enable the nvidia inside the SEV enclave

(I think this is similar to what your doing with LocalGPUTest.py) python3 -m verifier.cc_admin --allow_hold_cert

I started to get an error like yours:

            The runtime measurements are not matching with the
                        golden measurements at the following indexes(starting from 0) :
            [
            9
            ]

Rolling back to the commit I mentioned corrected this problem.

also dev-tools should not be on.

YurkoWasHere commented 10 months ago

To enable CC i use this line python3 gpu_cc_tool.py --gpu-name H100-PCIE --reset-after-cc-mode-switch --set-cc-mode=on

However on my system I been having issues initially with it. Suggestion was to use sysfs instead of devmem but the command line argument seemed to be broken.

My work around was to edit

/shared/nvtrust/host_tools/python/gpu_cc_tool.py

and replace this line

mmio_access_type = "devmem"

with this line

mmio_access_type = "sysfs"

then run gpu_cc_tool.py as above

seungsoo-lee commented 10 months ago

@YurkoWasHere

in my case, with dev-tools off, LocalGPUTest.py outputs true (no errors). then, is it okay to use the latest commit?

seungsoo-lee commented 10 months ago

@YurkoWasHere @hiroki-chen

Did you guys run

nvidia-kata-manager-jvlrq
nvidia-sandbox-device-plugin-daemonset-pq87l
nvidia-sandbox-validator-ppfck
nvidia-vfio-manager-rplgt

successfuly in your k8s cluster ?

hiroki-chen commented 10 months ago

@YurkoWasHere @hiroki-chen

Did you guys run
nvidia-kata-manager-jvlrq
nvidia-sandbox-device-plugin-daemonset-pq87l
nvidia-sandbox-validator-ppfck
nvidia-vfio-manager-rplgt
successfuly in your k8s cluster ?

Hi @seungsoo-lee, sadly I didn't try with K8S cluster.

hiroki-chen commented 10 months ago

@YurkoWasHere

in my case, with dev-tools off, LocalGPUTest.py outputs true (no errors). then, is it okay to use the latest commit?

@seungsoo-lee If you do not care about the bugs introduced by the latest commit, you can then just use it as long as the stable commit shows the correct attestation result.

hiroki-chen commented 10 months ago

@YurkoWasHere @seungsoo-lee

A very interesting thing I recently found was that when I tried to attest one of the H100 GPUs on my host inside the VM, the SDK V1.2.0 worked fine but v1.1.0 would fail whereas I installed only one GPU, SDK v1.2.0 would report the error but v1.1.0 worked fine.

$ lspci -d 10de: 
41:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)
61:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)

# 1.2.0: OK 1.1.0: Fail

$ lspci -d 10de: 
61:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)

# 1.2.0: Fail 1.1.0: OK

seungsoo-lee commented 10 months ago

@hiroki-chen

If you do not try with k8s, how to do confidential computing workloads/examples?

seungsoo-lee commented 10 months ago

A very interesting thing I recently found was that when I tried to attest one of the H100 GPUs on my host inside the VM, the SDK V1.2.0 worked fine but v1.1.0 would fail whereas I installed only one GPU, SDK v1.2.0 would report the error but v1.1.0 worked fine.

what command did you try?

hiroki-chen commented 10 months ago

A very interesting thing I recently found was that when I tried to attest one of the H100 GPUs on my host inside the VM, the SDK V1.2.0 worked fine but v1.1.0 would fail whereas I installed only one GPU, SDK v1.2.0 would report the error but v1.1.0 worked fine.

what command did you try?

python3 -m verifier.cc_admin --user_mode --allow_hold_cert

hiroki-chen commented 10 months ago

@hiroki-chen

If you do not try with k8s, how to do confidential computing workloads/examples?

I just tried to run PyTorch examples inside the VM. If one performs ML tasks successfully then confidential computing functionalities are enabled.

YurkoWasHere commented 10 months ago

We use pytorch as well directory not in a container. Containerized work loads will be the next step.

On the guest you can check the gpu is in the correct state by running nvidia-smi conf-compute -grs

# nvidia-smi conf-compute -grs

Confidential Compute GPUs Ready state: ready

You can also force the GPU into a ready state without running the attestation by using -srs, but without a valid attestation you cant be confident about the environment.

thisiskarthikj commented 10 months ago

@YurkoWasHere @hiroki-chen

Sorry, late to the party here. The summary is that for the following combinations of driver and vBIOS versions, we get index "9" mismatch, correct ?

    Driver version : 535.86.10
    VBIOS version : 96.00.74.00.1a

    Driver version : 535.104.05
VBIOS version : 96.00.74.00.1c

Is this happening with the latest commit whereas you don't see this issue with an older commit ? Once you confirm, we will try to re-create the setup and try it and will get back to you. Hang in there !

hiroki-chen commented 10 months ago

@YurkoWasHere @hiroki-chen

Sorry, late to the party here. The summary is that for the following combinations of driver and vBIOS versions, we get index "9" mismatch, correct ?
    Driver version : 535.86.10
    VBIOS version : 96.00.74.00.1a

    Driver version : 535.104.05
VBIOS version : 96.00.74.00.1c
Is this happening with the latest commit whereas you don't see this issue with an older commit ? Once you confirm, we will try to re-create the setup and try it and will get back to you. Hang in there !

@thisiskarthikj

Technically it is (for the single-GPU case), but one thing that appears weird to me was that I was able to attest the H100 GPU using the latest commit:

h100@h100-cvm:~$ python3 -m verifier.cc_admin --allow_hold_cert --user_mode
Number of GPUs available : 1
-----------------------------------
Fetching GPU 0 information from GPU driver.
Using the Nonce generated by Local GPU Verifier
VERIFYING GPU : 0
        Driver version fetched : 535.104.05
        VBIOS version fetched : 96.00.74.00.1a
        Validating GPU certificate chains.
                GPU attestation report certificate chain validation successful.
                        The certificate chain revocation status verification successful.
        Authenticating attestation report
                The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
                Driver version fetched from the attestation report : 535.104.05
                VBIOS version fetched from the attestation report : 96.00.74.00.1a
                Attestation report signature verification successful.
                Attestation report verification successful.
        Authenticating the RIMs.
                Authenticating Driver RIM
                        Fetching the driver RIM from the RIM service.
                        RIM Schema validation passed.
                        driver RIM certificate chain verification successful.
                        The certificate chain revocation status verification successful.
                        driver RIM signature verification successful.
                        Driver RIM verification successful
                Authenticating VBIOS RIM.
                        Fetching the VBIOS RIM from the RIM service.
                        RIM Schema validation passed.
                        vbios RIM certificate chain verification successful.
                        The certificate chain revocation status verification successful.
                        vbios RIM signature verification successful.
                        VBIOS RIM verification successful
        Comparing measurements (runtime vs golden)
                        The runtime measurements are matching with the golden measurements.                            
                GPU is in expected state.
        GPU 0 verified successfully.
        GPU Attested Successfully

This happened after I installed another GPU on the machine:

$ lspci -d 10de:
41:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)
61:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)

And I let the first GPU (41:00.0) bind to VFIO and enabled CC for it.

sudo python3 gpu_cc_tool.py --gpu-bdf=41:00 --set-cc-mode on --reset-after-cc-mode-switch

1.2.0 could work for this scenario but not 1.1.0:

h100@h100-cvm:/shared/nvtrust/guest_tools/attestation_sdk$ python3 -m verifier.cc_admin --allow_hold_cert --user_mode
Number of GPUs available : 1
-----------------------------------
Fetching GPU 0 information from GPU driver.
Using the Nonce generated by Local GPU Verifier
using the new pinned root cert
VERIFYING GPU : 0
        Driver version fetched : 535.104.05
        VBIOS version fetched : 96.00.74.00.1a
        Validating GPU certificate chains.
                GPU attestation report certificate chain validation successful.
                The certificate chain revocation status verification successful.
        Authenticating attestation report
                The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
                Driver version fetched from the attestation report : 535.104.05
                VBIOS version fetched from the attestation report : 96.00.74.00.1a
                Attestation report signature verification successful.
                Attestation report verification successful.
        Authenticating the RIMs.
                Authenticating Driver RIM
                        RIM Schema validation passed.
                        driver RIM certificate chain verification successful.
                        WARNING: THE CERTIFICATE NVIDIA Reference Value L3 GH100 001 IS REVOKED WITH THE STATUS AS 'CERTIFICATE_HOLD'.
                The certificate chain revocation status verification was not successful but continuing.
                        driver RIM signature verification successful.
                        Driver RIM verification successful
                Authenticating VBIOS RIM.
                        RIM Schema validation passed.
                        vbios RIM certificate chain verification successful.
                        WARNING: THE CERTIFICATE NVIDIA Reference Value L3 GH100 001 IS REVOKED WITH THE STATUS AS 'CERTIFICATE_HOLD'.
                The certificate chain revocation status verification was not successful but continuing.
                        vbios RIM signature verification successful.
                        VBIOS RIM verification successful
        Comparing measurements (runtime vs golden)
                        The runtime measurements are not matching with the
                        golden measurements at the following indexes(starting from 0) :
                        [
                        9, 
                        36
                        ]
The verification of GPU 0 resulted in failure.
        GPU Attestation failed

Summary

Multi-GPU

SDK 1.2.0: Works for multiple GPUs (although I used only one GPU inside the CVM) SDK 1.1.0: Measurement mismatch for index 9 and 36.

Single-GPU

SDK 1.2.0: Measurement mismatch for index 9 SDK 1.1.0: Works fine.

The GPU worked on CC mode.

seungsoo-lee commented 10 months ago

@hiroki-chen @YurkoWasHere

I'm little confused. So, you mean that

first, on the guest VM, the attestation test should be executed and the outputs should be True.
second, on the guest VM, after installing pytorch, and run one of the ML sample codes (is it okay to run very simple code?)
third, after the code runs, Confidential Compute GPUs Ready state becomes Ready.
fourth, k8s workloads can be successfully deployed.

hiroki-chen commented 10 months ago

@hiroki-chen @YurkoWasHere

I'm little confused. So, you mean that

first, on the guest VM, the attestation test should be executed and the outputs should be True.

second, on the guest VM, after installing pytorch, and run one of the ML sample codes (is it okay to run very simple code?)

third, after the code runs, Confidential Compute GPUs Ready state becomes Ready.

fourth, k8s workloads can be successfully deployed.

Steps 1-3 are meant for preparatory purposes: to confirm that the GPU works in CC mode.

after the code runs, Confidential Compute GPUs Ready state becomes Ready.

Not necessarily (the script might be buggy or you are running in user mode but setting to Ready requires admin privilege). You could enable it manually via

sudo nvidia-smi conf-compute -srs 1

YurkoWasHere commented 10 months ago

Still have the same issue with latest commit :(

root@(none):/init.d# python3 -m verifier.cc_admin --allow_hold_cert
Number of GPUs available : 1
-----------------------------------
Fetching GPU 0 information from GPU driver.
Using the Nonce generated by Local GPU Verifier
VERIFYING GPU : 0
    Driver version fetched : 535.104.05
    VBIOS version fetched : 96.00.74.00.1c
    Validating GPU certificate chains.
        GPU attestation report certificate chain validation successful.
            The certificate chain revocation status verification successful.
    Authenticating attestation report
        The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
        Driver version fetched from the attestation report : 535.104.05
        VBIOS version fetched from the attestation report : 96.00.74.00.1c
        Attestation report signature verification successful.
        Attestation report verification successful.
    Authenticating the RIMs.
        Authenticating Driver RIM
            Fetching the driver RIM from the RIM service.
            RIM Schema validation passed.
            driver RIM certificate chain verification successful.
            The certificate chain revocation status verification successful.
            driver RIM signature verification successful.
            Driver RIM verification successful
        Authenticating VBIOS RIM.
            Fetching the VBIOS RIM from the RIM service.
            RIM Schema validation passed.
            vbios RIM certificate chain verification successful.
            The certificate chain revocation status verification successful.
            vbios RIM signature verification successful.
            VBIOS RIM verification successful
    Comparing measurements (runtime vs golden)
            The runtime measurements are not matching with the
                        golden measurements at the following indexes(starting from 0) :
            [
            9
            ]
    GPU Ready state is already NOT READY
The verification of GPU 0 resulted in failure.
    GPU Attestation failed
root@(none):/init.d#

hiroki-chen commented 10 months ago

Still have the same issue with latest commit :(

root@(none):/init.d# python3 -m verifier.cc_admin --allow_hold_cert
Number of GPUs available : 1
-----------------------------------
Fetching GPU 0 information from GPU driver.
Using the Nonce generated by Local GPU Verifier
VERIFYING GPU : 0
  Driver version fetched : 535.104.05
  VBIOS version fetched : 96.00.74.00.1c
  Validating GPU certificate chains.
      GPU attestation report certificate chain validation successful.
          The certificate chain revocation status verification successful.
  Authenticating attestation report
      The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
      Driver version fetched from the attestation report : 535.104.05
      VBIOS version fetched from the attestation report : 96.00.74.00.1c
      Attestation report signature verification successful.
      Attestation report verification successful.
  Authenticating the RIMs.
      Authenticating Driver RIM
          Fetching the driver RIM from the RIM service.
          RIM Schema validation passed.
          driver RIM certificate chain verification successful.
          The certificate chain revocation status verification successful.
          driver RIM signature verification successful.
          Driver RIM verification successful
      Authenticating VBIOS RIM.
          Fetching the VBIOS RIM from the RIM service.
          RIM Schema validation passed.
          vbios RIM certificate chain verification successful.
          The certificate chain revocation status verification successful.
          vbios RIM signature verification successful.
          VBIOS RIM verification successful
  Comparing measurements (runtime vs golden)
          The runtime measurements are not matching with the
                        golden measurements at the following indexes(starting from 0) :
          [
          9
          ]
  GPU Ready state is already NOT READY
The verification of GPU 0 resulted in failure.
  GPU Attestation failed
root@(none):/init.d#

@YurkoWasHere I believe they haven't updated yet

yunbo-xufeng commented 5 months ago

Hi,

Is this problem resolved? I met exactly the same issue with the latest commit:

Number of GPUs available : 1
-----------------------------------
Fetching GPU 0 information from GPU driver.
Using the Nonce generated by Local GPU Verifier
VERIFYING GPU : 0
    Driver version fetched : 535.104.05
    VBIOS version fetched : 96.00.74.00.1f
    Validating GPU certificate chains.
        GPU attestation report certificate chain validation successful.
            The certificate chain revocation status verification successful.
    Authenticating attestation report
        The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
        Driver version fetched from the attestation report : 535.104.05
        VBIOS version fetched from the attestation report : 96.00.74.00.1f
        Attestation report signature verification successful.
        Attestation report verification successful.
    Authenticating the RIMs.
        Authenticating Driver RIM
            Fetching the driver RIM from the RIM service.
            RIM Schema validation passed.
            driver RIM certificate chain verification successful.
            The certificate chain revocation status verification successful.
            driver RIM signature verification successful.
            Driver RIM verification successful
        Authenticating VBIOS RIM.
            Fetching the VBIOS RIM from the RIM service.
            RIM Schema validation passed.
            vbios RIM certificate chain verification successful.
            The certificate chain revocation status verification successful.
            vbios RIM signature verification successful.
            VBIOS RIM verification successful
    Comparing measurements (runtime vs golden)
            The runtime measurements are not matching with the
                        golden measurements at the following indexes(starting from 0) :
            [
            9
            ]
    GPU Ready state is already NOT READY
The verification of GPU 0 resulted in failure.
    GPU Attestation failed

YurkoWasHere commented 5 months ago

Unfortunately i no longer have the H100 paired with an AMD, but moved on to Intel which required the newest version. Last i checked (about a month ago) it still was not working

If you still are having issues try using the older commit git checkout 4383b82

yunbo-xufeng commented 5 months ago

Unfortunately i no longer have the H100 paired with an AMD, but moved on to Intel which required the newest version. Last i checked (about a month ago) it still was not working

If you still are having issues try using the older commit git checkout 4383b82

Actually, I'm not using AMD, my environment is Intel+TDX CVM，GPU is H800。 And I also run the remote test, looks like gpu measurement is also not matched:

[RemoteGPUTest] node name : thisNode1
[['REMOTE_GPU_CLAIMS', <Devices.GPU: 2>, <Environment.REMOTE: 5>, 'https://nras.attestation.nvidia.com/v1/attest/gpu', '', '']]
[RemoteGPUTest] call attest() - expecting True
generate_evidence
Fetching GPU 0 information from GPU driver.
Calling NRAS to attest GPU evidence...
**** Attestation Successful ****
Entity Attestation Token is eyJraWQiOiJudi1lYXQta2lkLXByb2QtMjAyNDA2MDQyMzU4NDY4NTgtZjk4MjYwYzYtZmVlOC00ZTU3LWJlMDEtMTliNWE1YTkwNTc0IiwiYWxnIjoiRVMzODQifQ.eyJzdWIiOiJOVklESUEtR1BVLUFUVEVTVEFUSU9OIiwic2VjYm9vdCI6dHJ1ZSwieC1udmlkaWEtZ3B1LW1hbnVmYWN0dXJlciI6Ik5WSURJQSBDb3Jwb3JhdGlvbiIsIngtbnZpZGlhLWF0dGVzdGF0aW9uLXR5cGUiOiJHUFUiLCJpc3MiOiJodHRwczovL25yYXMuYXR0ZXN0YXRpb24ubnZpZGlhLmNvbSIsImVhdF9ub25jZSI6IjkzMUQ4REQwQUREMjAzQUMzRDhCNEZCREU3NUUxMTUyNzhFRUZDRENFQUM1Qjg3NjcxQTc0OEYzMjM2NERGQ0IiLCJ4LW52aWRpYS1hdHRlc3RhdGlvbi1kZXRhaWxlZC1yZXN1bHQiOnsieC1udmlkaWEtZ3B1LWRyaXZlci1yaW0tc2NoZW1hLXZhbGlkYXRlZCI6dHJ1ZSwieC1udmlkaWEtZ3B1LXZiaW9zLXJpbS1jZXJ0LXZhbGlkYXRlZCI6dHJ1ZSwieC1udmlkaWEtbWlzbWF0Y2gtbWVhc3VyZW1lbnQtcmVjb3JkcyI6W3siaW5kZXgiOjksImdvbGRlblNpemUiOjQ4LCJnb2xkZW5WYWx1ZSI6IjA1OWIzMmU3MTJhMTUzZjQ5MGRiZmI3OTc2YTllMjc1ZDc4OWUyOGJkNDgwM2MzNTdkZWYyYjYxMjMzMjdjNDMwNTI2YmZhZWNjMjAwZjQ5NmQ0ZTE0OWZjNWVhZGUwMyIsInJ1bnRpbWVTaXplIjo0OCwicnVudGltZVZhbHVlIjoiN2YzZTkzODI3ODU1MTNjMTkzMmRmY2M5ZTg3ZjZlZjZiZjVmZWZlODgxNDRjNmVhNDg1MzllNjVmOTM3MDEzZGQ3MzQ5MTQ0ZTVmNDM5ZGNlYTQwMWRhYzI2ZTVjMDk4In1dLCJ4LW52aWRpYS1ncHUtYXR0ZXN0YXRpb24tcmVwb3J0LWNlcnQtY2hhaW4tdmFsaWRhdGVkIjp0cnVlLCJ4LW52aWRpYS1ncHUtZHJpdmVyLXJpbS1zY2hlbWEtZmV0Y2hlZCI6dHJ1ZSwieC1udmlkaWEtZ3B1LWF0dGVzdGF0aW9uLXJlcG9ydC1wYXJzZWQiOnRydWUsIngtbnZpZGlhLWdwdS1ub25jZS1tYXRjaCI6dHJ1ZSwieC1udmlkaWEtZ3B1LXZiaW9zLXJpbS1zaWduYXR1cmUtdmVyaWZpZWQiOnRydWUsIngtbnZpZGlhLWdwdS1kcml2ZXItcmltLXNpZ25hdHVyZS12ZXJpZmllZCI6dHJ1ZSwieC1udmlkaWEtZ3B1LWFyY2gtY2hlY2siOnRydWUsIngtbnZpZGlhLWF0dGVzdGF0aW9uLXdhcm5pbmciOm51bGwsIngtbnZpZGlhLWdwdS1tZWFzdXJlbWVudHMtbWF0Y2giOmZhbHNlLCJ4LW52aWRpYS1taXNtYXRjaC1pbmRleGVzIjpbOV0sIngtbnZpZGlhLWdwdS1hdHRlc3RhdGlvbi1yZXBvcnQtc2lnbmF0dXJlLXZlcmlmaWVkIjp0cnVlLCJ4LW52aWRpYS1ncHUtdmJpb3MtcmltLXNjaGVtYS12YWxpZGF0ZWQiOnRydWUsIngtbnZpZGlhLWdwdS1kcml2ZXItcmltLWNlcnQtdmFsaWRhdGVkIjp0cnVlLCJ4LW52aWRpYS1ncHUtdmJpb3MtcmltLXNjaGVtYS1mZXRjaGVkIjp0cnVlLCJ4LW52aWRpYS1ncHUtdmJpb3MtcmltLW1lYXN1cmVtZW50cy1hdmFpbGFibGUiOnRydWUsIngtbnZpZGlhLWdwdS1kcml2ZXItcmltLWRyaXZlci1tZWFzdXJlbWVudHMtYXZhaWxhYmxlIjp0cnVlfSwieC1udmlkaWEtdmVyIjoiMS4wIiwibmJmIjoxNzE3NTczOTg5LCJ4LW52aWRpYS1ncHUtZHJpdmVyLXZlcnNpb24iOiI1MzUuMTA0LjA1IiwiZGJnc3RhdCI6ImRpc2FibGVkIiwiaHdtb2RlbCI6IkdIMTAwIEEwMSBHU1AgQlJPTSIsIm9lbWlkIjoiNTcwMyIsIm1lYXNyZXMiOiJjb21wYXJpc29uLWZhaWwiLCJleHAiOjE3MTc1Nzc1ODksImlhdCI6MTcxNzU3Mzk4OSwieC1udmlkaWEtZWF0LXZlciI6IkVBVC0yMSIsInVlaWQiOiI0NzkxNTk0NTUzMTI5NDk4MDg5NzcwMzQwMTMxMzA5NTEzNzI5MDc0Mjc3OTc4NzYiLCJ4LW52aWRpYS1ncHUtdmJpb3MtdmVyc2lvbiI6Ijk2LjAwLjc0LjAwLjFGIiwianRpIjoiODQ1ZWU4ZDQtN2IzMS00M2E3LWI1NjQtNmI4ZGNjMmVkY2JmIn0.5ecQ6aopvHsTuCXN9tqfmZKVTAB4VzW5auoNgwVlSeGNbqXoSm8PmEsRmQLO6btjeyTOV-iNixJnDqjbuNjuR8_qRw5uWwLAZUd-cJAwLYjmOPPKObJbDF1H8TalDOC2
True
[RemoteGPUTest] token : [["JWT", "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJOVi1BdHRlc3RhdGlvbi1TREsiLCJpYXQiOjE3MTc1NzM5ODUsImV4cCI6bnVsbH0.2UbZe2h_TLQTad2QjDIKIXuaSiTpDM7oj3nlhRNlqqY"], {"REMOTE_GPU_CLAIMS": "eyJraWQiOiJudi1lYXQta2lkLXByb2QtMjAyNDA2MDQyMzU4NDY4NTgtZjk4MjYwYzYtZmVlOC00ZTU3LWJlMDEtMTliNWE1YTkwNTc0IiwiYWxnIjoiRVMzODQifQ.eyJzdWIiOiJOVklESUEtR1BVLUFUVEVTVEFUSU9OIiwic2VjYm9vdCI6dHJ1ZSwieC1udmlkaWEtZ3B1LW1hbnVmYWN0dXJlciI6Ik5WSURJQSBDb3Jwb3JhdGlvbiIsIngtbnZpZGlhLWF0dGVzdGF0aW9uLXR5cGUiOiJHUFUiLCJpc3MiOiJodHRwczovL25yYXMuYXR0ZXN0YXRpb24ubnZpZGlhLmNvbSIsImVhdF9ub25jZSI6IjkzMUQ4REQwQUREMjAzQUMzRDhCNEZCREU3NUUxMTUyNzhFRUZDRENFQUM1Qjg3NjcxQTc0OEYzMjM2NERGQ0IiLCJ4LW52aWRpYS1hdHRlc3RhdGlvbi1kZXRhaWxlZC1yZXN1bHQiOnsieC1udmlkaWEtZ3B1LWRyaXZlci1yaW0tc2NoZW1hLXZhbGlkYXRlZCI6dHJ1ZSwieC1udmlkaWEtZ3B1LXZiaW9zLXJpbS1jZXJ0LXZhbGlkYXRlZCI6dHJ1ZSwieC1udmlkaWEtbWlzbWF0Y2gtbWVhc3VyZW1lbnQtcmVjb3JkcyI6W3siaW5kZXgiOjksImdvbGRlblNpemUiOjQ4LCJnb2xkZW5WYWx1ZSI6IjA1OWIzMmU3MTJhMTUzZjQ5MGRiZmI3OTc2YTllMjc1ZDc4OWUyOGJkNDgwM2MzNTdkZWYyYjYxMjMzMjdjNDMwNTI2YmZhZWNjMjAwZjQ5NmQ0ZTE0OWZjNWVhZGUwMyIsInJ1bnRpbWVTaXplIjo0OCwicnVudGltZVZhbHVlIjoiN2YzZTkzODI3ODU1MTNjMTkzMmRmY2M5ZTg3ZjZlZjZiZjVmZWZlODgxNDRjNmVhNDg1MzllNjVmOTM3MDEzZGQ3MzQ5MTQ0ZTVmNDM5ZGNlYTQwMWRhYzI2ZTVjMDk4In1dLCJ4LW52aWRpYS1ncHUtYXR0ZXN0YXRpb24tcmVwb3J0LWNlcnQtY2hhaW4tdmFsaWRhdGVkIjp0cnVlLCJ4LW52aWRpYS1ncHUtZHJpdmVyLXJpbS1zY2hlbWEtZmV0Y2hlZCI6dHJ1ZSwieC1udmlkaWEtZ3B1LWF0dGVzdGF0aW9uLXJlcG9ydC1wYXJzZWQiOnRydWUsIngtbnZpZGlhLWdwdS1ub25jZS1tYXRjaCI6dHJ1ZSwieC1udmlkaWEtZ3B1LXZiaW9zLXJpbS1zaWduYXR1cmUtdmVyaWZpZWQiOnRydWUsIngtbnZpZGlhLWdwdS1kcml2ZXItcmltLXNpZ25hdHVyZS12ZXJpZmllZCI6dHJ1ZSwieC1udmlkaWEtZ3B1LWFyY2gtY2hlY2siOnRydWUsIngtbnZpZGlhLWF0dGVzdGF0aW9uLXdhcm5pbmciOm51bGwsIngtbnZpZGlhLWdwdS1tZWFzdXJlbWVudHMtbWF0Y2giOmZhbHNlLCJ4LW52aWRpYS1taXNtYXRjaC1pbmRleGVzIjpbOV0sIngtbnZpZGlhLWdwdS1hdHRlc3RhdGlvbi1yZXBvcnQtc2lnbmF0dXJlLXZlcmlmaWVkIjp0cnVlLCJ4LW52aWRpYS1ncHUtdmJpb3MtcmltLXNjaGVtYS12YWxpZGF0ZWQiOnRydWUsIngtbnZpZGlhLWdwdS1kcml2ZXItcmltLWNlcnQtdmFsaWRhdGVkIjp0cnVlLCJ4LW52aWRpYS1ncHUtdmJpb3MtcmltLXNjaGVtYS1mZXRjaGVkIjp0cnVlLCJ4LW52aWRpYS1ncHUtdmJpb3MtcmltLW1lYXN1cmVtZW50cy1hdmFpbGFibGUiOnRydWUsIngtbnZpZGlhLWdwdS1kcml2ZXItcmltLWRyaXZlci1tZWFzdXJlbWVudHMtYXZhaWxhYmxlIjp0cnVlfSwieC1udmlkaWEtdmVyIjoiMS4wIiwibmJmIjoxNzE3NTczOTg5LCJ4LW52aWRpYS1ncHUtZHJpdmVyLXZlcnNpb24iOiI1MzUuMTA0LjA1IiwiZGJnc3RhdCI6ImRpc2FibGVkIiwiaHdtb2RlbCI6IkdIMTAwIEEwMSBHU1AgQlJPTSIsIm9lbWlkIjoiNTcwMyIsIm1lYXNyZXMiOiJjb21wYXJpc29uLWZhaWwiLCJleHAiOjE3MTc1Nzc1ODksImlhdCI6MTcxNzU3Mzk4OSwieC1udmlkaWEtZWF0LXZlciI6IkVBVC0yMSIsInVlaWQiOiI0NzkxNTk0NTUzMTI5NDk4MDg5NzcwMzQwMTMxMzA5NTEzNzI5MDc0Mjc3OTc4NzYiLCJ4LW52aWRpYS1ncHUtdmJpb3MtdmVyc2lvbiI6Ijk2LjAwLjc0LjAwLjFGIiwianRpIjoiODQ1ZWU4ZDQtN2IzMS00M2E3LWI1NjQtNmI4ZGNjMmVkY2JmIn0.5ecQ6aopvHsTuCXN9tqfmZKVTAB4VzW5auoNgwVlSeGNbqXoSm8PmEsRmQLO6btjeyTOV-iNixJnDqjbuNjuR8_qRw5uWwLAZUd-cJAwLYjmOPPKObJbDF1H8TalDOC2"}]
[RemoteGPUTest] call validate_token() - expecting True
***** Validating Signature using JWKS endpont https://nras.attestation.nvidia.com/.well-known/jwks.json ******
Decoded Token  {
  "sub": "NVIDIA-GPU-ATTESTATION",
  "secboot": true,
  "x-nvidia-gpu-manufacturer": "NVIDIA Corporation",
  "x-nvidia-attestation-type": "GPU",
  "iss": "https://nras.attestation.nvidia.com",
  "eat_nonce": "931D8DD0ADD203AC3D8B4FBDE75E115278EEFCDCEAC5B87671A748F32364DFCB",
  "x-nvidia-attestation-detailed-result": {
    "x-nvidia-gpu-driver-rim-schema-validated": true,
    "x-nvidia-gpu-vbios-rim-cert-validated": true,
    "x-nvidia-mismatch-measurement-records": [
      {
        "index": 9,
        "goldenSize": 48,
        "goldenValue": "059b32e712a153f490dbfb7976a9e275d789e28bd4803c357def2b6123327c430526bfaecc200f496d4e149fc5eade03",
        "runtimeSize": 48,
        "runtimeValue": "7f3e9382785513c1932dfcc9e87f6ef6bf5fefe88144c6ea48539e65f937013dd7349144e5f439dcea401dac26e5c098"
      }
    ],
    "x-nvidia-gpu-attestation-report-cert-chain-validated": true,
    "x-nvidia-gpu-driver-rim-schema-fetched": true,
    "x-nvidia-gpu-attestation-report-parsed": true,
    "x-nvidia-gpu-nonce-match": true,
    "x-nvidia-gpu-vbios-rim-signature-verified": true,
    "x-nvidia-gpu-driver-rim-signature-verified": true,
    "x-nvidia-gpu-arch-check": true,
    "x-nvidia-attestation-warning": null,
    "x-nvidia-gpu-measurements-match": false,
    "x-nvidia-mismatch-indexes": [
      9
    ],
    "x-nvidia-gpu-attestation-report-signature-verified": true,
    "x-nvidia-gpu-vbios-rim-schema-validated": true,
    "x-nvidia-gpu-driver-rim-cert-validated": true,
    "x-nvidia-gpu-vbios-rim-schema-fetched": true,
    "x-nvidia-gpu-vbios-rim-measurements-available": true,
    "x-nvidia-gpu-driver-rim-driver-measurements-available": true
  },
  "x-nvidia-ver": "1.0",
  "nbf": 1717573989,
  "x-nvidia-gpu-driver-version": "535.104.05",
  "dbgstat": "disabled",
  "hwmodel": "GH100 A01 GSP BROM",
  "oemid": "5703",
  "measres": "comparison-fail",
  "exp": 1717577589,
  "iat": 1717573989,
  "x-nvidia-eat-ver": "EAT-21",
  "ueid": "479159455312949808977034013130951372907427797876",
  "x-nvidia-gpu-vbios-version": "96.00.74.00.1F",
  "jti": "845ee8d4-7b31-43a7-b564-6b8dcc2edcbf"
}
***** JWT token signature is valid. *****
    [ERROR] Invalid token. Authorized claims does not match the appraisal policy:  x-nvidia-gpu-measurements-match
False

hiroki-chen commented 5 months ago

@yunbo-xufeng I'm not sure if the old commit works for H800 but H100 is supported :/ Did you try the 4383b82 commit? If you tried with that commit and remote attestation failed then I think you'll probably have to wait for NVIDIA's team to fix this issue.

thisiskarthikj commented 5 months ago

@hiroki-chen @yunbo-xufeng Measurements mismatch could be an issue with RIM file itself. We will take a look and get back to you.

thisiskarthikj commented 5 months ago

@yunbo-xufeng Can you get me the version of nvidia_gpu_tools.py that you are using ?

python3 nvidia_gpu_tools.py --help | grep version

NVIDIA / nvtrust

The runtime measurements are not matching #28

Summary

Multi-GPU

Single-GPU