NVIDIA / nvtrust

Ancillary open source software to support confidential computing on NVIDIA GPUs
Apache License 2.0

Unable to determine the device handle for GPU0009:01:00.0: Unknown Error #30

Closed. CasellaJr closed this issue 5 months ago

CasellaJr commented 11 months ago

Hello.

I enabled CC mode with the provided gpu_cc_tool.py by running python3 ./gpu_cc_tool.py --gpu-name=H100 --set-cc-mode=on --reset-after-cc-mode-switch. Output:

NVIDIA GPU Tools version 535.86.06
Topo:
  PCI 0009:00:00.0 0x10de:0x22b1
   GPU 0009:01:00.0 H100-PCIE 0x2342 BAR0 0x661002000000
2023-12-29,11:51:50.039 INFO     Selected GPU 0009:01:00.0 H100-PCIE 0x2342 BAR0 0x661002000000
2023-12-29,11:51:50.191 INFO     GPU 0009:01:00.0 H100-PCIE 0x2342 BAR0 0x661002000000 CC mode set to on. It will be active after GPU reset.
2023-12-29,11:51:51.976 INFO     GPU 0009:01:00.0 H100-PCIE 0x2342 BAR0 0x661002000000 was reset to apply the new CC mode.

Then, when running nvidia-smi, I get the following error: Unable to determine the device handle for GPU0009:01:00.0: Unknown Error

My architecture:

Architecture:            aarch64
CPU op-mode(s):          64-bit
Byte Order:              Little Endian
CPU(s):                  72
On-line CPU(s) list:     0-71
Vendor ID:               ARM
Model:                   0
Thread(s) per core:      1
Core(s) per socket:      72
Socket(s):               1
Stepping:                r0p0
Frequency boost:         disabled
CPU max MHz:             3447.0000
CPU min MHz:             81.0000
BogoMIPS:                2000.00

Even with this error, switching CC mode off was working. Unfortunately, I do not have that output, because I did sudo reboot. After the reboot: No devices were found. After this I ran nvidia-bug-report.sh, which is attached to this discussion. Finally, below is the output of ubuntu-drivers devices:

== /sys/devices/pci0009:00/0009:00:00.0/0009:01:00.0 ==
modalias : pci:v000010DEd00002342sv000010DEsd00001809bc03sc02i00
vendor   : NVIDIA Corporation
manual_install: True
driver   : nvidia-driver-535-server - distro non-free
driver   : nvidia-driver-535-server-open - distro non-free recommended
driver   : nvidia-driver-535-open - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin

nvidia-bug-report.log.gz

Tan-YiFan commented 11 months ago

It seems that setting CC mode to on puts up a barrier that prevents the host from operating this GPU (including running nvidia-smi and CUDA applications). After you set CC mode to on, you should bind this GPU to VFIO and pass it through to the guest VM. In the guest, nvidia-smi should work.
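
On this system, the host-side binding would look roughly like the sketch below; the vendor:device pair 10de 2342 and the address 0009:01:00.0 are taken from the outputs in your first post, and if another driver already owns the device you may need to unbind it first. The deployment guide is the authoritative reference.

# Load vfio-pci and let it claim the H100 (vendor 10de, device 2342)
sudo modprobe vfio-pci
sudo sh -c "echo 10de 2342 > /sys/bus/pci/drivers/vfio-pci/new_id"
# Confirm that vfio-pci is now the kernel driver in use for the GPU
lspci -nnk -s 0009:01:00.0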

Tan-YiFan commented 11 months ago

By the way, are you working on ARM CCA? Nvidia's guide only lists Intel TDX and AMD SEV-SNP as supported CPUs.

CasellaJr commented 11 months ago

Sorry, I do not know what VFIO is, so I do not know how to do the passthrough. And yes, I am using ARM... BTW, after the reboot and about 1 hour, nvidia-smi is working again. However, if I try to run a Docker container, for example, it does not seem to work properly:

ERROR: The NVIDIA Driver is present, but CUDA failed to initialize.  GPU functionality will not be available.
   [[ System not yet initialized (error 802) ]]

Tan-YiFan commented 11 months ago

Nvidia Confidential Computing works with confidential virtual machines. You can refer to the deployment guide provided by Nvidia; the section "Identifying the GPUs to be Passed Through to the Guest" gives the commands related to VFIO.

CasellaJr commented 11 months ago

But, according to the whitepaper, it seems that ARM CCA is supported.

Tan-YiFan commented 11 months ago

OK. You could try using the H100 in a CCA VM. I guess H100 CC does not work on the host machine.

Since AMD SEV-SNP works, CCA should work too. The Nvidia driver in the guest communicates with the GPU through a shared memory region, which is outside the enclave. There should not be any differences in how shared memory is used among these CVM architectures.
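
Once the guest is up with the GPU passed through, a quick check from inside the guest could look like the sketch below. Note the conf-compute flags are my assumption from reading the deployment guide and may differ between driver releases:

nvidia-smi                       # the H100 should now be enumerated inside the guest
nvidia-smi conf-compute -f       # assumed flag: query whether CC mode is ON
nvidia-smi conf-compute -srs 1   # assumed flag: set the GPU ready state so CUDA apps can start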

CasellaJr commented 11 months ago

Sorry, just to be sure: in order to use CC mode, should I start following the guide from the point you told me, "Identifying the GPUs to be Passed Through to the Guest"? Because I started from the beginning, and I am encountering a lot of errors when preparing to build the kernel.

build.py...
 : error 7000: Failed to execute command
    make tbuild [/root/brunofolder/AMDSEV/ovmf/Build/OvmfX64/DEBUG_GCC5/X64/NetworkPkg/Library/DxeIpIoLib/DxeIpIoLib]

build.py...
 : error 7000: Failed to execute command
    make tbuild [/root/brunofolder/AMDSEV/ovmf/Build/OvmfX64/DEBUG_GCC5/X64/NetworkPkg/Library/DxeNetLib/DxeNetLib]

build.py...
 : error F002: Failed to build module
    /root/brunofolder/AMDSEV/ovmf/NetworkPkg/Library/DxeIpIoLib/DxeIpIoLib.inf [X64, GCC5, DEBUG]

- Failed -
Build end time: 14:04:45, Dec.29 2023
Build total time: 00:00:03

ERROR: nice build -q --cmd-len=64436 -DDEBUG_ON_SERIAL_PORT=TRUE -n 72 -t GCC5 -a X64 -p OvmfPkg/OvmfPkgX64.dsc

Tan-YiFan commented 11 months ago

Are you booting an AMD SEV-SNP VM or an ARM CCA VM?

If you are booting a CCA VM, you should begin with "Identifying the GPUs to be Passed Through to the Guest". After you finish that step, add -device vfio-pci,host=09:01.0 to the qemu script, and then boot the VM.
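
As a minimal sketch, the addition would look like the lines below. The full address 0009:01:00.0 is taken from your first post; as far as I know, qemu assumes domain 0000 when the short 09:01.0 form is used, so with a non-zero domain the full form is safer:

# appended to the end of your CCA VM's qemu command line
-device pcie-root-port,id=pci.1,bus=pcie.0 \
-device vfio-pci,host=0009:01:00.0,bus=pci.1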

CasellaJr commented 11 months ago

So, basically the steps are the following:

  1. Install requirements
    sudo apt update
    sudo apt install -y ninja-build iasl nasm flex bison openssl dkms autoconf zlib1g-dev python3-pip libncurses-dev libssl-dev libelf-dev libudev-dev libpci-dev libiberty-dev libtool libsdl-console libsdl-console-dev libpango1.0-dev libjpeg8-dev libpixman-1-dev libcairo2-dev libgif-dev libglib2.0-dev git-lfs jq qemu-system
    sudo pip3 install numpy flex bison
  2. Downloading GitHub Packages
    mkdir my_folder
    cd my_folder
    git clone https://github.com/AMDESE/AMDSEV.git
    git clone https://github.com/NVIDIA/nvtrust.git
  3. Identifying the GPUs to be Passed Through to the Guest
    lspci -d 10de:
    sudo sh -c "echo 10de xxxx > /sys/bus/pci/drivers/vfio-pci/new_id"
  4. Modify launch_vm.sh like this:
    
    #
    #  Copyright (c) 2023  NVIDIA CORPORATION & AFFILIATES. All rights reserved.
    #
    AMD_SEV_DIR=/shared/AMDSEV/snp-release-2023-07-18
    VDD_IMAGE=/shared/nvtrust/host_tools/sample_kvm_scripts/images/ubuntu22.04.qcow2

    #Hardware Settings
    NVIDIA_GPU=45:00.0
    MEM=64 #in GBs
    FWDPORT=9899

    doecho=false
    docc=true

    while getopts "exp:" flag
    do
            case ${flag} in
                    e) doecho=true;;
                    x) docc=false;;
                    p) FWDPORT=${OPTARG};;
            esac
    done

    NVIDIA_GPU=$(lspci -d 10de: | awk '/NVIDIA/{print $1}')
    NVIDIA_PASSTHROUGH=$(lspci -n -s $NVIDIA_GPU | awk -F: '{print $4}' | awk '{print $1}')

    if [ "$doecho" = true ]; then
            echo 10de $NVIDIA_PASSTHROUGH > /sys/bus/pci/drivers/vfio-pci/new_id
    fi

    if [ "$docc" = true ]; then
            USE_HCC=true
    fi

    $AMD_SEV_DIR/usr/local/bin/qemu-system-x86_64 \
    ${USE_HCC:+ -machine confidential-guest-support=sev0,vmport=off} \
    ${USE_HCC:+ -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1} \
    -enable-kvm -nographic -no-reboot \
    -cpu EPYC-v4 -machine q35 -smp 12,maxcpus=31 -m ${MEM}G,slots=2,maxmem=512G \
    -drive if=pflash,format=raw,unit=0,file=$AMD_SEV_DIR/usr/local/share/qemu/OVMF_CODE.fd,readonly=on \
    -drive file=$VDD_IMAGE,if=none,id=disk0,format=qcow2 \
    -device virtio-scsi-pci,id=scsi0,disable-legacy=on,iommu_platform=true \
    -device scsi-hd,drive=disk0 \
    -device virtio-net-pci,disable-legacy=on,iommu_platform=true,netdev=vmnic,romfile= \
    -netdev user,id=vmnic,hostfwd=tcp::$FWDPORT-:22 \
    -device pcie-root-port,id=pci.1,bus=pcie.0 \
    -device vfio-pci,host=$NVIDIA_GPU,bus=pci.1 \
    -device vfio-pci,host=09:01.0 \
    -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=262144

Tan-YiFan commented 11 months ago

-object sev-snp-guest implies that the script is for AMD SEV-SNP VMs.

CasellaJr commented 11 months ago

Are the other variables correct? I mean these: AMD_SEV_DIR, VDD_IMAGE, NVIDIA_GPU, MEM, FWDPORT

I am sure that NVIDIA_GPU needs to be 0009:01:00.0, according to the output of lspci -d 10de:, which is:

0000:00:00.0 PCI bridge: NVIDIA Corporation Device 22b2
0002:00:00.0 PCI bridge: NVIDIA Corporation Device 22b2
0004:00:00.0 PCI bridge: NVIDIA Corporation Device 22b2
0005:00:00.0 PCI bridge: NVIDIA Corporation Device 22b8
0006:00:00.0 PCI bridge: NVIDIA Corporation Device 22b2
0008:00:00.0 PCI bridge: NVIDIA Corporation Device 22b9
0009:00:00.0 PCI bridge: NVIDIA Corporation Device 22b1
0009:01:00.0 3D controller: NVIDIA Corporation Device 2342 (rev a1)

But what about the other variables?

Tan-YiFan commented 11 months ago

If you are targeting CCA VMs, you should first prepare a qemu script that boots a CCA VM without the H100.

The qemu script provided by Nvidia can be divided into two parts: the options that set up the confidential (SEV-SNP) guest itself, and the options that pass the GPU through. For CCA you would keep only the passthrough part and attach it to your own CCA VM script.
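
Annotated against the sample launch_vm.sh in this thread (only a sketch, not a drop-in script), the split looks roughly like this:

# Part 1: confidential-VM boot options. These are AMD SEV-SNP specific; a CCA VM
# launcher would replace them with its own equivalents.
${USE_HCC:+ -machine confidential-guest-support=sev0,vmport=off} \
${USE_HCC:+ -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1} \
-cpu EPYC-v4 -machine q35

# Part 2: GPU passthrough options. These are what you carry over to the CCA VM
# script; host= must match your lspci output.
-device pcie-root-port,id=pci.1,bus=pcie.0 \
-device vfio-pci,host=$NVIDIA_GPU,bus=pci.1 \
-fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=262144
# (the -fw_cfg line is OVMF-specific MMIO sizing for the large GPU BAR and may not
# apply to the firmware a CCA VM uses)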

CasellaJr commented 11 months ago

Sorry, but I have no idea; I have never worked on this stuff. This is my current launch_vm.sh, which is inside /nvtrust/host_tools/sample_kvm_scripts.

#
#  Copyright (c) 2023  NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
AMD_SEV_DIR=/root/brunofolder/AMDSEV/snp-release-2023-07-18
VDD_IMAGE=/shared/nvtrust/host_tools/sample_kvm_scripts/images/ubuntu22.04.qcow2

#Hardware Settings
NVIDIA_GPU=09:01.0
MEM=64 #in GBs
FWDPORT=9899

doecho=false
docc=true

while getopts "exp:" flag
do
        case ${flag} in
                e) doecho=true;;
                x) docc=false;;
                p) FWDPORT=${OPTARG};;
        esac
done

NVIDIA_GPU=$(lspci -d 10de: | awk '/NVIDIA/{print $1}')
NVIDIA_PASSTHROUGH=$(lspci -n -s $NVIDIA_GPU | awk -F: '{print $4}' | awk '{print $1}')

if [ "$doecho" = true ]; then
         echo 10de $NVIDIA_PASSTHROUGH > /sys/bus/pci/drivers/vfio-pci/new_id
fi

if [ "$docc" = true ]; then
        USE_HCC=true
fi

$AMD_SEV_DIR/usr/local/bin/qemu-system-x86_64 \
${USE_HCC:+ -machine confidential-guest-support=sev0,vmport=off} \
${USE_HCC:+ -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1} \
-enable-kvm -nographic -no-reboot \
-cpu EPYC-v4 -machine q35 -smp 12,maxcpus=31 -m ${MEM}G,slots=2,maxmem=512G \
-drive if=pflash,format=raw,unit=0,file=$AMD_SEV_DIR/usr/local/share/qemu/OVMF_CODE.fd,readonly=on \
-drive file=$VDD_IMAGE,if=none,id=disk0,format=qcow2 \
-device virtio-scsi-pci,id=scsi0,disable-legacy=on,iommu_platform=true \
-device scsi-hd,drive=disk0 \
-device virtio-net-pci,disable-legacy=on,iommu_platform=true,netdev=vmnic,romfile= \
-netdev user,id=vmnic,hostfwd=tcp::$FWDPORT-:22 \
-device pcie-root-port,id=pci.1,bus=pcie.0 \
-device vfio-pci,host=$NVIDIA_GPU,bus=pci.1 \
-fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=262144

How can I split this?

Tan-YiFan commented 11 months ago

This repo does not provide scripts for ARM CCA.

To run H100 CC on ARM, you should:

  • Check whether the CPU supports CCA. If not, Nvidia CC cannot work on the machine.
  • Then, prepare a script to run a CCA VM.
  • Add -device vfio-pci,host=$NVIDIA_GPU,bus=pci.1 to the script.

steven-bellock commented 11 months ago

ARM CCA is currently not supported. That whitepaper is more forward-looking, i.e. there will come a day when it is supported.

hiroki-chen commented 11 months ago

This repo does not provide scripts for ARM CCA.

To run H100 CC on ARM, you should:

  • Check whether the CPU supports CCA. If not, Nvidia CC cannot work on the machine.
  • Then, prepare a script to run a CCA VM.
  • Add -device vfio-pci,host=$NVIDIA_GPU,bus=pci.1 to the script.

@Tan-YiFan One interesting thing is that CCA hasn't been ratified yet (as far as I know), and NVIDIA is blocked waiting on upstream updates.

hiroki-chen commented 11 months ago

@CasellaJr If you want to use the CC functionalities of H100 GPUs, you must have an AMD SEV-SNP or Intel TDX capable machine available on your side and configure it according to NVIDIA's manual. The deployment guide only mentions TDX and SNP. So you cannot use H100 + CC on an unsupported machine, but you can certainly still utilize its computing resources to, e.g., train models more productively.

If you enable CC mode for the H100, it will block all host IO requests, since the H100 assumes the host platform is completely untrusted.
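
If the host ever needs to drive the GPU again (e.g. for driver maintenance), switching CC mode back off with the same tool shown in the first comment should restore host access, for example:

# switch CC mode off and reset, so the host driver can see the GPU again
python3 ./gpu_cc_tool.py --gpu-name=H100 --set-cc-mode=off --reset-after-cc-mode-switch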

hiroki-chen commented 11 months ago

Also, there is one thing we must be aware of: the H100's CC functionalities are still in early access, and many features are subject to frequent change. I believe CCA will be supported in the near future :)

CasellaJr commented 11 months ago

Hi everyone. I now have access to another machine with an H100 and an Intel Xeon Gold 6438Y+ processor, which seems to support TDX: https://www.intel.com/content/www/us/en/products/sku/232382/intel-xeon-gold-6438y-processor-60m-cache-2-00-ghz/specifications.html However, this machine runs AlmaLinux, while the deployment guide refers to Ubuntu. Do you think the guide is compatible?

Tan-YiFan commented 11 months ago

@CasellaJr

  1. To check whether the CPU supports TDX, you can refer to https://github.com/intel/tdx-tools/tree/tdx-1.5#12-hardware-availability (a quick dmesg check is also sketched below).
  2. The steps in Setting Up the Host OS (Intel TDX) would be incompatible. Does your machine support rhel-8? If so, you can refer to https://github.com/intel/tdx-tools/tree/2023ww15/build/rhel-8 for installing the host OS. The other steps are compatible (you can boot an Ubuntu guest VM).
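
For a rough host-side sanity check (the exact kernel log strings vary with the kernel and TDX module version, so treat this only as a hint and rely on the tdx-tools documentation above):

# on a correctly enabled host, the boot log usually mentions TDX initialization,
# e.g. a line similar to "virt/tdx: module initialized"
sudo dmesg | grep -i tdx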