It seems that ccmode=on builds a firewall that prevents the host from operating this GPU (including running nvidia-smi and CUDA applications). After you set the CC mode to on, you should bind this GPU to VFIO and pass it through to the guest VM. In the guest, nvidia-smi should work.
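For context, binding a GPU to vfio-pci on the host might look roughly like the following sketch; the PCI address is illustrative and the authoritative procedure is the one in Nvidia's deployment guide:

GPU_BDF=0000:45:00.0                                                       # full PCI address of the H100, taken from lspci -d 10de: (illustrative)
sudo modprobe vfio-pci                                                     # make sure the vfio-pci driver is loaded
echo "$GPU_BDF" | sudo tee /sys/bus/pci/devices/$GPU_BDF/driver/unbind     # detach the GPU from its current driver (e.g. nvidia)
echo vfio-pci | sudo tee /sys/bus/pci/devices/$GPU_BDF/driver_override     # prefer vfio-pci for this device
echo "$GPU_BDF" | sudo tee /sys/bus/pci/drivers/vfio-pci/bind              # bind the GPU to vfio-pci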
By the way, are you working on ARM CCA? Nvidia's guide only lists Intel TDX and AMD SEV-SNP among the supported CPUs.
Sorry, I do not know what VFIO is, so I do not know how to pass the GPU through.
Moreover, yes, I am using ARM...
BTW, after a reboot and about an hour, nvidia-smi is working again. However, if I try to run a Docker container, for example, it does not seem to work properly:
ERROR: The NVIDIA Driver is present, but CUDA failed to initialize. GPU functionality will not be available.
[[ System not yet initialized (error 802) ]]
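For reference, this is the kind of error reported when launching a CUDA container while the GPU is in this state, e.g. with something like the following (the image tag is only an example):

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi   # example CUDA container; fails with error 802 here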
Nvidia Confidential Computing works with confidential virtual machines. You can refer to the deployment guide provided by Nvidia. The section "Identifying the GPUs to be Passed Through to the Guest" gives the VFIO-related commands.
But, according to the whitepaper, it seems that ARM CCA is supported.
OK. You could try using the H100 in a CCA VM. I guess H100 CC does not work on the host machine.
Since AMD SEV-SNP works, CCA should work. The Nvidia driver in the guest communicates with the GPU through a shared memory region, which is outside the enclave. There should be no difference in the usage of shared memory among these CVM architectures.
Sorry, just to be sure: in order to use CC mode, should I start following the guide from the point you mentioned, "Identifying the GPUs to be Passed Through to the Guest"? Because I started from the beginning, and I am encountering a lot of errors when preparing to build the kernel.
build.py...
: error 7000: Failed to execute command
make tbuild [/root/brunofolder/AMDSEV/ovmf/Build/OvmfX64/DEBUG_GCC5/X64/NetworkPkg/Library/DxeIpIoLib/DxeIpIoLib]
build.py...
: error 7000: Failed to execute command
make tbuild [/root/brunofolder/AMDSEV/ovmf/Build/OvmfX64/DEBUG_GCC5/X64/NetworkPkg/Library/DxeNetLib/DxeNetLib]
build.py...
: error F002: Failed to build module
/root/brunofolder/AMDSEV/ovmf/NetworkPkg/Library/DxeIpIoLib/DxeIpIoLib.inf [X64, GCC5, DEBUG]
- Failed -
Build end time: 14:04:45, Dec.29 2023
Build total time: 00:00:03
ERROR: nice build -q --cmd-len=64436 -DDEBUG_ON_SERIAL_PORT=TRUE -n 72 -t GCC5 -a X64 -p OvmfPkg/OvmfPkgX64.dsc
Are you booting an AMD SEV-SNP VM or an ARM CCA VM?
If you are booting a CCA VM, you should begin with "Identifying the GPUs to be Passed Through to the Guest". After you finish this step, add -device vfio-pci,host=09:01.0 to the qemu script, and then boot the VM.
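A minimal sketch of the resulting guest launch on an aarch64 host is shown below; the CCA-specific machine options are deliberately omitted because they depend on your CCA software stack, and the disk image path is an assumption:

qemu-system-aarch64 \
 -machine virt -cpu host -enable-kvm -nographic \
 -m 64G -smp 12 \
 -drive file=ubuntu22.04.qcow2,if=virtio,format=qcow2 \
 -device vfio-pci,host=09:01.0   # the H100 bound to vfio-pci; the domain-qualified form (e.g. 0009:01:00.0) may be needed; firmware/kernel options omitted for brevity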
So, basically the steps are the following:
sudo apt update
sudo apt install -y ninja-build iasl nasm flex bison openssl dkms autoconf zlib1g-dev python3-pip libncurses-dev libssl-dev libelf-dev libudev-dev libpci-dev libiberty-dev libtool libsdl-console libsdl-console-dev libpango1.0-dev libjpeg8-dev libpixman-1-dev libcairo2-dev libgif-dev libglib2.0-dev git-lfs jq qemu-system
sudo pip3 install numpy flex bison
mkdir my_folder
cd my_folder
git clone https://github.com/AMDESE/AMDSEV.git
git clone https://github.com/NVIDIA/nvtrust.git
lspci -d 10de:
sudo sh -c "echo 10de xxxx > /sys/bus/pci/drivers/vfio-pci/new_id"
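(To find the device ID that replaces xxxx, something like the following can be used; the 2342 value is only the device ID this particular H100 reports later in the thread:)

lspci -n -d 10de:                                                    # prints vendor:device pairs, e.g. "... 10de:2342 ..." for the H100
sudo sh -c "echo 10de 2342 > /sys/bus/pci/drivers/vfio-pci/new_id"   # substitute your own device ID for 2342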
Then run launch_vm.sh, which looks like this:
#
# Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
AMD_SEV_DIR=/shared/AMDSEV/snp-release-2023-07-18
VDD_IMAGE=/shared/nvtrust/host_tools/sample_kvm_scripts/images/ubuntu22.04.qcow2
NVIDIA_GPU=45:00.0
MEM=64 #in GBs
FWDPORT=9899
doecho=false
docc=true
while getopts "exp:" flag
do
case ${flag} in
e) doecho=true;;
x) docc=false;;
p) FWDPORT=${OPTARG};;
esac
done
NVIDIA_GPU=$(lspci -d 10de: | awk '/NVIDIA/{print $1}')
NVIDIA_PASSTHROUGH=$(lspci -n -s $NVIDIA_GPU | awk -F: '{print $4}' | awk '{print $1}')
if [ "$doecho" = true ]; then
echo 10de $NVIDIA_PASSTHROUGH > /sys/bus/pci/drivers/vfio-pci/new_id
fi
if [ "$docc" = true ]; then
USE_HCC=true
fi
$AMD_SEV_DIR/usr/local/bin/qemu-system-x86_64 \
${USE_HCC:+ -machine confidential-guest-support=sev0,vmport=off} \
${USE_HCC:+ -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1} \
-enable-kvm -nographic -no-reboot \
-cpu EPYC-v4 -machine q35 -smp 12,maxcpus=31 -m ${MEM}G,slots=2,maxmem=512G \
-drive if=pflash,format=raw,unit=0,file=$AMD_SEV_DIR/usr/local/share/qemu/OVMF_CODE.fd,readonly=on \
-drive file=$VDD_IMAGE,if=none,id=disk0,format=qcow2 \
-device virtio-scsi-pci,id=scsi0,disable-legacy=on,iommu_platform=true \
-device scsi-hd,drive=disk0 \
-device virtio-net-pci,disable-legacy=on,iommu_platform=true,netdev=vmnic,romfile= \
-netdev user,id=vmnic,hostfwd=tcp::$FWDPORT-:22 \
-device pcie-root-port,id=pci.1,bus=pcie.0 \
-device vfio-pci,host=$NVIDIA_GPU,bus=pci.1 \
-device vfio-pci,host=09:01.0 \
-fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=262144
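For reference, given the getopts block in this script, an invocation could look like the following (sudo because the script writes to sysfs):

sudo ./launch_vm.sh -e -p 9899    # -e writes the GPU's vendor/device ID into vfio-pci's new_id, -p sets the SSH forward port
sudo ./launch_vm.sh -x            # -x skips the confidential-computing (SEV-SNP) options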
The -object sev-snp-guest option implies that the script is for AMD SEV-SNP VMs.
And are the other variables correct? I mean these: AMD_SEV_DIR, VDD_IMAGE, NVIDIA_GPU, MEM, FWDPORT
I am sure that NVIDIA_GPU needs to be 0009:01:00.0, according to the output of lspci -d 10de:, which is:
0000:00:00.0 PCI bridge: NVIDIA Corporation Device 22b2
0002:00:00.0 PCI bridge: NVIDIA Corporation Device 22b2
0004:00:00.0 PCI bridge: NVIDIA Corporation Device 22b2
0005:00:00.0 PCI bridge: NVIDIA Corporation Device 22b8
0006:00:00.0 PCI bridge: NVIDIA Corporation Device 22b2
0008:00:00.0 PCI bridge: NVIDIA Corporation Device 22b9
0009:00:00.0 PCI bridge: NVIDIA Corporation Device 22b1
0009:01:00.0 3D controller: NVIDIA Corporation Device 2342 (rev a1)
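(Note that every line of this output contains "NVIDIA", so the script's awk '/NVIDIA/{print $1}' filter would also match the PCI bridges; a hedged adjustment that keeps only the 3D controller entry could be:)

NVIDIA_GPU=$(lspci -d 10de: | awk '/3D controller/{print $1}')    # yields 0009:01:00.0 for the listing above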
But what about the other variables?
If you target CCA VMs, you should first prepare a qemu script for booting a CCA VM without the H100.
The qemu script provided by Nvidia can be divided into two parts: the part that boots the VM, and -device vfio-pci,host=$NVIDIA_GPU,bus=pci.1 to pass through the H100.
Sorry, but I have no idea. I have never worked on this stuff.
This is my actual launch_vm.sh, which is inside /nvtrust/host_tools/sample_kvm_scripts.
#
# Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
AMD_SEV_DIR=/root/brunofolder/AMDSEV/snp-release-2023-07-18
VDD_IMAGE=/shared/nvtrust/host_tools/sample_kvm_scripts/images/ubuntu22.04.qcow2
#Hardware Settings
NVIDIA_GPU=09:01.0
MEM=64 #in GBs
FWDPORT=9899
doecho=false
docc=true
while getopts "exp:" flag
do
case ${flag} in
e) doecho=true;;
x) docc=false;;
p) FWDPORT=${OPTARG};;
esac
done
NVIDIA_GPU=$(lspci -d 10de: | awk '/NVIDIA/{print $1}')
NVIDIA_PASSTHROUGH=$(lspci -n -s $NVIDIA_GPU | awk -F: '{print $4}' | awk '{print $1}')
if [ "$doecho" = true ]; then
echo 10de $NVIDIA_PASSTHROUGH > /sys/bus/pci/drivers/vfio-pci/new_id
fi
if [ "$docc" = true ]; then
USE_HCC=true
fi
$AMD_SEV_DIR/usr/local/bin/qemu-system-x86_64 \
${USE_HCC:+ -machine confidential-guest-support=sev0,vmport=off} \
${USE_HCC:+ -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1} \
-enable-kvm -nographic -no-reboot \
-cpu EPYC-v4 -machine q35 -smp 12,maxcpus=31 -m ${MEM}G,slots=2,maxmem=512G \
-drive if=pflash,format=raw,unit=0,file=$AMD_SEV_DIR/usr/local/share/qemu/OVMF_CODE.fd,readonly=on \
-drive file=$VDD_IMAGE,if=none,id=disk0,format=qcow2 \
-device virtio-scsi-pci,id=scsi0,disable-legacy=on,iommu_platform=true \
-device scsi-hd,drive=disk0 \
-device virtio-net-pci,disable-legacy=on,iommu_platform=true,netdev=vmnic,romfile= \
-netdev user,id=vmnic,hostfwd=tcp::$FWDPORT-:22 \
-device pcie-root-port,id=pci.1,bus=pcie.0 \
-device vfio-pci,host=$NVIDIA_GPU,bus=pci.1 \
-fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=262144
How can I split this?
ARM CCA is currently not supported. That whitepaper is more forward-looking, i.e., there will come a day when it is supported.
This repo does not provide scripts for ARM CCA.
To run H100 CC on ARM, you should:
- Check whether the CPU supports CCA. If not, Nvidia CC cannot work on the machine.
- Then, prepare a script to run a CCA VM.
- Add -device vfio-pci,host=$NVIDIA_GPU,bus=pci.1 to the script.
@Tan-YiFan One interesting thing is that CCA hasn't been ratified yet (as far as I know), and NVIDIA is blocked due to upstream updates.
@CasellaJr If you want to use the CC functionalities of H100 GPUs, you must have an AMD SEV-SNP or Intel TDX capable machine available on your side and configure it according to NVIDIA's manual. The deployment guide only mentions TDX and SNP. So, you cannot use H100 + CC on an unsupported machine, but you can certainly utilize its computing resources to, e.g., train models more productively.
If you enable CC mode for the H100, it will block all host IO requests, since the H100 assumes the host platform is completely untrusted.
Also, there is one thing we must be aware of: the H100's CC functionalities are still in the early-access phase, and many features are subject to frequent change. I believe CCA will be supported in the near future :)
Hi everyone. I now have access to another machine with an H100 and an Intel Xeon Gold 6438Y+ processor, which seems to support TDX: https://www.intel.com/content/www/us/en/products/sku/232382/intel-xeon-gold-6438y-processor-60m-cache-2-00-ghz/specifications.html However, this machine runs AlmaLinux, while the deployment guide refers to Ubuntu. Do you think the guide is compatible?
@CasellaJr The "Setting Up the Host OS (Intel TDX)" section would be incompatible. Does your machine support rhel-8? If so, you can refer to https://github.com/intel/tdx-tools/tree/2023ww15/build/rhel-8 for installing the host OS. The other steps are compatible (you can boot an Ubuntu guest VM).
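A sketch of fetching that tree is below; the ref name is taken from the URL, and the actual installation steps are documented inside that directory:

git clone -b 2023ww15 https://github.com/intel/tdx-tools.git    # ref name from the URL above
cd tdx-tools/build/rhel-8                                       # RHEL-8 host build scripts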
Hello.
I enabled CC mode with the provided gpu_cc_tool.py by running python3 ./gpu_cc_tool.py --gpu-name=H100 --set-cc-mode=on --reset-after-cc-mode-switch. Output:
Then, when running nvidia-smi, I get the following error:
Unable to determine the device handle for GPU0009:01:00.0: Unknown Error
My architecture:
Architecture:            aarch64
CPU op-mode(s):          64-bit
Byte Order:              Little Endian
CPU(s):                  72
On-line CPU(s) list:     0-71
Vendor ID:               ARM
Model:                   0
Thread(s) per core:      1
Core(s) per socket:      72
Socket(s):               1
Stepping:                r0p0
Frequency boost:         disabled
CPU max MHz:             3447.0000
CPU min MHz:             81.0000
BogoMIPS:                2000.00
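(For what it's worth, a way to inspect and revert the mode from the host is sketched below; the --query-cc-mode flag is assumed to be available in this version of gpu_cc_tool.py:)

python3 ./gpu_cc_tool.py --gpu-name=H100 --query-cc-mode                                  # assumed flag: report the current CC mode
python3 ./gpu_cc_tool.py --gpu-name=H100 --set-cc-mode=off --reset-after-cc-mode-switch   # revert to CC off, flags as used above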
Even with this error, switching CC mode off was still working. Unfortunately, I do not have the output, because I did sudo reboot. After the reboot:
No devices were found
After this I ran nvidia-bug-report.sh, which is attached to this discussion. Finally, below is the output of ubuntu-drivers devices:
nvidia-bug-report.log.gz