AMDESE / AMDSEV

AMD Secure Encrypted Virtualization

PSC failed with Nvidia GPU #174

Open wdsun1008 opened 11 months ago

wdsun1008 commented 11 months ago

I'm trying to use SNP with Nvidia GPU passthrough. The relevant virt-install options I'm using are:

--host-device=pci_0000_25_00_0 \
--qemu-commandline="-cpu EPYC-v4 -machine memory-encryption=sev0,vmport=off -object memory-backend-memfd-private,id=ram1,size=${ITEMS[2]}M,share=true -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,${DIGEST},auth-key-enabled=on,discard=both,kernel-hashes=on -machine memory-backend=ram1,kvm-type=protected" 

Then I got a guest kernel "PSC failed" error and the VM hung forever:

SNP: PSC failed ret=0 exit_info_2=100000

VMs without GPU passthrough and VMs without SNP both work fine. After I changed discard=both to discard=none, the problem went away. Are there any docs about UPM? I'm curious about the differences between these discard modes. I've built some demos before that used a private memfd and a libOS to make a purely software TEE, but I'm not sure how UPM and SNP relate now.
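For anyone reproducing this, the only reported difference between the failing and the working configuration is the `discard` value on the `sev-snp-guest` object. Below is a minimal Python sketch that assembles the same `--qemu-commandline` string with the discard mode as a parameter. The function name and the `mem_mb` and `digest` parameters (standing in for `${ITEMS[2]}` and `${DIGEST}`) are hypothetical, and the option strings are copied from the command above rather than checked against QEMU documentation.

```python
def build_qemu_commandline(mem_mb: int, digest: str, discard: str = "none") -> str:
    """Assemble the --qemu-commandline value from the report above.

    mem_mb and digest stand in for ${ITEMS[2]} and ${DIGEST}; both are
    hypothetical parameter names. Only "both" and "none" appear in this
    thread, so no other discard value is accepted here.
    """
    if discard not in ("both", "none"):
        raise ValueError(f"discard mode not mentioned in the thread: {discard!r}")
    parts = [
        "-cpu EPYC-v4",
        "-machine memory-encryption=sev0,vmport=off",
        f"-object memory-backend-memfd-private,id=ram1,size={mem_mb}M,share=true",
        # The digest options are passed through unchanged, as in the original command.
        "-object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,"
        f"{digest},auth-key-enabled=on,discard={discard},kernel-hashes=on",
        "-machine memory-backend=ram1,kvm-type=protected",
    ]
    return " ".join(parts)


if __name__ == "__main__":
    # Placeholder values; substitute the real memory size and digest options.
    print(build_qemu_commandline(4096, "<DIGEST>", discard="none"))
```

Keeping the discard mode as a single argument makes it easy to toggle between the two values when comparing guest behaviour.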

mdroth commented 11 months ago

Do you see any messages from the host kernel or QEMU when the failure occurs?

wdsun1008 commented 11 months ago

> Do you see any messages from the host kernel or QEMU when the failure occurs?

I didn't see any messages. After switching to discard=none, though, I ran into another problem: nvidia-smi, TensorFlow, and PyTorch all recognize the GPU, but as soon as I use CUDA the process freezes with CPU usage at 100% and GPU usage at 0%. I noticed a Kata Containers issue (#78) which mentions that UPM and GPU are currently not compatible, so I rolled back to the 5.19-rc6 kernel and snp-v3 QEMU and started the VM the same way as before. This time, running nvidia-smi crashed the host kernel and I had to reboot the physical machine. I'm not sure what the cause is and was wondering if @zvonkok could give me some advice.
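When reproducing a freeze like this inside the guest, it can help to run the CUDA check in a child process with a timeout, so that a hang is reported instead of blocking the shell. The following is a minimal sketch using only the Python standard library; the helper name `run_with_timeout` is hypothetical, and a process stuck inside the kernel may not respond to the kill that follows the timeout.

```python
import subprocess
import sys


def run_with_timeout(cmd: list[str], timeout_s: float) -> str:
    """Run cmd and classify the outcome.

    Returns "ok" on exit code 0, "failed" on a non-zero exit code, and
    "hung" if the command is still running after timeout_s seconds
    (subprocess.run kills the child before raising TimeoutExpired).
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        return "hung"
    return "ok" if result.returncode == 0 else "failed"
```

For example, assuming PyTorch is installed in the guest, `run_with_timeout([sys.executable, "-c", "import torch; torch.zeros(1).cuda()"], 60)` would return `"hung"` if the first CUDA call does not finish within a minute.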

zvonkok commented 11 months ago

@wdsun1008 Please see early access for CC GPU: https://github.com/nvidia/nvtrust. There is a deployment guide with all bits and pieces.

Most importantly, you need the correct drivers and HW, especially a Hopper GPU. Feel free to ping me if you have any questions.

DMAs are fully untrusted in a TEE, hence anything will fail with non-CC GPUs.

If you need confidential container support, we will release all the needed bits by EOW; ping me on the Kata or Confidential Containers Slack. My Slack handle is the same as my GitHub handle.

wdsun1008 commented 11 months ago

@zvonkok Thanks so much! Is there any way to run a non-CC GPU with a CVM?

zvonkok commented 11 months ago

@wdsun1008

DMAs are fully untrusted in a TEE, hence anything will fail with non-CC GPUs.

wdsun1008 commented 11 months ago

> @wdsun1008
>
> DMAs are fully untrusted in a TEE, hence anything will fail with non-CC GPUs.

OK, I got it. I think it would be useful to combine CVM with non-CC GPUs. It may not be entirely safe, but it could be considered as an option to make CVM more widely used.