Open wdsun1008 opened 11 months ago
Do you see any messages from the host kernel or QEMU when the failure occurs?
I didn't see any messages. However, after switching to discard=none I ran into another problem: nvidia-smi, TensorFlow, and PyTorch all recognize the GPU, but as soon as I use CUDA it freezes, with CPU usage at 100% and GPU usage at 0%. I noticed an issue from Kata Containers, #78, which mentions that UPM and GPUs are currently not compatible. So I rolled back to the 5.19 rc6 kernel and snp-v3 QEMU and started the VM the same way as before. This time, when I ran nvidia-smi, the host kernel crashed and I had to restart the physical machine. I'm not sure what the cause is and was wondering if @zvonkok could give me some advice.
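For anyone trying to reproduce the freeze described above, it can help to run the CUDA call in a child process with a timeout, so a hang shows up as a result instead of locking up the shell. This is a minimal sketch, not a diagnostic for the underlying cause. It assumes PyTorch is installed in the guest, and the helper name `probe` and the constant `CUDA_SMOKE_TEST` are hypothetical names chosen for illustration.

```python
import subprocess
import sys


def probe(cmd, timeout_s):
    """Run cmd in a child process and classify it as 'ok', 'error', or 'hang'."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The child did not finish in time; subprocess.run has already killed it.
        return "hang"
    return "ok" if result.returncode == 0 else "error"


# Assumption: PyTorch is available in the guest. The child allocates a small
# tensor on the GPU and reduces it, which forces real CUDA work to happen.
CUDA_SMOKE_TEST = [
    sys.executable,
    "-c",
    "import torch; x = torch.ones(1024, device='cuda'); print(float(x.sum()))",
]
```

Calling `probe(CUDA_SMOKE_TEST, timeout_s=60)` inside the VM should return `"hang"` if CUDA freezes as described. One caveat: if the child is stuck in an uninterruptible kernel call, killing it may itself block, so the timeout is not a guarantee.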
@wdsun1008 Please see early access for CC GPU: https://github.com/nvidia/nvtrust. There is a deployment guide with all bits and pieces.
Most importantly, you need the correct drivers and hardware, in particular a Hopper GPU. Feel free to ping me if you have any questions.
DMAs are fully untrusted in a TEE, hence anything will fail with non-CC GPUs.
If you need confidential container support, we will release all the needed bits by the end of the week; ping me on the Kata or Confidential Containers Slack. My Slack handle is the same as my GitHub handle.
@zvonkok Thanks so much! Is it possible to run a non-CC GPU with a CVM?
@wdsun1008
DMAs are fully untrusted in a TEE, hence anything will fail with non-CC GPUs.
OK, I got it. I think it would be useful to combine CVM with non-CC GPUs. It may not be entirely safe, but it could be considered as an option to make CVM more widely used.
I'm trying to use SNP with Nvidia GPU passthrough. The virt-install command I'm using is:
Then I got a guest kernel "PSC failed" error and the VM hung forever:
A VM without GPU passthrough and a VM without SNP both work fine. Then I changed discard=both to discard=none and the problem was solved. Are there any docs about UPM? I'm curious about the differences between these discard modes. I've done some demos before, using a private memfd and a libOS to build a purely software TEE, but I'm not sure how UPM and SNP relate to each other now.