Unsupported exit-code 0x404 in #VC exception

wdsun1008 commented 10 months ago

I am trying to get SNP + Nvidia GPU + CUDA working. I have successfully passed through the device and can use the nvidia-smi tool. However, when I try to move a tensor to the GPU using .to("cuda") in PyTorch, the program crashes with a SIGBUS and the kernel shows the error message "SEV: Unsupported exit-code 0x404 in #VC exception (IP: 0x7f7e6bbb2bc0)".

What could be the possible reasons for this issue? On the host I'm using the sev-snp-iommu-avic_5.19-rc6_v4 kernel plus the patch from https://github.com/AMDESE/AMDSEV/issues/109, with which the host kernel no longer panics. In the guest I'm using Ubuntu 22.10 with a 5.19 kernel.

tlendacky commented 10 months ago

A 0x404 error code indicates that a page is mapped with the encryption bit set, but the page hasn't been validated. It sounds like the backing memory is actually shared memory, but since you are in userspace (based on the IP address shown), all memory is referenced as encrypted and so a #VC with error code 0x404 is generated when accessing the memory.
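As a rough illustration of the "based on the IP address shown" reasoning, here is a minimal Python sketch that pulls the exit code and instruction pointer out of the kernel message and checks whether the IP falls in the user half of the address space. The helper name `parse_vc_message` is made up for this example, and the boundary constant assumes x86-64 with 4-level paging (47-bit user addresses); with 5-level paging the limit would be higher.

```python
import re

# Highest user-space address on x86-64 with 4-level paging (assumption:
# the guest does not use 5-level paging, where the limit is larger).
USER_SPACE_MAX = 0x0000_7FFF_FFFF_FFFF

# Matches the message format quoted in this issue.
LOG_PATTERN = re.compile(
    r"Unsupported exit-code (0x[0-9a-fA-F]+) in #VC exception "
    r"\(IP: (0x[0-9a-fA-F]+)\)"
)


def parse_vc_message(line):
    """Hypothetical helper: return exit code, IP and a user-space flag."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    exit_code = int(match.group(1), 16)
    ip = int(match.group(2), 16)
    return {
        "exit_code": exit_code,
        "ip": ip,
        "in_userspace": ip <= USER_SPACE_MAX,
    }


if __name__ == "__main__":
    msg = "SEV: Unsupported exit-code 0x404 in #VC exception (IP: 0x7f7e6bbb2bc0)"
    print(parse_vc_message(msg))
```

For the message in this issue the IP lands in the lower half, which is consistent with the fault being raised by a user-space access rather than by the kernel driver.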

wdsun1008 commented 10 months ago

> A 0x404 error code indicates that a page is mapped with the encryption bit set, but the page hasn't been validated. It sounds like the backing memory is actually shared memory, but since you are in userspace (based on the IP address shown), all memory is referenced as encrypted and so a #VC with error code 0x404 is generated when accessing the memory.

So it looks like the Nvidia kernel driver and the GPU device should be able to access the shared memory, but the CUDA library accessing the address from userspace causes the error? Is there any way to solve this problem?

wdsun1008 commented 10 months ago

@tlendacky I have another question: After reading amd.com/system/files/TechDocs/24593.pdf, I came across the relevant information about the RMP memory check in Table 15-39. If the C-bit is 0, the address can theoretically be accessed without triggering a #VC. Regarding the mentioned "shared memory," is it allocated by dma_alloc_coherent? How can I allocate non-encrypted memory on the device with the C-bit cleared?

tlendacky commented 10 months ago

> A 0x404 error code indicates that a page is mapped with the encryption bit set, but the page hasn't been validated. It sounds like the backing memory is actually shared memory, but since you are in userspace (based on the IP address shown), all memory is referenced as encrypted and so a #VC with error code 0x404 is generated when accessing the memory.

> So it looks like the Nvidia kernel driver and the GPU device should be able to access the shared memory, but the CUDA library accessing the address from userspace causes the error? Is there any way to solve this problem?

I don't have any experience with the cuda library, so I don't know how it is getting access to the memory range. There's always a possibility that the proper mapping could be created, but I don't know enough to advise on that.

> @tlendacky I have another question: After reading amd.com/system/files/TechDocs/24593.pdf, I came across the relevant information about the RMP memory check in Table 15-39. If the C-bit is 0, the address can theoretically be accessed without triggering a #VC. Regarding the mentioned "shared memory," is it allocated by dma_alloc_coherent? How can I allocate non-encrypted memory on the device with the C-bit cleared?

It all depends. If the device is issuing dma_alloc_coherent(), then yes, that memory will be marked shared/un-encrypted in the kernel. If the device is doing a dma_map_page(), then the swiotlb is used to bounce the data from the (likely) encrypted buffer being mapped into the un-encrypted swiotlb. So it goes back to what memory you are trying to access, where that access is happening and how the mapping is being created.

wdsun1008 commented 10 months ago

Thanks so much for your reply. I tested SEV and the Nvidia 4090+ with the latest driver. The CUDA library does not throw any errors, but the data transferred to the GPU appears to be encrypted. In the open-source code of the Nvidia driver, I noticed that they use dma_alloc_coherent for UVM. I will conduct further tests to see if it works properly in the UVM scenario.

zvonkok commented 10 months ago

NVIDIA released an early-access version of the Confidential Compute stack, enabling H100 GPUs with SEV-SNP: https://www.nvidia.com/en-us/data-center/solutions/confidential-computing/ You need the proper HW and SW to make SEV-SNP work.

Tan-YiFan commented 10 months ago

I do not have access to a CVM+GPU setup, but here are some observations from the source code and documentation of CVMs and H100 confidential computing:

- A GPU cannot launch DMA into CVM private memory.
- The Nvidia driver does not activate CC (Confidential Computing) mode if the GPU is not in the H100 series (Hopper architecture).
- .to("cuda") in PyTorch is implemented by calling the cudaMemcpy CUDA API. In this memcpy, the source address is CPU memory and the destination is GPU memory.

My guess is that the #VC is caused by the GPU launching DMA from CVM private memory:

- The Nvidia driver does not activate CC mode.
- The CUDA driver treats .to("cuda") as non-CC execution, which causes DMA to private memory.

Possible solution: run some C code using different CUDA APIs to find out the proper DMA method:

- cudaMallocManaged (UVM)
- cudaMalloc + cudaMemcpy
- cudaMallocHost + cudaMemcpyAsync (locked pages on private memory)
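
The three combinations listed above could be exercised one at a time with a small harness like the one below. This is only a sketch, and it departs from the suggestion in one respect: it drives the CUDA runtime from Python through ctypes instead of C. Each probe runs in a forked child so that a SIGBUS (as reported in this issue) kills only the child and is reported by the parent. The names `run_isolated`, `probe_managed`, `probe_pageable`, and `probe_pinned` are invented for this example; the sketch also assumes Linux and that `libcudart` can be located by `ctypes.util.find_library`, which may not hold on every installation. It has not been run inside an SEV-SNP guest.

```python
import ctypes
import ctypes.util
import os
import signal

SIZE = 1 << 20             # 1 MiB test buffer
MEMCPY_HOST_TO_DEVICE = 1  # value of cudaMemcpyHostToDevice
MEM_ATTACH_GLOBAL = 1      # value of cudaMemAttachGlobal


def load_cudart():
    """Load the CUDA runtime library (assumed to be discoverable)."""
    name = ctypes.util.find_library("cudart")
    if name is None:
        raise OSError("libcudart not found")
    return ctypes.CDLL(name)


def check(status):
    if status != 0:  # 0 is cudaSuccess
        raise RuntimeError(f"CUDA runtime call failed with error {status}")


def probe_managed():
    """cudaMallocManaged (UVM), then touch the buffer from the CPU."""
    rt = load_cudart()
    ptr = ctypes.c_void_p()
    check(rt.cudaMallocManaged(ctypes.byref(ptr), ctypes.c_size_t(SIZE),
                               ctypes.c_uint(MEM_ATTACH_GLOBAL)))
    ctypes.memset(ptr, 0xAB, SIZE)
    check(rt.cudaDeviceSynchronize())


def probe_pageable():
    """cudaMalloc + cudaMemcpy from an ordinary (pageable) host buffer."""
    rt = load_cudart()
    host = ctypes.create_string_buffer(SIZE)
    dev = ctypes.c_void_p()
    check(rt.cudaMalloc(ctypes.byref(dev), ctypes.c_size_t(SIZE)))
    check(rt.cudaMemcpy(dev, host, ctypes.c_size_t(SIZE),
                        MEMCPY_HOST_TO_DEVICE))


def probe_pinned():
    """cudaMallocHost + cudaMemcpyAsync on the default stream."""
    rt = load_cudart()
    host, dev = ctypes.c_void_p(), ctypes.c_void_p()
    check(rt.cudaMallocHost(ctypes.byref(host), ctypes.c_size_t(SIZE)))
    check(rt.cudaMalloc(ctypes.byref(dev), ctypes.c_size_t(SIZE)))
    ctypes.memset(host, 0xAB, SIZE)
    check(rt.cudaMemcpyAsync(dev, host, ctypes.c_size_t(SIZE),
                             MEMCPY_HOST_TO_DEVICE, None))
    check(rt.cudaDeviceSynchronize())


def run_isolated(fn):
    """Run fn in a forked child; report success, failure, or fatal signal."""
    pid = os.fork()
    if pid == 0:
        code = 1
        try:
            fn()
            code = 0
        except Exception:
            code = 1
        finally:
            # Buffers are not freed explicitly; the child exits right away.
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return "killed by " + signal.Signals(os.WTERMSIG(status)).name
    return "ok" if os.WEXITSTATUS(status) == 0 else "failed"


if __name__ == "__main__":
    for probe in (probe_managed, probe_pageable, probe_pinned):
        print(f"{probe.__name__}: {run_isolated(probe)}")
```

A probe reported as "killed by SIGBUS" would point at the same failure as the original .to("cuda") call, while "failed" only means that a runtime call returned an error or the library could not be loaded. A clean "ok" does not by itself show that the GPU received plaintext data.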

JaewonHur commented 8 months ago

@wdsun1008 Did you resolve this issue? I am trying to use a non-CC Nvidia GPU with AMD SEV-SNP, and I'm receiving the same 0x404 #VC exception as in this issue. Has there been any progress?

wdsun1008 commented 8 months ago

> @wdsun1008 Did you resolve this issue? I am trying to use a non-CC Nvidia GPU with AMD SEV-SNP, and I'm receiving the same 0x404 #VC exception as in this issue. Has there been any progress?

I have not been successful in getting the non-cc mode GPU and AMD SEV to work properly together. From my understanding, the 0x404 error arises when memory mappings created in the user space are accessed by DMA, causing problems. Typically, there's no such check in a non-snp environment. However, user-space mappings are encrypted by default, which means the GPU receives cipher-text data, and therefore cannot operate normally.

A possible fix to this issue might involve having user-space memory created by CUDA libraries or other frameworks mapped as non-encrypted. I'm uncertain whether decryption can be performed on the relevant memory during the bounce buffer's handling process.

I intend to continue my research into this matter and to maintain communication with other developers who are interested in this issue. Please do not hesitate to contact me if there are any updates or concerns surrounding this topic.

JaewonHur commented 4 months ago

For those who want to use Nvidia GPUs in SEV-SNP VMs, please refer to sev-snp-gpu.

AMDESE / AMDSEV

Unsupported exit-code 0x404 in #VC exception #177