ROCm / ROCK-Kernel-Driver

AMDGPU driver with KFD, used by the ROCm project. Also contains the current Linux kernel that matches this base driver.

[Question]: Does amdgpu support PCIe p2p dma copy with FPGA? #159

Open littlewu2508 opened 8 months ago

littlewu2508 commented 8 months ago

Problem Description

I'm currently interested in p2p data transfer from an FPGA (Xilinx Alveo U50) to an AMD GPU. There is already an implementation for FPGA-to-NVIDIA-GPU transfers at https://github.com/RC4ML/FpgaNIC, using https://github.com/NVIDIA/gdrcopy, and in the past there was research [1,2] achieving this with DirectGMA. But DirectGMA is now deprecated along with the proprietary fglrx driver. With the open source amdgpu driver, is there any similar method?

[1] http://dx.doi.org/10.1088/1748-0221/11/02/P02007 [2] http://dx.doi.org/10.1088/1748-0221/12/03/C03015

I read some of the source code related to p2p DMA copies in https://github.com/ROCm/ROCR-Runtime/blob/master/src/core/runtime/ and https://github.com/ROCm/ROCT-Thunk-Interface/tree/master/tests/kfdtest/, and it seems that all the userspace DMA copies go through the HSA driver. But as far as I know there is currently no HSA support for Xilinx Alveo cards (maybe there is ongoing work), so I wonder whether p2p DMA between the FPGA and the AMD GPU is possible at all.

I also raised this question in https://github.com/openucx/ucx/issues/9598 and found that Xilinx Alveo cards support PCIe p2p DMA via OpenCL on XRT (a rough sketch of that path is below). Does that mean I can use OpenCL to achieve p2p between the FPGA and the AMD GPU? However, as I understand it, rocm-opencl-runtime is also based on HSA.
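For reference, my understanding of the XRT side is roughly the following (untested on my end; the function and variable names are placeholders, and it assumes the Alveo shell has PCIe p2p enabled). It uses the Xilinx OpenCL extensions `CL_MEM_EXT_PTR_XILINX` / `XCL_MEM_EXT_P2P_BUFFER` to allocate a device buffer that is exposed over the card's PCIe BAR and then mapped into the host address space:

```cpp
// Sketch only: allocate a p2p buffer on the Alveo card and map it into the
// host address space using XRT's OpenCL extensions. Error handling is omitted.
#include <CL/cl.h>
#include <CL/cl_ext_xilinx.h>

void* map_p2p_buffer(cl_context ctx, cl_command_queue queue,
                     size_t size, cl_mem* out_buf) {
    cl_mem_ext_ptr_t ext = {0};
    ext.flags = XCL_MEM_EXT_P2P_BUFFER;  // ask XRT for a PCIe-BAR-backed buffer
    ext.obj   = nullptr;
    ext.param = nullptr;

    cl_int err = CL_SUCCESS;
    cl_mem buf = clCreateBuffer(ctx,
                                CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX,
                                size, &ext, &err);

    // Mapping the p2p buffer yields a host virtual address that is backed by
    // FPGA device memory exposed through the card's PCIe BAR.
    void* host_ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                        CL_MAP_READ | CL_MAP_WRITE,
                                        0, size, 0, nullptr, nullptr, &err);
    *out_buf = buf;
    return host_ptr;
}
```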

Operating System

Debian 12

CPU

AMD EPYC 7702 64-Core Processor

GPU

AMD Instinct MI100

ROCm Version

ROCm 5.7.1

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

ppanchad-amd commented 1 month ago

@littlewu2508 Internal ticket has been created to assist with your question. Thanks!

tcgu-amd commented 1 week ago

Hi @littlewu2508 I think that, currently, achieving p2p between a Xilinx FPGA and an AMD GPU is not directly supported. However, one potential workaround is to create a p2p buffer on the FPGA, then map it into the host memory space following this documentation. Then, register the base pointer of the mapped p2p buffer with the GPU device using something like hipHostRegister(). Afterwards, we can use hipHostGetDevicePointer() to obtain a device pointer through which one may potentially interact with the FPGA directly.

Now, I haven't tested this myself since we currently don't have a test setup specifically for this configuration. However, this should work in theory. Please let me know if this sounds reasonable and if it works on your end.
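As a minimal, untested sketch of the HIP side of this workaround: assume `fpga_host_ptr` is the host virtual address of the mapped FPGA p2p buffer obtained from XRT (that name, and the use of the hipHostRegisterIoMemory flag for BAR-backed memory, are my assumptions rather than a verified recipe):

```cpp
// Untested sketch: register the host-mapped FPGA p2p region with HIP and
// obtain a device pointer for it. fpga_host_ptr / size come from the XRT
// mapping step; hipHostRegisterIoMemory is an assumption for BAR-backed memory.
#include <hip/hip_runtime.h>
#include <cstdio>

#define HIP_CHECK(expr)                                                   \
    do {                                                                  \
        hipError_t e_ = (expr);                                           \
        if (e_ != hipSuccess)                                             \
            std::fprintf(stderr, "HIP error %s at %s:%d\n",               \
                         hipGetErrorString(e_), __FILE__, __LINE__);      \
    } while (0)

void* register_fpga_buffer(void* fpga_host_ptr, size_t size) {
    // Register the already-mapped FPGA BAR region with the HIP runtime.
    HIP_CHECK(hipHostRegister(fpga_host_ptr, size,
                              hipHostRegisterMapped | hipHostRegisterIoMemory));

    // Ask for the device-visible pointer that aliases the registered region.
    void* dev_ptr = nullptr;
    HIP_CHECK(hipHostGetDevicePointer(&dev_ptr, fpga_host_ptr, 0 /*flags*/));
    return dev_ptr;  // usable from kernels or hipMemcpy, if the mapping succeeds
}
```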

Thanks!

littlewu2508 commented 6 days ago

> Hi @littlewu2508 I think that, currently, achieving p2p between a Xilinx FPGA and an AMD GPU is not directly supported. However, one potential workaround is to create a p2p buffer on the FPGA, then map it into the host memory space following this documentation. Then, register the base pointer of the mapped p2p buffer with the GPU device using something like hipHostRegister(). Afterwards, we can use hipHostGetDevicePointer() to obtain a device pointer through which one may potentially interact with the FPGA directly.
>
> Now, I haven't tested this myself since we currently don't have a test setup specifically for this configuration. However, this should work in theory. Please let me know if this sounds reasonable and if it works on your end.
>
> Thanks!

Thank you very much for the idea! I will try it out when I have time.

Also, will data go through host memory in this workaround? The motivation for the p2p DMA copy is to increase the bandwidth and lower the latency of data transfers.

tcgu-amd commented 6 days ago

> Also, will data go through host memory in this workaround? The motivation for the p2p DMA copy is to increase the bandwidth and lower the latency of data transfers.

@littlewu2508 If everything works out, the GPU should be accessing the FPGA memory directly, without going through a host buffer. The idea is that hipHostRegister() will coordinate with the OS to translate and pin the target host memory, which in our case is the virtually mapped FPGA memory. This involves a virtual-to-physical address translation in the OS, after which the GPU is given a device pointer that corresponds to the physical address (which is on the FPGA). The GPU can then access the FPGA memory directly with the help of the PCIe controller. The main troublesome part is the address translation in the OS, because that involves the GPU and FPGA drivers as well as the PCIe subsystem.
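To illustrate how the GPU could then touch the FPGA memory directly, here is a hypothetical usage sketch. It assumes the registration step above succeeded and that `fpga_dev_ptr` is the device pointer returned by hipHostGetDevicePointer(); whether the transfer ends up as a true p2p DMA or bounces through host memory depends on the runtime, the drivers, and the PCIe topology:

```cpp
// Hypothetical, untested sketch: push data from GPU memory into the registered
// FPGA region. fpga_dev_ptr is assumed to come from hipHostGetDevicePointer().
#include <hip/hip_runtime.h>
#include <cstdint>

__global__ void fill_pattern(uint32_t* dst, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = 0xA5A5A5A5u;  // GPU threads write through the device pointer
}

void push_to_fpga(void* fpga_dev_ptr, const void* gpu_src, size_t bytes) {
    // Option 1: let a kernel write through the device pointer.
    size_t n = bytes / sizeof(uint32_t);
    fill_pattern<<<(n + 255) / 256, 256>>>(static_cast<uint32_t*>(fpga_dev_ptr), n);

    // Option 2: a device-to-device copy; whether this is a single p2p DMA or a
    // staged copy through host memory depends on the platform.
    (void)hipMemcpy(fpga_dev_ptr, gpu_src, bytes, hipMemcpyDeviceToDevice);

    (void)hipDeviceSynchronize();
}
```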