alexforencich / verilog-pcie

Verilog PCI express components
MIT License

AXI dma client support? #37

Closed chenbo-again closed 1 year ago

chenbo-again commented 1 year ago

Hi Alex,

I want to add AXI DMA client support. I think you would recommend implementing something like dma_client_axi_sink/dma_client_axi_source. Is that true?

As far as I know, there is also the option of using pcie_us_axi_dma, but you do not recommend it. In #31, you said: "I don't recommend using it, as it only supports Xilinx US/US+ and it has some significant performance limitations due to how AXI works."

I have 3 questions:

  1. Why does it only support Xilinx US/US+ rather than an FPGA-independent interface?
  2. Why does it have significant performance limitations? Can you explain in detail?
  3. What approach would you recommend? (If possible, could you analyze the pros and cons?)
alexforencich commented 1 year ago
  1. because I wrote it before I came up with the segmented interface and the FPGA-independent TLP interface, and since the performance is terrible there is no reason to update it. TBH, I should probably just delete it completely. Basically I wrote the US+ PCIe AXI DMA module first, then the US+ PCIe DMA interface with the segmented interface, then the FPGA-independent PCIe DMA interface. Only the most recent one (PCIe DMA interface) is recommended for use, the others are deprecated.
  2. partially because of the interaction between AXI and PCIe, and partially due to how AXI itself works. The first issue has to do with issuing PCIe write requests. The AXI read data has to arrive in order, which means the DMA engine is limited to using a single AXI ID to prevent read data interleaving, and this seriously reduces the throughput. Additionally, due to how the data gets packed into PCIe TLPs, you effectively waste a whole extra transfer cycle on the AXI side for every TLP, and this results in a throughput hit of around 20%, potentially even higher (see the worked example after this list). The segmented interface does not have this issue because it's strictly in-order and it can access two adjacent addresses in the same clock cycle, so it can sustain 100% throughput in all cases (except for very small transfers).
  3. For an AXI master client, it should be quite similar to the streaming modules, with the streaming modules more or less directly handling the W and R channels. However, additional work may be needed to deal with read data interleaving on the R channel. It probably also makes sense to write some sort of control module that accepts a transfer request consisting of the AXI address, PCIe address, and length, and then handles managing the buffer memory as well as issuing transfer requests to both the (PCIe) DMA interface and the AXI DMA client modules (a rough port-level sketch of such a client follows below). For an AXI slave client, the read side should be reasonably straightforward, but the write side requires interpreting the WSTRB signal to potentially generate a very large number of DMA transfers, as PCIe does not support arbitrary byte masking.
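
To make the TLP packing penalty in point 2 concrete, here is a rough back-of-the-envelope calculation. The numbers are illustrative assumptions (a 512-bit AXI data interface and a 256-byte max payload size), not measurements from the repo:

```
AXI beat     = 512 bits = 64 bytes
max payload  = 256 bytes = 4 AXI beats of data in the best case
If a TLP payload is not aligned to an AXI beat, its data spans
5 beats instead of 4, so every TLP costs one extra cycle:
    effective throughput = 4 / 5 = 80%   (~20% hit)
With a 128-byte max payload the same effect costs 2/3, a ~33% hit.
```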
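
And to make point 3 more concrete, here is a very rough sketch of what the port list of a hypothetical dma_client_axi_source (buffer RAM to AXI master write) might look like, loosely modelled on the existing streaming client modules. All names, widths, and the simplified single-segment RAM read port are my assumptions, not anything that exists in this repository:

```verilog
// Hypothetical sketch only: names, widths, and the simplified (non-segmented)
// RAM read port are assumptions, not an existing module in this repository.
module dma_client_axi_source #(
    parameter RAM_ADDR_WIDTH = 16,  // address width of the internal buffer RAM
    parameter AXI_DATA_WIDTH = 256,
    parameter AXI_ADDR_WIDTH = 32,
    parameter AXI_ID_WIDTH = 8,
    parameter LEN_WIDTH = 16,
    parameter TAG_WIDTH = 8
)
(
    input  wire                        clk,
    input  wire                        rst,

    // transfer request: copy len bytes from buffer RAM to the AXI address
    input  wire [RAM_ADDR_WIDTH-1:0]   s_axis_desc_ram_addr,
    input  wire [AXI_ADDR_WIDTH-1:0]   s_axis_desc_axi_addr,
    input  wire [LEN_WIDTH-1:0]        s_axis_desc_len,
    input  wire [TAG_WIDTH-1:0]        s_axis_desc_tag,
    input  wire                        s_axis_desc_valid,
    output wire                        s_axis_desc_ready,

    // transfer completion status
    output wire [TAG_WIDTH-1:0]        m_axis_desc_status_tag,
    output wire                        m_axis_desc_status_valid,

    // read port into the DMA buffer RAM (simplified to one segment here)
    output wire [RAM_ADDR_WIDTH-1:0]   ram_rd_cmd_addr,
    output wire                        ram_rd_cmd_valid,
    input  wire                        ram_rd_cmd_ready,
    input  wire [AXI_DATA_WIDTH-1:0]   ram_rd_resp_data,
    input  wire                        ram_rd_resp_valid,
    output wire                        ram_rd_resp_ready,

    // AXI master, write address/data/response channels only
    output wire [AXI_ID_WIDTH-1:0]     m_axi_awid,
    output wire [AXI_ADDR_WIDTH-1:0]   m_axi_awaddr,
    output wire [7:0]                  m_axi_awlen,
    output wire [2:0]                  m_axi_awsize,
    output wire [1:0]                  m_axi_awburst,
    output wire                        m_axi_awvalid,
    input  wire                        m_axi_awready,
    output wire [AXI_DATA_WIDTH-1:0]   m_axi_wdata,
    output wire [AXI_DATA_WIDTH/8-1:0] m_axi_wstrb,
    output wire                        m_axi_wlast,
    output wire                        m_axi_wvalid,
    input  wire                        m_axi_wready,
    input  wire [1:0]                  m_axi_bresp,
    input  wire                        m_axi_bvalid,
    output wire                        m_axi_bready
);

// body omitted: the interesting parts are splitting (axi_addr, len) into
// legal AXI bursts (4 KB boundaries, 256-beat limit) and streaming the
// RAM read data into the W channel with correct first/last WSTRB masking

endmodule
```

A companion dma_client_axi_sink would drive the AR channel and write R data into the buffer RAM, which is where the single-ID restriction (or extra reordering logic for interleaved R data) from point 2 would come in.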