esa-tu-darmstadt / tapasco

The Task Parallel System Composer (TaPaSCo)
GNU Lesser General Public License v3.0

AXI Transactions for DirectDMA (local memory) #359

Open wirthjohannes opened 1 year ago

wirthjohannes commented 1 year ago

I did some experiments using the DirectDMA implementation, which is used for transferring data to and from PE-local memory (BRAM). I attached an ILA directly at the PCIe bridge on the FPGA to look into the AXI transactions generated when calling the copy_to (and copy_from) method of DirectDMA for different sizes (64B, 128B, 192B, 256B, 320B). The results differ from what I expected.

Firstly, the AXI transactions were always 32B in size (on a 64B-wide interface; only the upper or lower half of the strobe bits was set; no bursts). Further experiments suggest that 32B is the upper bound per transfer here; I am not sure exactly where this limitation comes from.
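For reference, the strobe pattern described above follows from the address offset within the 64B bus word. A minimal sketch, assuming single-beat writes and the usual AXI convention that strobe bit i marks byte lane i as valid (the helper name `wstrb_mask` is hypothetical and not part of TaPaSCo):

```python
def wstrb_mask(addr, size, bus_bytes=64):
    """Return the write-strobe bitmask of a single-beat write.

    Bit i of the result corresponds to byte lane i of the data bus.
    Hypothetical helper for illustration only.
    """
    offset = addr % bus_bytes
    if offset + size > bus_bytes:
        raise ValueError("transfer crosses a bus-width boundary")
    # 'size' consecutive valid lanes, starting at the offset in the bus word
    return ((1 << size) - 1) << offset


if __name__ == "__main__":
    for addr in (0x00, 0x20):
        print(f"addr=0x{addr:02x} wstrb=0x{wstrb_mask(addr, 32):016x}")
```

A 32B write to 0x00 sets only the lower 32 strobe bits, a 32B write to 0x20 only the upper 32, which matches what the ILA showed.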

Even disregarding this, there were other peculiarities: for transfers >= 128B there were more 32B transactions than required. Looking at the ILA, I found that some 32B words were transmitted multiple times.

From my experiments this does not affect correctness, as the data is just transferred multiple times to the same address. However, this is of course still suboptimal, e.g. with regard to performance.

Details

The following table shows the exact transfers (start address of each 32B transaction, in the order observed) for copy_to calls of different sizes. For each size, the left column gives the actual transfers and the right column what I would have expected.


| 64 B actual | 64 B expected | 128 B actual | 128 B expected | 192 B actual | 192 B expected | 256 B actual | 256 B expected | 320 B actual | 320 B expected |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0x0 | 0x0 | 0x0 | 0x0 | 0x0 | 0x0 | 0x0 | 0x0 | 0x120 | 0x000 |
| 0x20 | 0x20 | 0x20 | 0x20 | 0x20 | 0x20 | 0x20 | 0x20 | 0x100 | 0x020 |
|  |  | 0x40 | 0x40 | 0x40 | 0x40 | 0x40 | 0x40 | 0x0e0 | 0x040 |
|  |  | 0x60 | 0x60 | 0x60 | 0x60 | 0x60 | 0x60 | 0x0c0 | 0x060 |
|  |  | 0x60 |  | 0xa0 | 0x80 | 0xe0 | 0x80 | 0x0a0 | 0x080 |
|  |  | 0x40 |  | 0x80 | 0xa0 | 0xc0 | 0xa0 | 0x080 | 0x0a0 |
|  |  | 0x20 |  | 0x60 |  | 0xa0 | 0xc0 | 0x060 | 0x0c0 |
|  |  | 0x00 |  | 0x40 |  | 0x80 | 0xe0 | 0x040 | 0x0e0 |
|  |  |  |  |  |  |  |  | 0x000 | 0x100 |
|  |  |  |  |  |  |  |  | 0x020 | 0x120 |
|  |  |  |  |  |  |  |  | 0x040 |  |
|  |  |  |  |  |  |  |  | 0x060 |  |
|  |  |  |  |  |  |  |  | 0x120 |  |
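The redundancy in the table can be quantified with a short script. This is only a sketch: the function names are hypothetical, and the address lists are copied by hand from the 128 Byte and 320 Byte columns above.

```python
from collections import Counter

CHUNK = 32  # bytes per observed AXI transaction


def expected_addresses(size, chunk=CHUNK):
    """Addresses of the minimal set of chunk-sized writes covering 'size' bytes."""
    return list(range(0, size, chunk))


def summarize(observed, size, chunk=CHUNK):
    """Compare observed transaction addresses against the minimal set."""
    expected = expected_addresses(size, chunk)
    counts = Counter(observed)
    return {
        "expected": len(expected),
        "observed": len(observed),
        # transactions beyond the first write to each address
        "redundant": len(observed) - len(counts),
        "duplicates": {a: n for a, n in counts.items() if n > 1},
        "missing": [a for a in expected if a not in counts],
    }


# Observed copy_to addresses, taken from the table above
observed_128 = [0x00, 0x20, 0x40, 0x60, 0x60, 0x40, 0x20, 0x00]
observed_320 = [0x120, 0x100, 0x0E0, 0x0C0, 0x0A0, 0x080, 0x060,
                0x040, 0x000, 0x020, 0x040, 0x060, 0x120]

if __name__ == "__main__":
    for size, obs in ((128, observed_128), (320, observed_320)):
        s = summarize(obs, size)
        dups = ", ".join(hex(a) for a in sorted(s["duplicates"]))
        print(f"{size} B: {s['observed']} observed, {s['expected']} expected, "
              f"{s['redundant']} redundant (duplicated: {dups})")
```

For these two columns, every expected address is written at least once, so the extra transactions are pure duplicates rather than replacements for missing writes.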

I also looked into the copy_from method. It behaves identically for up to 256 Bytes. For 320 Bytes (and more) it behaves differently, producing even more read transactions (e.g. 17 read transactions vs. 13 write transactions for 320 Bytes).

jahofmann commented 1 year ago

The runtime uses AVX/SSE when available. AVX registers are 32B (256 bit) wide on most machines. You could try an AVX-512 machine to see if you get 64B requests. I'm not aware of a faster way to copy data from the CPU over PCIe if you don't want to use an on-device DMA engine.
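As a rough illustration of that reasoning, the sketch below computes the minimum number of requests per transfer size. It assumes that each vector store reaches the FPGA as exactly one request of register width, which is a hypothesis and not something verified here; the names are made up for the example.

```python
import math

# Vector register widths in bytes
REGISTER_BYTES = {"SSE": 16, "AVX": 32, "AVX-512": 64}


def min_requests(size, width):
    """Minimum number of width-sized stores needed to cover 'size' bytes."""
    return math.ceil(size / width)


if __name__ == "__main__":
    for size in (64, 128, 192, 256, 320):
        row = ", ".join(f"{name}: {min_requests(size, width)}"
                        for name, width in REGISTER_BYTES.items())
        print(f"{size:3d} B -> {row}")
```

Under this assumption a 320B copy needs at least 10 requests with 32B registers and 5 with 64B registers.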

As for the extra requests: No idea where those might come from.

wirthjohannes commented 1 year ago

Yes, that makes sense regarding the 32B transfers. Thanks.