bmartini / zynq-xdma

Linux Driver for the Zynq FPGA DMA engine

Buffer to Buffer Transfer Rates #10

Closed: austinsteamboat closed this issue 8 years ago

austinsteamboat commented 8 years ago

I'm having an issue with buffer-to-buffer transfer rates. When simply looping data back through the FPGA, I'm observing ~450 MB/s for CPU -> FPGA -> CPU using xdma-demo.c with minor alterations, which is awesome performance. However, when I use memcpy to move the processed data from the memory-mapped buffer into a new buffer (e.g. to pass the FPGA output on for further processing on the CPU and free the mmap'd buffer to load more data), the transfer rate is only ~64 MB/s. I benchmarked a plain buffer-to-buffer memcpy of the same size in the C code and the Zedboard managed ~256 MB/s. This leads me to believe that something like four memory copies have to happen to pull data from the mmap'd buffer (a rough sketch of the loop I'm timing is below, after the questions). So my questions:

  1. Is this expected behavior?
  2. Is there a simple way to increase performance?
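Here's the sketch of what I'm timing. The buffer names and transfer size are illustrative, and in the real test the source pointer comes from the driver's mmap rather than malloc; the DMA setup from xdma-demo.c is omitted.

```c
/* Minimal sketch of the memcpy benchmark (illustrative only). */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define XFER_BYTES (8 * 1024 * 1024)  /* example transfer size */

int main(void)
{
    /* In the real test this pointer comes from mmap() on the xdma device;
     * here it is an ordinary allocation so the sketch compiles on its own. */
    uint8_t *dma_buf = malloc(XFER_BYTES);
    uint8_t *cpu_buf = malloc(XFER_BYTES);
    struct timespec t0, t1;

    memset(dma_buf, 0xA5, XFER_BYTES);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(cpu_buf, dma_buf, XFER_BYTES);  /* the slow step: ~64 MB/s out of the mmap'd buffer */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("memcpy rate: %.1f MB/s\n", XFER_BYTES / secs / 1e6);

    free(dma_buf);
    free(cpu_buf);
    return 0;
}
```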

Thanks, -Austin

bmartini commented 8 years ago

Sorry to say, but a slow copy out of or into the CMA memory is expected.

The ARM cores on the Zynq don't have a good way to flush or refresh the memory caches from user space. Thus we can get into a situation where the FPGA changes data values in memory but the CPU doesn't see them, or vice versa. To get around this, the zynq-xdma driver marks its CMA memory as cache transparent (non-cached), so the data is never kept in cache but is instead read from or written to memory directly. This makes copying or otherwise working with data in the CMA arrays very slow.
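For context, a driver typically gets this behaviour by mapping the buffer with non-cached page protections in its mmap handler. The sketch below is a generic illustration of that pattern, not the literal zynq-xdma code; `buf_phys` is a hypothetical physical address of the CMA buffer.

```c
/* Generic illustration of exporting a DMA buffer to user space with caching
 * disabled (not the exact zynq-xdma implementation). Assumes the buffer was
 * allocated from CMA and its physical address is known. */
#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/types.h>

static dma_addr_t buf_phys;  /* hypothetical: filled in when the CMA buffer is allocated */

static int example_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    /* Mark the mapping as non-cached: every user-space access goes straight
     * to DRAM, which keeps the CPU and FPGA views coherent but makes memcpy slow. */
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

    return remap_pfn_range(vma, vma->vm_start,
                           buf_phys >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}
```

With a cached mapping instead, the memcpy would be fast, but user space would then need a way to flush/invalidate the caches, which is exactly the problem described above.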

austinsteamboat commented 8 years ago

Thanks for the quick reply. It sounds like there's no quick fix, but it's good to know this level of performance is expected.

dirkcgrunwald commented 8 years ago

Brian, did you try using the __cpuc_flush_dcache_area and outer_inv_range functions (see https://forums.xilinx.com/xlnx/attachments/xlnx/ELINUX/11158/1/Linux%20CPU%20to%20PL%20Access.pdf ) to flush the cache rather than making it uncached?

I don't know the speed of this, but according to comments at e.g. https://forums.xilinx.com/t5/Embedded-Linux/Flush-cache-on-Zynq-under-Linux/td-p/541815 this is the preferred solution if you can't use the cache-coherent ACP.
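For reference, the pattern those threads describe looks roughly like the sketch below (32-bit ARM kernel code; the buffer pointer, physical address, and helper names are illustrative assumptions, not anything taken from the zynq-xdma driver):

```c
/* Rough illustration of flushing/invalidating caches around a DMA transfer
 * on a 32-bit ARM (Zynq) kernel, per the Xilinx forum threads. 'buf' is a
 * kernel virtual address and 'buf_phys' the matching physical address. */
#include <linux/types.h>
#include <asm/cacheflush.h>
#include <asm/outercache.h>

static void flush_for_device(void *buf, dma_addr_t buf_phys, size_t len)
{
    /* CPU -> FPGA: push dirty L1 lines out, then clean L2 down to DRAM. */
    __cpuc_flush_dcache_area(buf, len);
    outer_clean_range(buf_phys, buf_phys + len);
}

static void invalidate_for_cpu(void *buf, dma_addr_t buf_phys, size_t len)
{
    /* FPGA -> CPU: drop stale L2 lines, then flush L1 so reads hit DRAM. */
    outer_inv_range(buf_phys, buf_phys + len);
    __cpuc_flush_dcache_area(buf, len);
}
```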

Is the interface for the ACP different?

bmartini commented 8 years ago

Thanks for the links, I'll have to spend a bit of time reading them to see if they can be applied to the driver. A couple of things I noticed from a quick skim: they are talking about non-DMA data transfers, so I'll have to see if the approach works with DMA memory arrays, and the example appears to use memory allocated in kernel space (I believe). In general, kernel memory can't be read or written directly from user space; the driver has to copy the data to or from a user buffer, and that copy can be a big performance hit.
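As an illustration of that extra copy, here is a generic read() handler sketch, not code from this driver; `kbuf` and `kbuf_len` are hypothetical driver state.

```c
/* Generic sketch of the kernel-to-user copy being described. */
#include <linux/fs.h>
#include <linux/uaccess.h>
#include <linux/types.h>

/* Hypothetical driver state: a kernel buffer holding data for user space. */
static char *kbuf;
static size_t kbuf_len;

static ssize_t example_read(struct file *filp, char __user *ubuf,
                            size_t count, loff_t *ppos)
{
    size_t avail, n;

    if (*ppos >= kbuf_len)
        return 0;

    avail = kbuf_len - *ppos;
    n = count < avail ? count : avail;

    /* The extra copy: kernel buffer -> user buffer. An mmap() of the CMA
     * buffer avoids this step, at the cost of the uncached mapping. */
    if (copy_to_user(ubuf, kbuf + *ppos, n))
        return -EFAULT;

    *ppos += n;
    return n;
}
```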

The Zynqs that I've been using have only one ACP port but four HP ports for DMA. I've not used the ACP port as I've always needed multiple streams of data to and from memory, which requires the HP ports. From my understanding, the ACP does not connect the PL to the memory but to the CPU cache, thus allowing for cache coherency. Because of this there is a limit to the size of the data that can be transferred in one go. However, as I said, I've never used it and it's been a while since I looked at it, so I could be wrong.