bmartini / zynq-xdma

Linux Driver for the Zynq FPGA DMA engine

Only achieving 100 MB/s #6


bfroute-goog commented 9 years ago

I have a simple loop-through design on a Zynq 7010 working. I am using a build out of PetaLinux 2014.4 with Linux kernel 3.17. I have the xdma driver installed in the build along with the xdma test app, and I have increased my AXI DMA LogiCORE PL buffer lengths to the maximum (256). When I run the xdma test app and look at the output using dmesg, the most throughput I can get is about 100 MB/s. I have also tried increasing the clock frequency that goes to the PL from 100 MHz up to 150 MHz; there was no difference in performance. Two questions: 1. What should I expect to see? 2. What can I do to increase it?

Also, SG is turned off, with the core set to simple mode.

bmartini commented 9 years ago

Sorry to say I've never used this on the Zynq 7010, so I can't say if that's a good number or not. You can try to use more of the HP ports in the Zynq to have multiple streams of data coming in or out, or you could increase the PL DMA bus width from 32 to 64 bits to see if that makes a difference.

There is currently no plan to add SG support to the driver but if you are interested in doing so I could pass your details on to others that might be interested in helping.

You might also want to look at https://github.com/bmartini/zynq-axis.git as an alternative to the xdma.

afilguer commented 9 years ago

I have not used the 7010. I recall getting over 200 MB/s running the DMA engines at 100 MHz with a 32-bit bus through the ACP port on a 7020, but I was not using this driver. BTW, this may change depending on the data transfer size, the overall traffic through the memory bus at any given moment, and many other variables.

Although it's difficult to come up with a number, it looks a little slow to me.

On the other hand, at some point I may be implementing SG transfers in order to transfer data from user space. Any details on this will be hugely appreciated :)

bmartini commented 9 years ago

This driver does allow for transfers from user space; to do so it uses a CMA memory area which is not cache-coherent. This can cause some slowdowns when using that memory for CPU calculations, but you can always copy data in and out of the area before doing any calculation.
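The copy-in/copy-out pattern described above can be sketched as follows. This is an illustration only: `cma_buf` here is plain malloc'd memory standing in for the mmap'd CMA region, and `process_dma_result` is a hypothetical helper, not part of the driver's API:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* CPU-heavy work: here just a checksum over the data. */
static uint32_t sum_words(const uint32_t *p, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Stage the DMA result through a cached buffer instead of computing
 * directly in the (uncached) CMA region, which would be slow. */
uint32_t process_dma_result(const uint32_t *cma_buf, size_t nwords)
{
    /* copy out of the uncached region first... */
    uint32_t *cached = malloc(nwords * sizeof *cached);
    memcpy(cached, cma_buf, nwords * sizeof *cached);

    /* ...then do the CPU work on normal cached memory */
    uint32_t s = sum_words(cached, nwords);
    free(cached);
    return s;
}
```

The memcpy cost is usually far smaller than the penalty of doing repeated uncached reads inside a compute loop.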

bmartini commented 9 years ago

The DMA data rate for long transfers is probably about the same, as this is a function of the HP port and memory controller. However, I have found the overhead for zynq-axis to be lower, and thus faster when you have to do lots of transfers. When using the ARM to start a transfer you only need to write 3 register values into the PL; I don't know how many the Xilinx IP needs, but it seems to be more. The real benefit comes when you need to perform multiple transfers from the PL: instead of signaling the ARM to perform each transfer, the PL can perform all the transfers itself and only signal the ARM when they are all done. That means less interaction with the ARM and thus faster processing time.

bfroute-goog commented 9 years ago

What I noticed with zynq-xdma is that with the default transfer size (1 MB) in the built-in test, the overhead to set up the loop-through transfer is high. When I increased the amount to 4 MB, the transfer rate went up to well over 200 MB/s.

I guess I am a little confused about zynq-axis versus zynq-xdma. Which one is better for which situation? I am currently not planning on using scatter-gather, but am most interested in the one that works best. What is the limitation of zynq-axis?

afilguer commented 9 years ago

I was thinking of user-allocated memory, allocated using malloc() for instance. To do so, the pages must be pinned and an SG transfer performed, as far as I know.

I think this may be faster than copying chunks of memory back and forth.

bmartini commented 9 years ago

I believe that you will need some way to flush (clean) the cache lines after the PS has written to the memory so the PL sees the data, and to invalidate the cache lines before the PS reads memory the PL has written, so that the data stays in sync between the PS and PL. There has been a lot of discussion about that sort of thing elsewhere (http://forums.xilinx.com/t5/Embedded-Linux/Flush-cache-on-Zynq-under-Linux/td-p/541815).