bperez77 / xilinx_axidma

A zero-copy Linux driver and a userspace interface library for Xilinx's AXI DMA and VDMA IP blocks. These serve as bridges for communication between the processing system and FPGA programmable logic fabric, through one of the DMA ports on the Zynq processing system. Distributed under the MIT License.

Slow read access to driver allocated memory #69

Closed: woodmeister123 closed this issue 5 years ago

woodmeister123 commented 5 years ago

I'm finding that read access to memory allocated by the driver is really slow, significantly slower than normal userspace memory. Writing seems fine. This is from a single memcpy of the entire buffer (2MB).

The DMA itself is working fine and I am getting good throughput.

Has anybody else come across this issue?

woodmeister123 commented 5 years ago

I'm guessing this is to do with the fact that the memory can't be cached, but should this cause a difference for a large block read?

bperez77 commented 5 years ago

What sort of throughput are you seeing for this memcpy? Is it close to the theoretical peak DRAM throughput? The cache does actually make a difference here, because the upper levels of the cache can overlap fetching data from DRAM with copying data to the destination. I wouldn't expect a giant difference, but it could make somewhat of a difference.

Unfortunately, the DRAM buffers are explicitly marked as non-cacheable. This is done because the FPGA is not cache-coherent with respect to the processor, so leaving the buffers cacheable would lead to some nasty bugs. The Zynq processing system does have the Accelerator Coherency Port (ACP), which is cache-coherent with the processors. However, there's only one of these ports, so the driver instead focuses on the High Performance (HP) ports, which give more flexibility to the user.
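
Concretely, the driver's mmap path boils down to something like the sketch below. This is an illustration of the idea rather than the exact code; the struct and field names are placeholders, and the real implementation may differ in detail.

```c
/* Sketch: handing a dma_alloc_coherent() buffer to userspace with caching
 * disabled. Placeholder types and fields, not the driver's exact structures. */
#include <linux/dma-mapping.h>
#include <linux/fs.h>
#include <linux/mm.h>

struct axidma_buffer {
    struct device *dev;     /* device the buffer was allocated against */
    void *cpu_addr;         /* kernel virtual address from dma_alloc_coherent() */
    dma_addr_t dma_addr;    /* bus address from dma_alloc_coherent() */
};

static int axidma_mmap_sketch(struct file *file, struct vm_area_struct *vma)
{
    struct axidma_buffer *buf = file->private_data;
    size_t size = vma->vm_end - vma->vm_start;  /* size checks omitted for brevity */

    /* Force a non-cacheable userspace mapping, since the HP ports are not
     * cache-coherent with the CPUs. */
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

    /* Map the coherent buffer into the calling process's address space. */
    return dma_mmap_coherent(buf->dev, vma, buf->cpu_addr, buf->dma_addr, size);
}
```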

The other option would be to have the user manually manage flushing the cache. However, this would require the user to go into kernel space every time they want to invalidate or flush cache entries. It would also be fairly messy from a user-interface standpoint, as doing this from a userspace program would be decently complicated. Additionally, you may not end up seeing much of a performance boost because of the need to constantly flush the cache by hand.

What sort of behavior are you trying to achieve in your application? I could suggest some potential alternatives.

woodmeister123 commented 5 years ago

My application is receive-only and essentially uses the DMA to stream blocks from a streaming channel into the PS. Eventually I want to be able to scale this up to as high a rate as possible.

I'll do some proper benchmarking to try and prove the effect is as dramatic as it feels so far. Transfers out of RAM are failing to keep up, at somewhere between 50-100 MB/s at the moment. The DMA transfers themselves are fine at this point; I can run them at up to 200 MB/s without problems, which is the maximum my design can currently deliver.

woodmeister123 commented 5 years ago

OK, so benchmarking illustrates the issue quite nicely. I get 4.9 GB/s writing to a buffer allocated by axidma_malloc, but only 147 MB/s reading from it. For comparison, I get 5.3 GB/s copying between normal userspace malloc'd buffers.

Reading around, I see people hitting similar issues and trying to do the manual cache sync, etc., but should the difference really be this great?

bperez77 commented 5 years ago

Yeah, there's no avoiding reading it back from DRAM in that case.

That's really surprising. I'm actually quite baffled as to why the write speed would be an order of magnitude higher than the read speed. In general, writes to DRAM should be slower than reads, but even then, the two should still be relatively close. I don't think the difference should be that large, and the read and write speeds definitely should not differ by such a great amount in the non-cached case.

Which Xilinx board are you using? The Zynq-7000 processing system has DDR3, which should top out around 6400 MB/s (6.4 GB/s). Now, you naturally won't hit this peak, but I would expect it should be higher than 147 MB/s.

Would you also mind sharing your benchmarking program (preferably attached as a file to this thread)?

woodmeister123 commented 5 years ago

This is on a ZCU102, which has DDR4. I'll tidy up my benchmarking code and upload it.

woodmeister123 commented 5 years ago

memory_bandwidth_test.zip

There are three examples: writing to the DMA memory, reading from it, and a pure userspace test. I've recently tried alternating between buffers containing different data to try to expose any caching effects, but I haven't managed to make the write or userspace tests any worse. I am compiling with gcc, and the performance is pretty much the same in debug and release.
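
For anyone who doesn't want to open the zip, the read test boils down to roughly the sketch below. It is a simplified illustration rather than the attached program, and it assumes the libaxidma calls axidma_init, axidma_malloc, axidma_free, and axidma_destroy with approximately these signatures; the buffer size matches the 2 MB case above, and the iteration count is arbitrary.

```c
/* Rough sketch of the "read from DMA memory" test (assumed libaxidma API). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#include "libaxidma.h"

#define BUF_SIZE  (2 * 1024 * 1024)   /* 2 MB, matching the original report */
#define NUM_ITERS 100

static double elapsed_sec(struct timespec start, struct timespec end)
{
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void)
{
    axidma_dev_t dev = axidma_init();
    if (dev == NULL) {
        fprintf(stderr, "Failed to initialize the AXI DMA device.\n");
        return 1;
    }

    void *dma_buf  = axidma_malloc(dev, BUF_SIZE);  /* driver-allocated, mmap'd buffer */
    void *user_buf = malloc(BUF_SIZE);              /* ordinary userspace buffer */
    if (dma_buf == NULL || user_buf == NULL) {
        fprintf(stderr, "Failed to allocate the buffers.\n");
        return 1;
    }

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NUM_ITERS; i++) {
        memcpy(user_buf, dma_buf, BUF_SIZE);        /* read from the DMA buffer */
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double mb_per_s = (double)BUF_SIZE * NUM_ITERS / elapsed_sec(start, end) / 1e6;
    printf("Read from DMA buffer: %.1f MB/s\n", mb_per_s);

    axidma_free(dev, dma_buf, BUF_SIZE);
    free(user_buf);
    axidma_destroy(dev);
    return 0;
}
```

The write test is the same with the memcpy direction reversed, and the pure userspace test replaces the DMA buffer with a second malloc'd buffer.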

woodmeister123 commented 5 years ago

So I am currently surmising that this is a result of caching effects, although I still don't understand the difference between read and write speeds. So far I have been using the HP ports on the MPSoC, which rely on software to enforce coherency. The MPSoC also has two HPC ports, which go through the CCI and enforce cache coherency in hardware, so the memory can be marked as cacheable. I guess these are similar to the ACP port, but they seem to be designed to be a bit easier to use. According to this from Xilinx, if you add dma-coherent in the device tree, then dma_alloc_coherent will allocate cacheable memory.

I think in this case the driver shouldn't require modification, as this should be handled by the kernel in the dma_alloc_coherent function. I'm not sure about the mapping to userspace, though; it appears from the kernel code that the dma_mmap_coherent function always marks the mapping as non-cacheable.

So far I have enabled this flag in the device tree and it hasn't affected the speed of my accesses from userspace. I would be interested in your thoughts.

Further update:

So the speed issue is definitely due to caching: if I add dma-coherent to your driver's node in the device tree and remove the line where you declare the mmap as non-cacheable, then the read speed becomes the same as in my other tests.

So the question is: if hardware cache-coherent DMA is enabled through the HPC port, so that dma_alloc_coherent gives cacheable memory, is it still necessary for the driver to declare the mmap as non-cacheable?

bperez77 commented 5 years ago

Sorry for the late reply. I'm still quite surprised by these results. The fact that the memory is slower when not cached isn't surprising, but it's still baffling to me why reads and writes would have such different speeds.

I'll need to do some reading to double-check that (I have a Zedboard, so there are no HPC ports), but if the HPC ports provide a cache-coherence guarantee, then there's no reason the driver needs to mark that particular memory as non-cacheable. That was only done to simplify the API for dealing with these DMA buffers: for the HP and GP ports, if the DMA buffer was not mapped as non-cacheable, the driver would have to expose some interface to flush and invalidate the buffer.

The question now becomes how to expose this ability to the user, and how the driver can appropriately set the cacheable bit. I'll need to determine that, but do you happen to know if there's any clean way to tell whether a DMA channel is associated with a particular port? The answer is probably no, since I imagine that wouldn't be cleanly exposed in the device tree.

woodmeister123 commented 5 years ago

Yes, agreed; I don't really understand the speed difference either.

From my experiments, dma_mmap_coherent automatically sets the cacheable bit when necessary, so I don't think this is required in the driver. I'll send a PR for the line I removed so you can see what I did.

bperez77 commented 5 years ago

Yeah, that makes sense; there was a point where it wasn't doing that properly, but it sounds like it's been fixed since then.

woodmeister123 commented 5 years ago

Happy that this dma-coherent mode is working ok now.

eleICoto commented 5 years ago

@woodmeister123 Sorry to bother you; I've encountered this problem too. I'm using a ZCU104. Would you mind sharing more detail about how you solved it?

woodmeister123 commented 5 years ago

Yes, if you use dma-coherent in the device tree and tie off the appropriate bits on the AXI interface as specified by Xilinx here, then you should get coherent access on the HPC ports.

eleICoto commented 5 years ago

OK, so you mean: first, in the block design I should use the HPC port and tie off the appropriate bits like this [image]. Then, add dma-coherent in the device tree? In the chrdev node or the axi node? [image] [image] Thanks for your reply, by the way.

woodmeister123 commented 5 years ago

Yep, I put dma-coherent on both nodes; I'm not sure whether it's necessary on both or not.
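
In device tree terms that means adding the property to both nodes, roughly like the fragment below. The labels axi_dma_0 and axidma_chrdev_0 are just placeholders for whatever your design and the driver's example device tree actually call those nodes.

```dts
/* Sketch only: node labels depend on your block design and device tree. */
&axi_dma_0 {            /* the Xilinx AXI DMA controller node */
    dma-coherent;
};

&axidma_chrdev_0 {      /* the driver's character-device node */
    dma-coherent;
};
```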

eleICoto commented 5 years ago

@woodmeister123 No luck. If I add dma-coherent to the chrdev node, I can see a speed boost when reading the DMA memory buffer, but in that situation my two-way DMA transfer fails; more specifically, the read transfer fails. I have verified that my IP works properly with no dma-coherent on the chrdev node, so I wonder where this could be going wrong.

eleICoto commented 5 years ago

@woodmeister123 And if I add dma-coherent just to the DMA node, both DMA transfers work properly, but the buffer read speed is still low.

eleICoto commented 5 years ago

@woodmeister123 By the way, I'm using PetaLinux to build my whole system.

snaillor commented 3 years ago

@woodmeister123 I have the same problem as you; have you solved it? I added dma-coherent for the driver and added an ILA core in the PL. I found that the data transferred at the beginning was all wrong, and then some data came through. Although the length of the data is correct, the data itself is wrong.

stone-sjj commented 3 years ago

@woodmeister123 I have the same problem as you; have you solved it? I added dma-coherent in the driver and added an ILA core in the PL. I found that the data transferred at the beginning was all wrong, and then there was some data. Although the data length is correct, the data itself is wrong.

Hello! Has this problem been solved?