bperez77 / xilinx_axidma

A zero-copy Linux driver and a userspace interface library for Xilinx's AXI DMA and VDMA IP blocks. These serve as bridges for communication between the processing system and FPGA programmable logic fabric, through one of the DMA ports on the Zynq processing system. Distributed under the MIT License.

DMA error message when using FTP or SSH #56

Closed Baum55 closed 6 years ago

Baum55 commented 6 years ago

I receive the message "xilinx-vdma 40400000.dma: Channel ef3cb010 has errors 10, cdr 0 tdr 0". I use the Zynq Zybo Z7 board with the 2017.4 Linux from Xilinx. I use only one receive channel to read data from the FPGA in a high-priority thread. This thread only reads the data and puts the read pointer into a mutex-protected ring buffer (the mutex does not block; the problem still occurs even in an unsafe, mutex-free version). I read at a data rate of 11520000 Byte/s. While this is running, I start an FTP or SSH data transfer. I tried changing the nice level of my application, but even at the highest priority (-20) the problem persists. The problem only occurs when both system utilization and network utilization are high. Normally there is no FTP/SSH data transfer in a production system, but I would like to understand why I get the message during network activity.

bperez77 commented 6 years ago

Interesting, it's hard to say what the issue would likely be without more details. You're likely seeing a VDMAIntErr because the amount of data you're receiving does not match the amount you specified (either greater than or less than).
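To make the "errors 10" value in the log message easier to read, here is a minimal sketch that decodes it into named status bits. It assumes the driver prints the value in hex and that the bits follow the DMASR error layout in the Xilinx AXI DMA product guide (PG021), where bit 4 (0x10) is the internal error, DMAIntErr. The helper `decode_dmasr_errors` and the table are hypothetical names for illustration, not part of the driver or the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Error bits of the AXI DMA status register (DMASR).
 * Assumption: bit positions as documented in Xilinx PG021. */
struct dmasr_err_bit {
    uint32_t mask;
    const char *name;
};

static const struct dmasr_err_bit dmasr_err_bits[] = {
    { 1u << 4,  "DMAIntErr" },  /* internal error, e.g. length mismatch */
    { 1u << 5,  "DMASlvErr" },  /* slave error response on the bus */
    { 1u << 6,  "DMADecErr" },  /* address decode error */
    { 1u << 8,  "SGIntErr"  },  /* scatter-gather internal error */
    { 1u << 9,  "SGSlvErr"  },  /* scatter-gather slave error */
    { 1u << 10, "SGDecErr"  },  /* scatter-gather decode error */
};

/* Hypothetical helper: writes a space-separated list of the error names
 * set in `errors` into `out` and returns how many names were written.
 * Names that do not fit into `out` are skipped entirely. */
static int decode_dmasr_errors(uint32_t errors, char *out, size_t out_len)
{
    size_t used = 0;
    int count = 0;

    assert(out != NULL && out_len > 0);
    out[0] = '\0';

    for (size_t i = 0; i < sizeof(dmasr_err_bits) / sizeof(dmasr_err_bits[0]); i++) {
        if (!(errors & dmasr_err_bits[i].mask))
            continue;

        size_t name_len = strlen(dmasr_err_bits[i].name);
        size_t sep_len = count ? 1 : 0;
        if (used + sep_len + name_len + 1 > out_len)
            break;  /* not enough room left for this name */

        if (sep_len)
            out[used++] = ' ';
        memcpy(out + used, dmasr_err_bits[i].name, name_len + 1);
        used += name_len;
        count++;
    }
    return count;
}
```

Calling `decode_dmasr_errors(0x10, buf, sizeof buf)` with a 64-byte `buf` leaves `"DMAIntErr"` in the buffer, which matches the internal-error reading above.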

It's odd that doing an FTP or SSH transfer would affect whether this issue occurs. The only thing I can think of is that, if the network controller and the AXI VDMA IP shared a DMA channel, there might be some data contention. However, I don't think that is the case here, because they should be using different DMA channels, and DMA transfers use a two-way handshake, so no data should ever be lost.

Can you give me the following to help you debug this:

  1. Some more specifics about the data you're receiving.
  2. The output of the dmesg command (please attach this as a file).
  3. The device tree entries relevant to AXI VDMA.

Baum55 commented 6 years ago

  1. The device is a line camera. The FPGA cyclically reads 16-bit black-and-white line data. The FPGA assembles several lines and transmits them as one contiguous block of memory (FIFO s_axis_tlast). The application receives this block with "axidma_oneway_transfer" and interprets it as an image. Another thread sends the image via UDP to an industrial PC for further evaluation. If the application does not complete the transfer in time, the FIFO fills up completely and data is lost. Please see the Osci.pdf file attachment.
  2. Please see the dmesg.txt file attachment. Note: the first message of "xilinx-vdma 40400000.dma: Channel ef16cf10 has errors 10, cdr 0 tdr 0" on line 178 is due to the start-up phase.
  3. Device tree entries:

    axidma_chrdev: axidma_chrdev@0 {
        compatible = "xlnx,axidma-chrdev";
        dmas = <&axi_dma_0 0>;
        dma-names = "rx_channel";
    };

    axi_dma_0: dma@40400000 {
        #dma-cells = <1>;
        clock-names = "s_axi_lite_aclk", "m_axi_sg_aclk", "m_axi_s2mm_aclk";
        clocks = <&clkc 15>, <&clkc 15>, <&clkc 15>;
        compatible = "xlnx,axi-dma-1.00.a";
        interrupt-parent = <&intc>;
        interrupts = <0 29 4>;
        reg = <0x40400000 0x10000>;
        xlnx,addrwidth = <0x20>;
        dma-channel@40400030 {
            compatible = "xlnx,axi-dma-s2mm-channel";
            dma-channels = <0x1>;
            interrupts = <0 29 4>;
            xlnx,datawidth = <0x200>;
            xlnx,device-id = <0x0>;
        };
    };

bperez77 commented 6 years ago

Great, thanks for the info and the waveform. This definitely sounds like a dropped data/packets issue. What I think is happening (keep in mind this is pure speculation) is that there is too much contention on the network side of your transfers. Receiving data from the PL is coupled to sending it over the network. When you're simultaneously doing the FTP or SSH transfer, the transfer over the network must slow down enough that your ring buffer overflows. In turn, this causes the FIFO on the PL to overflow, and the packet drop eventually bubbles up to the driver as receiving less data than you expected.

Since I don't know the specifics of your application, it's hard to determine if this is the exact cause, but it seems pretty likely to me. The solution depends on the nature of the FTP/SSH transfer. If these transfers are bursty, then increasing the size of the ring buffer should alleviate the backpressure on the PL FIFO.

Otherwise, if these FTP/SSH transfers are consistent and at a steady rate, you'll need to either find a way to increase the transfer rate of the Ethernet device, or reduce the FPS at which the line camera operates.
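To put rough numbers on the bursty and steady cases, here is a small sketch of the buffer arithmetic. The only figure taken from this thread is the input rate of 11520000 Byte/s; the drain rate and buffer sizes are made-up illustration values, and both function names are hypothetical.

```c
#include <assert.h>

/* Seconds until a buffer with `free_bytes` of headroom overflows when it is
 * filled at `in_rate` and drained at `out_rate` (both in bytes per second).
 * Returns a negative value if the drain keeps up, i.e. it never overflows. */
static double seconds_until_overflow(double free_bytes, double in_rate,
                                     double out_rate)
{
    assert(free_bytes >= 0.0 && in_rate >= 0.0 && out_rate >= 0.0);
    if (out_rate >= in_rate)
        return -1.0;
    return free_bytes / (in_rate - out_rate);
}

/* Headroom in bytes needed to ride out a slowdown lasting `burst_seconds`
 * during which the drain rate falls to `out_rate`. */
static double bytes_needed_for_burst(double burst_seconds, double in_rate,
                                     double out_rate)
{
    assert(burst_seconds >= 0.0 && in_rate >= 0.0 && out_rate >= 0.0);
    if (out_rate >= in_rate)
        return 0.0;
    return burst_seconds * (in_rate - out_rate);
}
```

For example, if the network thread could only drain 8000000 Byte/s during a competing transfer (an assumed figure), `seconds_until_overflow(7040000.0, 11520000.0, 8000000.0)` gives 2 seconds. A larger ring buffer only extends that time, which is why it helps with short bursts but not with a sustained slowdown.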

I don't think this issue is related to the CPU utilization. The relevant parts of the design (at least as described above) are all handled by DMA transfers, so the CPU is only acting as a controller in that context.

Baum55 commented 6 years ago

Thank you very much for your help. That was very helpful.

I have a consistent and steady FTP/SSH transfer rate, so I went with the solution you suggested for that case.

bperez77 commented 6 years ago

Great, glad to hear that helped.