Xilinx / RFSoC-PYNQ

Python productivity for RFSoC platforms

Receiver DMA transfer length limit of 32768 #14

Open andreaskuster opened 1 year ago

andreaskuster commented 1 year ago

First and foremost, thanks for this great platform (RFSoC4x2) and framework (PYNQ)!

While running some tests (base overlay), I came across the software-enforced maximum transaction length of 32768 samples per DMA transfer (i.e. base.radio.receiver.channel[i].transfer(number_samples)). This is quite limiting considering that the PS has 4 GB of DRAM and the frontend samples at rates of up to 5 GSPS, which leaves a capture window of only about 6.6 µs (with the DDC bypassed).
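
For reference, the arithmetic behind that number:

    # Capture window imposed by the 32768-sample limit at the full rate.
    samples = 32768
    rate = 5e9                  # 5 GSPS with the DDC bypassed
    print(samples / rate)       # ~6.55e-06 s, i.e. roughly 6.6 microseconds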

I first thought this was a Xilinx DMA IP limitation (as 2^15 = 32768), but it turns out the block is actually configured for transfer lengths of up to 2^26 (see screenshot below, for the current master of this repo).

Furthermore, removing the software input check allows successful transfers of up to 37000 samples (note that log2(37000) is not an integer), with a failure at 38000. Is there an easy way of lifting this limitation and offering transfer lengths of at least up to 4 GB? I couldn't figure out so far why it stops working at 38000 samples, since the packet generator uses 32-bit integer fields (Packet_Generator.vhd#L58) and thus should be able to handle transfer lengths of up to 2^31.

Any hints towards this matter would be appreciated.

[screenshot: AXI DMA IP configuration showing a maximum transfer length of 2^26]

nathanjachimiec commented 1 year ago

Hi Andreas, I am currently using the DMA (axi_dma) to transfer data from an rfdc AXI stream to the PL-DRAM and have been transferring 2 MB payloads successfully, though I should be able to transfer 256 MB per descriptor. First I used an AXI stream width converter to convert my ADC stream to match the maximum bus width of the PL DRAM MIG (256 bits), and attached a shallow FIFO to the DMA's S2MM port. From your settings, I would recommend matching bit-widths as closely as possible to align with the DRAM. If your input stream overflows or underflows the DMA, the transfer will "error" out.

The axi_dma also needs a TLAST signal so it knows the end of the transfer. Forums and other documentation make it sound grim that you "must have TLAST", but you can leave TLAST tied low and stream the requested write length; however, you won't be able to advance to the next descriptor or start a new one. To return to a normal state, you have to soft-reset the axi_dma. So yes, it is in effect now just a data-mover.

Also, verify that the address "allocate" provides you points to a physical address in the DRAM range. You likely need to first provide a device tree overlay (dtbo) that specifies the PL-DRAM so it can be properly allocated. Hopefully this can get you pointed in the right direction for now.
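
Checking the physical address is straightforward from PYNQ; a minimal sketch (the shape and dtype are placeholders):

    import numpy as np
    from pynq import allocate

    # Allocate a small test buffer and verify where it actually lives.
    buf = allocate(shape=(1024,), dtype=np.uint32)
    print(hex(buf.physical_address))  # should fall inside the expected DRAM range
    buf.freebuffer()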

andreaskuster commented 1 year ago

Hi @nathanjachimiec

Thank you for all the hints and sharing your experience!

I will share my experience and findings from working through this issue here, to provide hints and a starting point for others, or to help resolve this once and for all.

1) Re-synthesizing the design with the DMA length register set to 26 bits, and commenting out the transfer-size check in the transfer() function in rfsystem/hierarchies.py, allowed me to increase the transfer length to 16M samples, i.e. a 32 MiB transfer from the RFDC to the PL-RAM. The original function is shown below for reference.

    def transfer(self, packetsize):
        """Returns a numpy array with inspected data of length packetsize.

        Relies on module-level imports of numpy as np and pynq.allocate.
        """
        # The packet generator counts in words of 8 samples each.
        transfersize = int(np.ceil(packetsize/8))
        # Software-enforced limit: 4096 words * 8 samples = 32768 samples.
        # This is the check that was commented out for the experiment above.
        if transfersize > 4096 or transfersize < 2:
            raise ValueError(
                'Packet size incorrect, should be in range 16 to 32768.')
        self._pgen.packetsize = transfersize
        # Fresh contiguous buffers are allocated on every call (see point 2).
        buffer_re = allocate(shape=(transfersize*8,), dtype=np.int16)
        buffer_im = allocate(shape=(transfersize*8,), dtype=np.int16)
        # Arm both DMA receive channels, then trigger the packet generator.
        self._dma_real.recvchannel.transfer(buffer_re)
        self._dma_imag.recvchannel.transfer(buffer_im)
        self._pgen.transfer = 1
        self._dma_real.recvchannel.wait()
        self._dma_imag.recvchannel.wait()
        self._pgen.transfer = 0
        # Scale the int16 samples to [-1, 1) and release the buffers.
        re_data = np.array(buffer_re) * 2**-15
        im_data = np.array(buffer_im) * 2**-15
        buffer_re.freebuffer()
        buffer_im.freebuffer()
        # Combine into complex doubles and trim to the requested length.
        c_data = re_data.astype('double') + 1j * im_data.astype('double')
        return c_data[0:packetsize]
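
If, instead of deleting the check outright, you want a guard that matches the re-synthesized hardware, a minimal sketch (the 2**26 constant assumes the buffer length register was widened to 26 bits; MAX_BYTES is an illustrative name of mine):

    MAX_BYTES = 2**26  # 26-bit buffer length register -> 64 MiB per DMA transfer
    transfersize = int(np.ceil(packetsize/8))
    # Each channel moves transfersize*8 int16 samples, i.e. transfersize*16 bytes.
    if transfersize * 16 > MAX_BYTES or transfersize < 2:
        raise ValueError('Packet size out of range for the re-synthesized DMA.')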

2) Every time the transfer() function is called, contiguous memory buffers are freshly allocated. For buffers as large as in my case, on a system whose memory fragments over time, this eventually makes it impossible to find a contiguous memory chunk. There are three ways of resolving this:
    a) modifying the transfer function to allow buffer re-use (see the sketch below),
    b) using the pynq.xlnk module, which allows allocating contiguous memory from the reserved CMA address space, or
    c) using PL-DRAM.
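
A minimal sketch of option a), caching the buffers across calls (the _buffer_re/_buffer_im attributes are illustrative names of mine, not from the repo):

    def transfer(self, packetsize):
        transfersize = int(np.ceil(packetsize/8))
        nsamples = transfersize * 8
        # Allocate once and re-use the buffers while the requested size matches.
        if getattr(self, '_buffer_re', None) is None or \
                len(self._buffer_re) != nsamples:
            self._buffer_re = allocate(shape=(nsamples,), dtype=np.int16)
            self._buffer_im = allocate(shape=(nsamples,), dtype=np.int16)
        # ... proceed as in the original function, but without freeing
        # the buffers at the end ...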

3) For people like me who want to transfer > 64 MiB of data, the Xilinx DMA block seems to be a dead end: its buffer length register is at most 26 bits wide, i.e. 2^26 bytes (64 MiB) per transfer. Furthermore, having to reset the DMA after a transfer because of the missing tlast seems suboptimal at best, and might block the usage of the cyclic functionality (see Cyclic BD Enable in the Xilinx AXI DMA documentation). Moreover, I am not sure from the documentation whether the cyclic mode would be fast enough to keep up with the RFDC line rate.

4) Thus, the solution I implemented, which solved both problems from 3), was the integration of an external DMA IP. I packaged the DMA controller from github.com/alexforencich/verilog-axi as an IP and integrated it into my design.

xiangumass commented 5 months ago

Hi @andreaskuster, thank you for sharing the tips and your experience. I am also implementing a design that writes data from the full-speed RF ADC (5 GSPS) to the PL DDR on an RFSoC 4x2. I want to write 300 MB to the DDR at that speed. The output interface of the RF ADC is AXI-Stream and the input interface of the PL DDR is AXI, so there must be a converter to move the data between them. I can hardly find a converter (AXI-Stream -> AXI) that supports this data rate. Based on your experience, would you suggest using the DMA IP you linked, or did you find any other solutions? Thank you for your help!

andreaskuster commented 5 months ago

Hi @xiangumass

Thank you for reaching out, and I am glad to see that others are getting their hands dirty with low-level RFSoC4x2 modifications.

Well, the AXI DMA v7.1 block supports AXI-Stream to AXI (memory-mapped) conversion, and to the best of my knowledge also at line rate. However, you will run into the same issue as I did, since this block unfortunately does not support transfer lengths of 300 MB (its buffer length register maxes out at 26 bits, i.e. 64 MiB per transfer).

Yes, I suggest you go with the AXI DMA block from @alexforencich or similar. With his implementation, custom-tailored for control from PYNQ, I can capture several GBs of ADC samples at a 5 GSPS rate.
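
As a rough illustration of what "control from PYNQ" means here (everything below is hypothetical: the verilog-axi DMA core does not ship with an AXI-Lite register file, so this assumes a custom control wrapper, and the base address and register offsets are invented):

    from pynq import MMIO

    DMA_BASE = 0xA0000000        # hypothetical base address from the address editor
    dma = MMIO(DMA_BASE, 0x1000)

    # Hypothetical register map of a custom descriptor/control wrapper.
    dma.write(0x00, 0x80000000)  # destination address in PL-DRAM
    dma.write(0x08, 1 << 30)     # transfer length in bytes (1 GiB here)
    dma.write(0x10, 1)           # 'start' bit
    while not (dma.read(0x14) & 0x1):
        pass                     # poll a hypothetical 'done' flag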

Unfortunately, this is still unpublished research, but feel free to reach out to me by email for more information.

xiangumass commented 5 months ago

Thank you for your detailed suggestion, @andreaskuster.
Your post is probably the most relevant one to the design I am working on, and I am glad to see others doing something similar; that you have successfully captured GBs at 5 GSPS is amazing. Yes, I think the AXI DMA from Xilinx can only support 64 MB at most. I did check the DMA block from @alexforencich and am going to try it next. In particular, I notice it can be configured to disable TLAST. I really appreciate your help and will confirm with you by email whether my understanding of the DMA block from @alexforencich is correct. My email is lx1993829 [at] gmail.com, by the way.