alexforencich / verilog-pcie

Verilog PCI express components
MIT License

dma_read_desc_status_valid not asserted when requesting memory read length > 8 #24

Open filamoon opened 2 years ago

filamoon commented 2 years ago

My tb is based on verilog-pcie and S10PcieDevice, with max_payload_size=0x1 and max_read_request_size=0x2:

    self.rc = RootComplex()
    self.rc.max_payload_size = 0x1  # 256 bytes
    self.rc.max_read_request_size = 0x2  # 512 bytes

    self.dev = S10PcieDevice()

I found that when I request a DMA memory read of length 8 (m_axis_dma_read_desc_len=0x20), it works fine and the dma_read_desc_status_valid output of dma_if_rd toggles.

But when I increase m_axis_dma_read_desc_len to 0x40, dma_read_desc_status_valid stays stuck at 0.

Here is the log for the two cases. Please comment. Thanks.

=================== memory read length == 8, dma_read_desc_status_valid  asserted =====================
#   3760.00ns INFO     RX frame: S10PcieFrame(data=[0x00000008, 0x010000ff, 0x00000000], parity=[0x0, 0x0, 0x0], func_num=0, vf_num=None, bar_range=0, err=0)
#   3772.54ns INFO     Memory read, address 0x00000000, length 8, BE 0xf/0xf, tag 0
#   3789.14ns INFO     TX frame: S10PcieFrame(data=[0x4a000008, 0x00000020, 0x01000000, 0x00080000, 0x00000000, 0x00000800, 0x00000000, 0x00080800, 0x00000000, 0x00000800, 0x00000000], parity=[0x6, 0xe, 0x7, 0xb, 0xf, 0xd, 0xf, 0x9, 0xf, 0xd, 0xf], func_num=0, vf_num=None, bar_range=0, err=0)
#   3864.00ns INFO     RX frame: S10PcieFrame(data=[0x00000009, 0x0100013c, 0x00000000], parity=[0x0, 0x0, 0x0], func_num=0, vf_num=None, bar_range=0, err=0)
#   3876.54ns INFO     Memory read, address 0x00000000, length 9, BE 0xc/0x3, tag 1
#   3893.65ns INFO     TX frame: S10PcieFrame(data=[0x4a000009, 0x00000020, 0x01000102, 0x00080000, 0x00000000, 0x00000800, 0x00000000, 0x00080800, 0x00000000, 0x00000800, 0x00000000, 0x00081000], parity=[0x7, 0xe, 0x4, 0xb, 0xf, 0xd, 0xf, 0x9, 0xf, 0xd, 0xf, 0x9], func_num=0, vf_num=None, bar_range=0, err=0)
#   3972.00ns INFO     RX frame: S10PcieFrame(data=[0x00000008, 0x010002ff, 0x00000004], parity=[0x0, 0x0, 0x0], func_num=0, vf_num=None, bar_range=0, err=0)
#   3984.54ns INFO     Memory read, address 0x00000004, length 8, BE 0xf/0xf, tag 2
#   4001.14ns INFO     TX frame: S10PcieFrame(data=[0x4a000008, 0x00000020, 0x01000204, 0x00000000, 0x00000800, 0x00000000, 0x00080800, 0x00000000, 0x00000800, 0x00000000, 0x00081000], parity=[0x6, 0xe, 0x4, 0xf, 0xd, 0xf, 0x9, 0xf, 0xd, 0xf, 0x9], func_num=0, vf_num=None, bar_range=0, err=0)
#   4076.00ns INFO     RX frame: S10PcieFrame(data=[0x00000009, 0x0100033c, 0x00000004], parity=[0x0, 0x0, 0x0], func_num=0, vf_num=None, bar_range=0, err=0)
#   4088.54ns INFO     Memory read, address 0x00000004, length 9, BE 0xc/0x3, tag 3

=================== memory read length == 16, dma_read_desc_status_valid  not asserted =====================
#   3760.00ns INFO     RX frame: S10PcieFrame(data=[0x00000010, 0x010000ff, 0x00000000], parity=[0x0, 0x0, 0x0], func_num=0, vf_num=None, bar_range=0, err=0)
#   3772.54ns INFO     Memory read, address 0x00000000, length 16, BE 0xf/0xf, tag 0
#   3793.20ns INFO     TX frame: S10PcieFrame(data=[0x4a000010, 0x00000040, 0x01000000, 0x00080000, 0x00000000, 0x00000800, 0x00000000, 0x00080800, 0x00000000, 0x00000800, 0x00000000, 0x00081000, 0x00000000, 0x00000800, 0x00000000, 0x00081800, 0x00000000, 0x00000800, 0x00000000], parity=[0x6, 0xe, 0x7, 0xb, 0xf, 0xd, 0xf, 0x9, 0xf, 0xd, 0xf, 0x9, 0xf, 0xd, 0xf, 0xb, 0xf, 0xd, 0xf], func_num=0, vf_num=None, bar_range=0, err=0)
alexforencich commented 2 years ago

I'm going to need to see a waveform dump, otherwise I have no idea what's going on inside the DMA engine. Anything that I can open in gtkwave is fine (vcd, lxt, fst, etc.). But, extrapolating from your previous question, I'm assuming you're not using the dma_psdpram module on the other end, and hence you may be generating the write done signals incorrectly. You need to return a write done pulse for every write operation completed, for every segment. If the DMA engine doesn't see all of the write done indications that it's expecting, it will hang and will not indicate that the operation has completed.
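To make the hang mechanism concrete, here is a toy Python model (illustrative only, not the actual verilog-pcie RTL or its port names) of that bookkeeping: completion status can only be reported after a done pulse has been received for every segment write that was issued.

```python
class DmaWriteTracker:
    """Toy model of the done-counting behavior described above."""

    def __init__(self):
        self.outstanding = 0       # segment writes issued but not yet acknowledged
        self.status_valid = False  # models dma_read_desc_status_valid

    def issue_writes(self, count):
        # the DMA engine issues `count` segment writes for a transfer
        self.outstanding += count

    def write_done(self):
        # the RAM side must return exactly one done pulse per segment write
        self.outstanding -= 1
        if self.outstanding == 0:
            self.status_valid = True
```

If the RAM side only ever acknowledges writes on segment 0, `outstanding` never reaches zero and `status_valid` never asserts, which matches the observed symptom.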

filamoon commented 2 years ago

Thanks a lot for the quick reply! I didn't understand how the segmented RAM interface works and only hooked up RAM segment 0. Another question: what is the purpose of ram_rd_cmd_sel/ram_wr_cmd_sel? Is it similar to AXI awid/arid?

alexforencich commented 2 years ago

So the idea with the select signal is to separate addressing from routing. In an AXI/AXI-Lite interconnect, the address lines are used both to select which device you're talking to (routing) and to determine which register/memory location on that device you're talking to (addressing). This is fine if you want a nice, flat address space where one or more devices can communicate with several peripherals/memories. But for the DMA engine, you have a bunch of entities with their own local scratchpad RAMs that issue operations to the DMA engine targeting their own RAMs. If routing and addressing are kept separate, the address can be preserved while the select signal is built up as requests from multiple devices are merged, and then used for routing on the return path for the actual memory read/write operations. And all of this can be configured automatically with Verilog parameters (see the mux and demux modules).

Sure, it would be possible to do this with a normal address space, but it would be more of a pain to get everything set up correctly, and the effect would be the same: you would probably just use the upper N address lines exactly the way the select signal is wired up, except since they're concatenated onto the address it's more complicated to manage.

So, from the standpoint of using the DMA engine: generally all you should have to do is connect the select lines between the muxes and the DMA IF module, you don't have to drive any particular value on the select line at the "edge" where you make the request. For example, take a look at Corundum: the transmit and receive engines that are issuing DMA read and write operations don't even have ports for the select signal since the interconnect components will route all of the memory operations to the associated scratchpad RAM automatically. Likewise, the dma_psdpram module does not have a port for the select signal, because by the time the memory operation arrives at the RAM, there is no more routing to be done.
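A minimal Python sketch of that routing scheme (function and field names are illustrative, not the RTL port names): each mux level prepends its input port index onto the select field on the way toward the DMA interface, and the matching demux on the return path pops it back off to route the response, while the address passes through untouched.

```python
def mux(port, req, port_bits=1):
    """Merge point: prepend the input port index onto the select field."""
    return {**req, "sel": (req.get("sel", 0) << port_bits) | port}

def demux(resp, port_bits=1):
    """Return path: pop the low select bits to pick the output port."""
    port = resp["sel"] & ((1 << port_bits) - 1)
    return port, {**resp, "sel": resp["sel"] >> port_bits}

# A request passes through two mux levels on its way to the DMA IF module;
# the requester itself never drives a meaningful value on sel.
req = mux(1, mux(0, {"addr": 0x1234, "sel": 0}))

# The response is routed back by peeling the select bits off again, level
# by level; the address survives unchanged for the actual RAM access.
port_outer, resp = demux(req)   # selects output port 1 at the outer demux
port_inner, resp = demux(resp)  # selects output port 0 at the inner demux
```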

filamoon commented 2 years ago

Got it. Another question: assuming a PCIe data width of 256 bits, if only 256-bit-aligned memory I/O is performed, is it OK to set RAM_SEG_COUNT = 1?

alexforencich commented 2 years ago

No, there must be at least 2 segments for the DMA engine to work correctly.

alexforencich commented 2 years ago

Also, in general, do not expect full-width, aligned transfers on the segmented interface. All the DMA engine is doing is taking the PCIe payloads, shifting them, and converting them to writes on the segmented interface. This will be affected by how the host generates the completion TLPs. If you need the data to be in contiguous blocks, use the DMA client modules to read it out of the segmented RAM after the whole DMA read operation is complete. Currently I only have client modules for AXI stream, but I'm planning on creating DMA clients for AXI, both master and slave, based on the ones for AXI stream. Or you can write your own DMA client to read from the segmented RAM.

chenbo-again commented 1 year ago

> No, there must be at least 2 segments for the DMA engine to work correctly.

Hi Alex. I wonder why there must be at least 2 segments for the DMA engine to work. It's hard to work out from the DMA code why "The segmented memory interface is a better 'impedance match' to the PCIe hard core interface". Is there any documentation about the data re-alignment problem in PCIe hard cores?

alexforencich commented 1 year ago

It's not the hard core that has to deal with the alignment, it's the DMA engine. The need for segments is specifically to deal with wrap-around. Although, I suppose it's worth mentioning that PCIe hard cores can also have segmented interfaces themselves, which further compounds the problem.

The gist of it is this: the data in PCIe TLP payloads is DWORD-aligned, while the data in the internal RAM is address-aligned. So at some point, the data has to be shifted into the correct alignment. When the data is shifted, you then have to deal with data wrapping off the end of the transfer cycle. Without a segmented interface, an additional transfer cycle is required for every packet where this wrap-around occurs. With a (double-width) segmented interface, the data simply shifts onto the other segment, so there is no extra cycle and you retain 100% throughput.

For example, take a 128-bit bus (16 bytes). If that bus carries 16 bytes of data that need to be written to address 0x000A, then you need to shift the data by 10 bytes, and you'll have 6 bytes to write in the word at address 0x0000 and 10 bytes to write at 0x0010. Without a segmented interface, that means you'll need to perform two writes on two different clock cycles, one to address 0x0000 and one to address 0x0010. But if you set up a segmented interface correctly such that address bit 4 selects the segment, those accesses will always hit different segments, so they can be issued in the same clock cycle.
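That worked example can be checked with a short Python sketch (the helper name and fixed 16-byte word size are illustrative):

```python
def split_write(addr, length, word_size=16):
    """Split a byte-addressed write into per-word sub-writes on a bus with
    16-byte words, where the word-address LSB (address bit 4 here) selects
    the segment in a two-segment interface."""
    writes = []
    while length:
        n = min(length, word_size - (addr % word_size))  # bytes left in this word
        seg = (addr // word_size) & 1                    # segment select bit
        writes.append((addr, n, seg))
        addr += n
        length -= n
    return writes

# 16 bytes written to 0x000A: 6 bytes land in the word at 0x0000 (segment 0)
# and 10 bytes in the word at 0x0010 (segment 1), so both sub-writes can be
# issued in the same clock cycle.
print(split_write(0x000A, 16))  # [(10, 6, 0), (16, 10, 1)]
```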

The hit to throughput is particularly bad for PCIe specifically, for several reasons. First is that PCIe tends to have small packets and wide interfaces, with PCIe HIPs generally supporting 256 or 512 bit interfaces, and most systems only supporting 128 or 256 byte TLP payloads. Another reason is that the max TLP payload size is a power of two, so PCIe TLPs are effectively guaranteed to "pack" the full interface width (which is also a power of 2), meaning that any shift at all is going to result in taking the throughput hit. And I suppose I can also mention that, at least for Corundum, network packets tend to be written into memory with some headroom offset into each page, and the effect of this is that 100% of the transfers will need to be shifted (as opposed to the case where addresses and sizes are more random and the shift requirement is merely "high probability" rather than "all transfers", or potentially could be avoided through careful coding). The exact effect on the throughput is going to depend on the size of the transfers and the width of the bus, but if we assume 256 byte TLPs on a 512 bit interface, each TLP requires 256 * 8/512 = 4 cycles, so adding an extra cycle to handle the shifted data results in a throughput of 4/5 = 80%, or a 20% reduction in performance.
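The throughput arithmetic in that last sentence, as a quick Python check (the helper name is illustrative):

```python
def shifted_throughput(tlp_bytes, bus_bits):
    """Fraction of peak throughput when every TLP needs one extra cycle
    for the shifted (wrapped) data, per the reasoning above."""
    cycles = tlp_bytes * 8 // bus_bits  # cycles to move the payload itself
    return cycles / (cycles + 1)        # plus one cycle for the wrap-around

print(shifted_throughput(256, 512))  # 4/5 = 0.8, i.e. a 20% reduction
print(shifted_throughput(128, 512))  # smaller TLPs hurt more: 2/3
```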

alexforencich commented 1 year ago

Also, one of the earlier Corundum developer meetings had a presentation on the DMA engine in this repo: https://youtu.be/lz_r01uvA6s?t=747 (slides are linked in the video description)