Efficient narrow AXI write in HLS

lastweek commented 5 years ago

Hi,

Thank you for sharing this code, it's very informative and useful. I was wondering whether you guys ever needed to do narrow AXI writes, and if so, how you do them efficiently.

The reason I'm asking is that I found the HLS compiler is too dumb to recognize certain patterns. For example, it can't figure out that nr * 8 - 1 always has its low three bits set (0b111), i.e. the range always ends on a byte boundary, so the write strobe is NOT used. Instead, the generated code is an AXI read followed by an AXI write.

void foo(ap_uint<64> *dram, int nr)
{
    ap_uint<64> tmp;
    dram[0](nr * 8 - 1, 0) = tmp(nr * 8 - 1, 0);
}
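
For reference, the AXI write strobe (WSTRB) carries one bit per byte lane, so writing the low nr bytes of a 64-bit word corresponds to a strobe with the low nr bits set. The sketch below is a plain C++ software model of that idea, not HLS code and not what the HLS compiler generates; the helper names strobe_for_low_bytes and strobed_write are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Byte-enable mask for writing the low `nr` bytes of a 64-bit word.
// One strobe bit per byte lane, as in AXI WSTRB. Valid for 0 <= nr <= 8.
inline uint8_t strobe_for_low_bytes(int nr) {
    assert(nr >= 0 && nr <= 8);
    return static_cast<uint8_t>((1u << nr) - 1u);
}

// Model of what a memory that honors strobes does with one write beat:
// only byte lanes whose strobe bit is set are updated.
inline uint64_t strobed_write(uint64_t old_word, uint64_t data, uint8_t strb) {
    uint64_t mask = 0;
    for (int lane = 0; lane < 8; ++lane) {
        if (strb & (1u << lane)) {
            mask |= 0xFFull << (8 * lane);
        }
    }
    return (old_word & ~mask) | (data & mask);
}
```

Because the merge happens on the memory side, a master that can drive the strobe does not need to read the old word first, which is the read the generated code above performs.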

My current workaround is a big switch case like:

switch (nr) {
case 1: dram[0](7,0) = tmp(7, 0); break;
case 2: dram[0](15,0) = tmp(15,0); break;
...
}

But this produces too much unnecessary logic. Have you guys dealt with a similar issue before? Thank you.

haggaie commented 5 years ago

Well, we tried using HLS-generated AXI4-MM for DDR access on our board. We only used aligned accesses, so we didn't encounter this issue, but I can see your point. I'm not sure whether the DDR controller we use supports strobed writes.

Either way, we found that the HLS-generated AXI4 master limited the throughput even with aligned accesses, as it limited the number of outstanding memory reads for some reason. We replaced the generated interface with an asynchronous interface: a simplified version of the 5 AXI4-Stream interfaces that generate requests and responses to memory, which we tie into an AXI4-MM interface in a Verilog wrapper. Using this method would allow us to send custom strobe values as well, but as I wrote above we haven't needed it so far.

lastweek commented 5 years ago

Thanks for sharing.

Regarding "a simplified version of the 5 AXI4-Stream interfaces that generate requests and responses to memory, which we tie into an AXI4-MM interface in a Verilog wrapper": is it something like a middleman between the HLS IP and the memory controller? The AXI-Stream will have some predefined format (header + data), and this Verilog wrapper will parse the AXI-Stream packets and then generate AXI-MM requests to the memory controller?

haggaie commented 5 years ago

Yes, that's more or less it. We used something along the lines of this interface:

template <size_t _interface_width>
class memory {
public:
    enum {
        interface_width = _interface_width,
    };
    typedef ap_uint<512> value_t;
    typedef ap_uint<interface_width - 6> index_t;

    hls::stream<index_t> ar;
    hls::stream<value_t> r;
    hls::stream<index_t> aw;
    hls::stream<value_t> w;
    hls::stream<bool> b;
};

We only did fully aligned 512-bit reads and writes, so it is easy for the Verilog code to add the necessary auxiliary signals and make it an AXI4-MM interface.
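
To illustrate how the five streams decouple requests from responses, here is a minimal software model in plain C++. It is a sketch under stated assumptions, not the ntl implementation: std::queue stands in for hls::stream, a 64-byte std::array stands in for ap_uint<512>, and the service function is a hypothetical stand-in for the Verilog wrapper plus DDR. The index is assumed to be the byte address with its low 6 bits dropped (512 bits = 64 bytes = 2^6), which would match the interface_width - 6 in index_t.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>
#include <queue>

// Software stand-ins (assumptions): std::queue replaces hls::stream and a
// 64-byte array replaces ap_uint<512>.
using value_t = std::array<uint8_t, 64>;
using index_t = uint64_t;  // index of a 64-byte line

struct memory_model {
    std::queue<index_t> ar;  // read requests
    std::queue<value_t> r;   // read responses
    std::queue<index_t> aw;  // write addresses
    std::queue<value_t> w;   // write data
    std::queue<bool> b;      // write acknowledgements
};

// Convert a 64-byte-aligned byte address to a line index.
inline index_t line_index(uint64_t byte_addr) {
    assert(byte_addr % 64 == 0);
    return byte_addr >> 6;
}

// Hypothetical stand-in for the wrapper and memory: drains all pending
// requests and pushes the responses in request order.
inline void service(memory_model& m, std::map<index_t, value_t>& dram) {
    while (!m.aw.empty() && !m.w.empty()) {
        dram[m.aw.front()] = m.w.front();
        m.aw.pop();
        m.w.pop();
        m.b.push(true);
    }
    while (!m.ar.empty()) {
        m.r.push(dram[m.ar.front()]);  // lines never written read as zero
        m.ar.pop();
    }
}
```

The user side can push several requests into ar before popping anything from r, so the number of outstanding reads is bounded only by the queue depths rather than by the interface logic.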

lastweek commented 5 years ago

Just curious: if I use Innova to handle some IB requests, accesses that are only partially aligned to 512 bits may happen, right? Or is that just a case Innova avoids?

haggaie commented 5 years ago

The restriction to use only 512-bit aligned writes comes from the application we implemented, which is a key value store for small keys and values. We chose to implement it using fully aligned accesses to simplify the implementation.

The Innova card we used doesn't support InfiniBand, but it does support RDMA with RoCE. Is that what you meant? In any case, I'm not sure whether or not the DDR controller in the Innova's shell IP supports narrow writes, but it is a property of the DDR controller, and not related to how requests from the network are handled. I tried checking the Innova user manual again, and it says that addresses must be dword aligned, and strobe must be aligned to address. Does that help?
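
As a rough illustration of the two rules quoted from the manual, the following plain C++ check encodes one possible reading of them. Everything specific here is an assumption and not taken from the Innova documentation: that a dword is 4 bytes, that the data bus is 64 bytes wide, and that "strobe aligned to address" means the lowest enabled byte lane is the lane the address points at. The name narrow_write_allowed is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// ASSUMPTIONS: a dword is 4 bytes and the data bus is 64 bytes (512 bits).
constexpr unsigned kDwordBytes = 4;
constexpr unsigned kBusBytes = 64;

// One possible reading of the two rules (not confirmed by the manual):
//  1. the byte address is a multiple of 4;
//  2. the lowest enabled byte lane in the strobe is the lane addressed.
inline bool narrow_write_allowed(uint64_t byte_addr, uint64_t strb) {
    if (byte_addr % kDwordBytes != 0) return false;
    if (strb == 0) return false;
    unsigned first_lane = 0;
    while (((strb >> first_lane) & 1u) == 0) ++first_lane;
    return first_lane == byte_addr % kBusBytes;
}
```

Under this reading, a write at byte address 0x44 with strobe 0xF0 passes, while the same strobe at address 0x42 is rejected because the address is not dword aligned.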

lastweek commented 5 years ago

I see what you mean. Dealing with narrow writes is not something related to RDMA requests, but rather an application-specific behavior. Not having narrow writes truly simplifies the HLS-based implementation; I feel the same way.

Thank you for sharing this with me. We are developing a SmartNIC-like system using a Xilinx FPGA (we are also using Xilinx's MC, which supports narrow strobed AXI writes). Rather than using Innova to leverage an existing RDMA network, we developed our own network stack. There are many reasons for not using Innova; part of it is that we failed to find enough documentation about how exactly Innova works (e.g., where the FPGA takes over, how on-board/host memory is used, etc.). That's another topic.

Thank you @haggaie.

haggaie commented 5 years ago

I'm happy to help. If you have any more questions on the Innova, you can also email me.

lastweek commented 5 years ago

Hi @haggaie,

Your HLS paper and this repo are amazing and so practical. I wish I had seen your repo earlier.

The reason I came back to post here is that these days I've had a very bad experience with AXI-MM in HLS. Originally, I used the native AXI-MM interface in HLS. Then I found it so fragile and so slow that I moved to AXI-Stream + Xilinx Datamover. The Datamover translates AXI-Stream to AXI-MM, so the HLS code is able to pipeline asynchronously. This is quite similar to what you've described above. But today I found the Datamover's performance is not good either: its back-to-back output AXI-MM transactions always have a fixed gap between them.
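
To make the cost of such a fixed gap concrete, here is a small back-of-the-envelope helper in plain C++. It is only an arithmetic sketch: the function name bus_utilization and any numbers passed to it are hypothetical, not measurements from this thread.

```cpp
#include <cassert>

// Fraction of cycles that carry data when every transaction of
// `burst_beats` data beats is followed by `gap_cycles` idle cycles.
inline double bus_utilization(unsigned burst_beats, unsigned gap_cycles) {
    assert(burst_beats > 0);
    return static_cast<double>(burst_beats) / (burst_beats + gap_cycles);
}
```

For example, a gap as long as the transaction itself halves the usable bandwidth, and the shorter the transactions, the more a fixed gap hurts.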

I was wondering whether you guys have tried out the Datamover. Did you decide to implement the 5 AXI-Stream channels because it couldn't provide good performance?

Btw, I think the code for the 5 AXI-Stream channels and the Verilog wrapper you mentioned are not included in the repo. Could you share them by any chance? Thank you.

haggaie commented 5 years ago

> Your HLS paper and this repo are amazing and so practical. I wish I had seen your repo earlier.

Thanks!

> I was wondering whether you guys have tried out the Datamover. Did you decide to implement the 5 AXI-Stream channels because it couldn't provide good performance?

We haven't tried the Datamover. I think I did look at the documentation, but eventually we decided on the simpler solution because we didn't need all the extra features that the Datamover IP provided, like bursts and realignment.

However, I think the Xilinx HLS examples for TCP/IP and memcached did use it, as well as the ETH Zurich TCP/IP stack.

> Btw, I think the code for the 5 AXI-Stream channels and the Verilog wrapper you mentioned are not included in the repo. Could you share them by any chance? Thank you.

Yes, I haven't added them here yet. I'll try to do that soon.

lastweek commented 5 years ago

Right, we tried the Xilinx memcached one, and I found the same issue on the Datamover side: it fails to reach its designed line rate (10 Gbps). Your paper also evaluated memcached; I suppose you guys replaced its Datamover interface with your own AXI-Stream channels?

Thank you for sharing, I will get notifications when you do.

haggaie commented 5 years ago

We wrote our version of memcached from scratch, because it serves a slightly different use-case (it only caches part of the dataset while the Xilinx example implements the full service on the FPGA). We also limited our keys and value sizes for simplicity.

lastweek commented 5 years ago

Makes sense. I also implemented a similar KVS with fixed key/value sizes recently; it is much easier than the variable-size key/value case.

haggaie commented 3 years ago

I believe the original question was answered (you can find the memory interface in ntl (8c2df15c70abd99cc9c7fa8fa84286a0de16960c) and an example of how it was used in Verilog in NICA), so I'm closing this issue. Let me know if you need anything else.

acsl-technion / ntl, issue #1