calyxir / calyx

Intermediate Language (IL) for Hardware Accelerator Generators
https://calyxir.org
MIT License
465 stars 47 forks source link

`mult_pipe` not inferred correctly pipelined for DSP inference #1175

Open rachitnigam opened 1 year ago

rachitnigam commented 1 year ago

When running resource estimation, the main_drc_routed.rpt file provides the following warning:

DPOP-3#1 Warning PREG Output pipelining DSP mult_pipe0/comp/out_tmp00 output mult_pipe0/comp/out_tmp00/P[47:0] is not pipelined (PREG=0). Pipelining the DSP48 output will improve performance and often saves power so it is suggested whenever possible to fully pipeline this function. If this DSP48 function was inferred, it is suggested to describe an additional register stage after this function. If the DSP48 was instantiated in the design, it is suggested to set the PREG attribute to 1. Related violations:

My guess is that registering the final output causes the implementation to be quite the same pipeline as a DSP48 resulting in the warning. A possible solution is adding another register in the pipeline so that the first three registers in the mult_pipe implementation are fully pipelined while the last one is responsible for sustaining the output after the done signal is set to high.

sampsyo commented 1 year ago

Interesting! Yeah, certainly worth seeing if tacking on that additional register just solves it. We could also consider an explicit (Xilinx-specific) instantiation, as the warning messages suggests.

rachitnigam commented 1 year ago

Might be interested to @andrewb1999 too

andrewb1999 commented 1 year ago

Yeah I'm planning on looking at this eventually, should be a straight forward fix.

rachitnigam commented 1 year ago

Weirdly, even a Filament generated design seems to have this problem? Would be useful to figure out what behavioral RTL would generate the correct code or if we need to use something like Vivado's IP generator in the backend to make it work

andrewb1999 commented 1 year ago

Out of curiosity, I checked what Vitis HLS does. Here is an example multiplier generated by HLS that infers a DSP:

module mv_mul_32s_32s_32_5_1(clk, ce, reset, din0, din1, dout);
parameter ID = 1;
parameter NUM_STAGE = 4;
parameter din0_WIDTH = 14;
parameter din1_WIDTH = 12;
parameter dout_WIDTH = 26;
input clk;
input ce;
input reset;
input[din0_WIDTH - 1 : 0] din0; 
input[din1_WIDTH - 1 : 0] din1; 
output[dout_WIDTH - 1 : 0] dout;

reg signed [din0_WIDTH - 1 : 0] a_reg0;
reg signed [din1_WIDTH - 1 : 0] b_reg0;
wire signed [dout_WIDTH - 1 : 0] tmp_product;
reg signed [dout_WIDTH - 1 : 0] buff0;
reg signed [dout_WIDTH - 1 : 0] buff1;
reg signed [dout_WIDTH - 1 : 0] buff2;

assign dout = buff2;
assign tmp_product = a_reg0 * b_reg0;
always @ (posedge clk) begin
    if (ce) begin
        a_reg0 <= din0;
        b_reg0 <= din1;
        buff0 <= tmp_product;
        buff1 <= buff0;
        buff2 <= buff1;
    end
end
endmodule

Surprisingly enough it took setting a pretty high frequency to get the tool to actually generate that. With a 100MHz target the tool only generates a single cycle multiplier (that still maps to a DSP), or even a combinational multiplier on a large ultrascale+ device with a low frequency target. The multiplier here hits nearly 500MHz on an ultrascale+ device.

rachitnigam commented 1 year ago

Wow, that’s crazy. This would indicate we should switch our mult implementation to be 4 cycles.

Also, @andrewb1999 and I discussed the possibility to using the mult latency insensitive interface as a “virtual operation” and backing it up with different physical implementations depending on target device (in the style of #1151).

rachitnigam commented 1 year ago

Wow, that’s crazy. This would indicate we should switch our mult implementation to be 4 cycles.

Also, @andrewb1999 and I discussed the possibility to using the mult latency insensitive interface as a “virtual operation” and backing it up with different physical implementations depending on target device (in the style of #1151).

sampsyo commented 1 year ago

Nice! Indeed, some manner of abstraction over different implementations here would be super duper cool. But as @rachitnigam has said in the past, this is much easier to imagine if we stay in the realm of sequential components and don't try to abstract over the combinational/sequential divide.