Open rachitnigam opened 1 year ago
Interesting! Yeah, certainly worth seeing if tacking on that additional register just solves it. We could also consider an explicit (Xilinx-specific) instantiation, as the warning messages suggests.
Might be interested to @andrewb1999 too
Yeah I'm planning on looking at this eventually, should be a straight forward fix.
Weirdly, even a Filament generated design seems to have this problem? Would be useful to figure out what behavioral RTL would generate the correct code or if we need to use something like Vivado's IP generator in the backend to make it work
Out of curiosity, I checked what Vitis HLS does. Here is an example multiplier generated by HLS that infers a DSP:
module mv_mul_32s_32s_32_5_1(clk, ce, reset, din0, din1, dout);
parameter ID = 1;
parameter NUM_STAGE = 4;
parameter din0_WIDTH = 14;
parameter din1_WIDTH = 12;
parameter dout_WIDTH = 26;
input clk;
input ce;
input reset;
input[din0_WIDTH - 1 : 0] din0;
input[din1_WIDTH - 1 : 0] din1;
output[dout_WIDTH - 1 : 0] dout;
reg signed [din0_WIDTH - 1 : 0] a_reg0;
reg signed [din1_WIDTH - 1 : 0] b_reg0;
wire signed [dout_WIDTH - 1 : 0] tmp_product;
reg signed [dout_WIDTH - 1 : 0] buff0;
reg signed [dout_WIDTH - 1 : 0] buff1;
reg signed [dout_WIDTH - 1 : 0] buff2;
assign dout = buff2;
assign tmp_product = a_reg0 * b_reg0;
always @ (posedge clk) begin
if (ce) begin
a_reg0 <= din0;
b_reg0 <= din1;
buff0 <= tmp_product;
buff1 <= buff0;
buff2 <= buff1;
end
end
endmodule
Surprisingly enough it took setting a pretty high frequency to get the tool to actually generate that. With a 100MHz target the tool only generates a single cycle multiplier (that still maps to a DSP), or even a combinational multiplier on a large ultrascale+ device with a low frequency target. The multiplier here hits nearly 500MHz on an ultrascale+ device.
Wow, that’s crazy. This would indicate we should switch our mult implementation to be 4 cycles.
Also, @andrewb1999 and I discussed the possibility to using the mult latency insensitive interface as a “virtual operation” and backing it up with different physical implementations depending on target device (in the style of #1151).
Wow, that’s crazy. This would indicate we should switch our mult implementation to be 4 cycles.
Also, @andrewb1999 and I discussed the possibility to using the mult latency insensitive interface as a “virtual operation” and backing it up with different physical implementations depending on target device (in the style of #1151).
Nice! Indeed, some manner of abstraction over different implementations here would be super duper cool. But as @rachitnigam has said in the past, this is much easier to imagine if we stay in the realm of sequential components and don't try to abstract over the combinational/sequential divide.
When running resource estimation, the
main_drc_routed.rpt
file provides the following warning:My guess is that registering the final output causes the implementation to be quite the same pipeline as a DSP48 resulting in the warning. A possible solution is adding another register in the pipeline so that the first three registers in the
mult_pipe
implementation are fully pipelined while the last one is responsible for sustaining the output after thedone
signal is set to high.