Add support for using the DSP blocks on the iCE40UP5K?

mithro commented 6 years ago

The iCE40UP5K has DSP tiles which are described as "DSP blocks with multiply and accumulate functions". They are 16 x 16 Multiply & 32 bit Accumulator Blocks which can be chained together to be a 32x32 multiply block. It would be awesome if the VexRISCV could use these DSP blocks for better performance.

One issue is that Yosys can't infer the ice40 DSP blocks at the moment, so they will have to be manually instantiated.

Dolu1990 commented 6 years ago

There is currently two implementation, the iterative one and the one which use 4 DSP blocks (2x1616 + 2x1716) which fit not too bad into the regular Altera/Xilinx FPGA. (4 DSP blocks implementation there : https://github.com/SpinalHDL/VexRiscv/blob/master/src/main/scala/vexriscv/plugin/MulPlugin.scala#L74)

So, which architecture for the ICE40UP5K ? single DSP blocks reused 4 times ? or 4 DSP blocks running in // ?

I have to check the DSP blocks architecture.

mithro commented 6 years ago

I have no idea what would be the best solution, I think we might just have to try it?

This is a diagram from their documentation;

mithro commented 6 years ago

The iCE40 UltraPlus sysDSP can support the following functions:

8-bit x 8-bit Multiplier
16-bit x 16-bit Multiplier
16-bit Adder/Subtracter
32-bit Adder/Subtracter
16-bit Accumulator
32-bit Accumulator
8-bit x 8-bit Multiply-Accumulate
16-bit x 16-bit Multiply-Accumulate

ghost commented 6 years ago

I have been working on a reconfigurable ALU by blackboxing four DSP48E1 in Artix/Series7. The sysDSP is quite similar. I will share my code soon unless Dolu gets there quicker! I hope to make the ALU as a plugin for Vex.

Dolu1990 commented 5 years ago

I tried the MulPlugin which infer DSP by using the * operator with icecube2, it worked fine, and isn't a frequancy bottleneck. https://github.com/SpinalHDL/VexRiscv/blob/master/src/main/scala/vexriscv/plugin/MulPlugin.scala#L74 I don't currently know the behavior of icestorm on it.

I also tried to use the MulDivIterativePlugin, with as configuration a 4 time unrooled mul (8 cycle per mul, for a consumption of about 4x32 bits adder + 64 bits register, and it worked well, again without creating a fequancy bottleneck.

Else about handcrafting the mul instruction from DSP IP, the best solution i can see is the following : Sharing the logic with the divider to make RS2 unsigned (and flip the RS1 signe if required) which then lead to a signed 33 bits * unsigned 32 bits multiplication Then doing 4 unsigned multiplication between :

RS1(15 downto 0) * RS2(15 downto 0)
RS1(31 downto 16) * RS2(15 downto 0)
RS1(15 downto 0) * RS2(31 downto 16)
RS1(31 downto 16) * RS2(31 downto 16) Sum all of that with the required shift, and conditionaly add RS2 once on the result depending the RS1(33) bit.

There is the functional code without the DSP stuff : https://github.com/SpinalHDL/riscvSoftcoreContest/blob/053bee3f81311efdcdeaac048c07effd49aaae0f/hardware/scala/riscvSoftcoreContest/Up5kPerf.scala#L25

xobs commented 5 years ago

Yosys is able to partially infer DSP blocks now. However, it is not able to infer all of the DSP blocks. See https://github.com/im-tomu/fomu-tests/tree/master/mul-infer for an example.

tomverbeure commented 5 years ago

Splitting up the 32x32 multiplications into 4 16x16 isn't always the best solution. Often the FPGA synthesis tools are able to come up with a more efficient solution.

Here's my MulSimple plugin that does exactly that.

Whether or not it's applicable to this issue depends the abilities of the ICE40 synthesis tools though.

Tom

xobs commented 5 years ago

Splitting up the 32x32 multiplications into 4 16x16 isn't always the best solution. Often the FPGA synthesis tools are able to come up with a more efficient solution.

Here's my MulSimple plugin that does exactly that.

Whether or not it's applicable to this issue depends the abilities of the ICE40 synthesis tools though.

Tom

That doesn't work for Yosys, which isn't able to infer the 16-bit DSP at all:

Info: Device utilisation:
Info:            ICESTORM_LC:  7593/ 5280   143%
Info:           ICESTORM_RAM:    30/   30   100%
Info:                  SB_IO:    12/   96    12%
Info:                  SB_GB:     8/    8   100%
Info:           ICESTORM_PLL:     1/    1   100%
Info:            SB_WARMBOOT:     0/    1     0%
Info:           ICESTORM_DSP:     0/    8     0%
Info:         ICESTORM_HFOSC:     0/    1     0%
Info:         ICESTORM_LFOSC:     0/    1     0%
Info:                 SB_I2C:     0/    2     0%
Info:                 SB_SPI:     0/    2     0%
Info:                 IO_I3C:     0/    2     0%
Info:            SB_LEDDA_IP:     0/    1     0%
Info:            SB_RGBA_DRV:     0/    1     0%
Info:         ICESTORM_SPRAM:     4/    4   100%

tomverbeure commented 5 years ago

Ok, then that's something Yosys will need to add over time. It really depends on the Synthesis tool. For example, on some Altera devices, 4x 16x16 requires more DSP resources and 32x32, even if it should give the same result: both require 4 18x18 multipliers, but when inferred by Quartus, 2 of the 18x18 multipliers are packed into the same DSP, while the individual 16x16 cannot be packed into the same DSP, so the end result is 4 vs 3 DSPs used.

Tom

xobs commented 5 years ago

I've come up with this that works in simulation, comparing it against an inferred multiply:


    wire [0:31] int_lo_lo;
    wire [0:31] int_lo_hi;
    wire [0:31] int_hi_lo;
    wire [0:31] int_hi_hi;

    SB_MAC16 #(
        .NEG_TRIGGER(1'b0),
        .C_REG(1'b0),
        .A_REG(1'b0),
        .B_REG(1'b0),
        .D_REG(1'b0),
        .TOP_8x8_MULT_REG(1'b0),
        .BOT_8x8_MULT_REG(1'b0),
        .PIPELINE_16x16_MULT_REG1(1'b0),
        .PIPELINE_16x16_MULT_REG2(1'b0),
        .TOPOUTPUT_SELECT(2'b11),
        .TOPADDSUB_LOWERINPUT(2'b00),
        .TOPADDSUB_UPPERINPUT(1'b0),
        .TOPADDSUB_CARRYSELECT(2'b00),
        .BOTOUTPUT_SELECT(2'b11),
        .BOTADDSUB_LOWERINPUT(2'b00),
        .BOTADDSUB_UPPERINPUT(1'b0),
        .BOTADDSUB_CARRYSELECT(2'b00),
        .MODE_8x8(1'b0),
        .A_SIGNED(1'b0),
        .B_SIGNED(1'b0)
    ) mul16_lo_lo (
        .A(srca[16:31]),
        .AHOLD(1'b1),
        .B(srcb[16:31]),
        .BHOLD(1'b1),
        .C(16'h0000),
        .D(16'h0000),
        .CHOLD(1'b0),
        .DHOLD(1'b0),

        .IRSTTOP                    (1'b0), //keep hold register in reset
        .IRSTBOT                    (1'b0), //keep hold register in reset
        .ORSTTOP                    (1'b0), //keep hold register in reset
        .ORSTBOT                    (1'b0), //keep hold register in reset
        .OLOADTOP                   (1'b0), //keep unused signals quiet
        .OLOADBOT                   (1'b0), //keep unused signals quiet
        .ADDSUBTOP                  (1'b0), //unused
        .ADDSUBBOT                  (1'b0), //unused
        .OHOLDTOP                   (1'b0), //keep hold register stable
        .OHOLDBOT                   (1'b0), //keep hold register stable

        .CE(1'b1),
        .CLK(1'b0),

        // .CO(1'b0), .ACCUMCO(1'b0), .SIGNEXTOUT(1'b0),
        .CI(1'b0), .ACCUMCI(1'b0), .SIGNEXTIN(1'b0),

        .O(int_lo_lo[0:31])
    );

    SB_MAC16 #(
        .NEG_TRIGGER(1'b0),
        .C_REG(1'b0),
        .A_REG(1'b0),
        .B_REG(1'b0),
        .D_REG(1'b0),
        .TOP_8x8_MULT_REG(1'b0),
        .BOT_8x8_MULT_REG(1'b0),
        .PIPELINE_16x16_MULT_REG1(1'b0),
        .PIPELINE_16x16_MULT_REG2(1'b0),
        .TOPOUTPUT_SELECT(2'b11),
        .TOPADDSUB_LOWERINPUT(2'b00),
        .TOPADDSUB_UPPERINPUT(1'b0),
        .TOPADDSUB_CARRYSELECT(2'b00),
        .BOTOUTPUT_SELECT(2'b11),
        .BOTADDSUB_LOWERINPUT(2'b00),
        .BOTADDSUB_UPPERINPUT(1'b0),
        .BOTADDSUB_CARRYSELECT(2'b00),
        .MODE_8x8(1'b0),
        .A_SIGNED(1'b0),
        .B_SIGNED(1'b0)
    ) mul16_lo_hi (
        .A(srca[16:31]),
        .AHOLD(1'b1),
        .B(srcb[0:15]),
        .BHOLD(1'b1),
        .C(16'h0000),
        .D(16'h0000),
        .CHOLD(1'b0),
        .DHOLD(1'b0),

        .IRSTTOP                    (1'b0), //keep hold register in reset
        .IRSTBOT                    (1'b0), //keep hold register in reset
        .ORSTTOP                    (1'b0), //keep hold register in reset
        .ORSTBOT                    (1'b0), //keep hold register in reset
        .OLOADTOP                   (1'b0), //keep unused signals quiet
        .OLOADBOT                   (1'b0), //keep unused signals quiet
        .ADDSUBTOP                  (1'b0), //unused
        .ADDSUBBOT                  (1'b0), //unused
        .OHOLDTOP                   (1'b0), //keep hold register stable
        .OHOLDBOT                   (1'b0), //keep hold register stable

        .CE(1'b1),
        .CLK(1'b0),

        // .CO(1'b0), .ACCUMCO(1'b0), .SIGNEXTOUT(1'b0),
        .CI(1'b0), .ACCUMCI(1'b0), .SIGNEXTIN(1'b0),
        .O(int_lo_hi[0:31])
    );

    SB_MAC16 #(
        .NEG_TRIGGER(1'b0),
        .C_REG(1'b0),
        .A_REG(1'b0),
        .B_REG(1'b0),
        .D_REG(1'b0),
        .TOP_8x8_MULT_REG(1'b0),
        .BOT_8x8_MULT_REG(1'b0),
        .PIPELINE_16x16_MULT_REG1(1'b0),
        .PIPELINE_16x16_MULT_REG2(1'b0),
        .TOPOUTPUT_SELECT(2'b11),
        .TOPADDSUB_LOWERINPUT(2'b00),
        .TOPADDSUB_UPPERINPUT(1'b0),
        .TOPADDSUB_CARRYSELECT(2'b00),
        .BOTOUTPUT_SELECT(2'b11),
        .BOTADDSUB_LOWERINPUT(2'b00),
        .BOTADDSUB_UPPERINPUT(1'b0),
        .BOTADDSUB_CARRYSELECT(2'b00),
        .MODE_8x8(1'b0),
        .A_SIGNED(1'b0),
        .B_SIGNED(1'b0)
    ) mul16_hi_lo (
        .A(srca[0:15]),
        .AHOLD(1'b1),
        .B(srcb[16:31]),
        .BHOLD(1'b1),
        .C(16'h0000),
        .D(16'h0000),
        .CHOLD(1'b0),
        .DHOLD(1'b0),
        .IRSTTOP                    (1'b0), //keep hold register in reset
        .IRSTBOT                    (1'b0), //keep hold register in reset
        .ORSTTOP                    (1'b0), //keep hold register in reset
        .ORSTBOT                    (1'b0), //keep hold register in reset
        .OLOADTOP                   (1'b0), //keep unused signals quiet
        .OLOADBOT                   (1'b0), //keep unused signals quiet
        .ADDSUBTOP                  (1'b0), //unused
        .ADDSUBBOT                  (1'b0), //unused
        .OHOLDTOP                   (1'b0), //keep hold register stable
        .OHOLDBOT                   (1'b0), //keep hold register stable

        .CE(1'b1),
        .CLK(1'b0),

        // .CO(1'b0), .ACCUMCO(1'b0), .SIGNEXTOUT(1'b0),
        .CI(1'b0), .ACCUMCI(1'b0), .SIGNEXTIN(1'b0),
        .O(int_hi_lo[0:31])
    );

    SB_MAC16 #(
        .NEG_TRIGGER(1'b0),
        .C_REG(1'b0),
        .A_REG(1'b0),
        .B_REG(1'b0),
        .D_REG(1'b0),
        .TOP_8x8_MULT_REG(1'b0),
        .BOT_8x8_MULT_REG(1'b0),
        .PIPELINE_16x16_MULT_REG1(1'b0),
        .PIPELINE_16x16_MULT_REG2(1'b0),
        .TOPOUTPUT_SELECT(2'b11),
        .TOPADDSUB_LOWERINPUT(2'b00),
        .TOPADDSUB_UPPERINPUT(1'b0),
        .TOPADDSUB_CARRYSELECT(2'b00),
        .BOTOUTPUT_SELECT(2'b11),
        .BOTADDSUB_LOWERINPUT(2'b00),
        .BOTADDSUB_UPPERINPUT(1'b0),
        .BOTADDSUB_CARRYSELECT(2'b00),
        .MODE_8x8(1'b0),
        .A_SIGNED(1'b0),
        .B_SIGNED(1'b0)
    ) mul16_hi_hi (
        .A(srca[0:15]),
        .AHOLD(1'b1),
        .B(srcb[0:15]),
        .BHOLD(1'b1),
        .C(16'h0000),
        .D(16'h0000),
        .CHOLD(1'b0),
        .DHOLD(1'b0),

        .IRSTTOP                    (1'b0), //keep hold register in reset
        .IRSTBOT                    (1'b0), //keep hold register in reset
        .ORSTTOP                    (1'b0), //keep hold register in reset
        .ORSTBOT                    (1'b0), //keep hold register in reset
        .OLOADTOP                   (1'b0), //keep unused signals quiet
        .OLOADBOT                   (1'b0), //keep unused signals quiet
        .ADDSUBTOP                  (1'b0), //unused
        .ADDSUBBOT                  (1'b0), //unused
        .OHOLDTOP                   (1'b0), //keep hold register stable
        .OHOLDBOT                   (1'b0), //keep hold register stable

        .CE(1'b1),
        .CLK(1'b0),

        // .CO(1'b0), .ACCUMCO(1'b0), .SIGNEXTOUT(1'b0),
        .CI(1'b0), .ACCUMCI(1'b0), .SIGNEXTIN(1'b0),
        .O(int_hi_hi[0:31])
    );

    reg [0:31] result;
    assign out = result;
    always @(posedge clk)
    begin
        result <= int_lo_lo[0:31] + (int_lo_hi[0:31] << 16) + (int_hi_lo[0:31] << 16);
    end

We never actually use the int_hi_hi, so that could be removed. Curiously, the iCECube synthesis tools also use four SB_MAC16 blocks. I feel like there's something I'm missing here with regards to signed multiplication, but I'm throwing random values at it in simulation and the above block works just fine.

How difficult would it be to add this to VexRiscv?

xobs commented 5 years ago

Incidentally, I'm working on adding support to Yosys to just infer this when there is a 32-bit multiply, so this may become a non-issue.

Dolu1990 commented 5 years ago

Hi, I’m in holiday in a foreign country until friday, then i can check more in depth :) But what’s about MULH ?

xobs commented 5 years ago

Take your time!

Do you have an example of what the Verilog would look like, so I can come up with something that uses the SB_MAC16 blocks to accomplish that?

xobs commented 5 years ago

Actually, apparently Yosys is getting better inferrence for DSP blocks, so this may become a non-issue soon. MulSimple may end up being the correct approach once they have that working.

Dolu1990 commented 5 years ago

@xobs In which context are you using the mul instruction ? I mean, the goal is to get best performances ? Or having a good tradeoff between area and speed ?

xobs commented 5 years ago

The big reason is to gain RV32im support, which seems to be the most common ABI.

Eventually I would like to try getting the self-refilled MMU working as well, but one step at a time.

osresearch commented 3 years ago

I'm also trying to use the iCE40 DSP blocks to replace the multi-step, low-area picorv32 multiplier. This four SB_MAC16 design takes two signed or unsigned 32-bit values and produces a signed 64 bit output that passes the MUL, MULH, MULHU and MULHSU tests, as well as some integration tests using the monocypher crypto library. Routing it uses about the same LUTs as the picorv32_pcpi_mul module, but delivers results in a single clock.

It seems to me that there should be a better way to handle signed multiplication without doing a full 64-bit ~prod + 1 ripple add, although I haven't figured out how to do it yet. Maybe using the ADDSUBTOP and ADDSUBBOT bits to subtract from the accumulator?

module mult16x16(
        input clk,
        input [15:0] a,
        input [15:0] b,
        input [31:0] c,
        output [31:0] prod,
        output carry_out
);
`ifdef IVERILOG
        wire [32:0] ab = a * b;
        wire [32:0] mult = ab + c;
        wire [31:0] prod = mult[31:0];
        wire carry_out = mult[32];
`else
        SB_MAC16 #(
                .TOPOUTPUT_SELECT(2'b00), // adder, unregistered
                .TOPADDSUB_LOWERINPUT(2'b10), // multiplier hi bits
                .TOPADDSUB_UPPERINPUT(1'b1), // input C
                .TOPADDSUB_CARRYSELECT(2'b11), // top carry in is carry out from lower adder
                .BOTOUTPUT_SELECT(2'b00), // adder, unregistered
                .BOTADDSUB_LOWERINPUT(2'b10), // multiplier lo bits
                .BOTADDSUB_UPPERINPUT(1'b1), // input D
                .BOTADDSUB_CARRYSELECT(2'b00) // bottom carry in constant 0
        ) mult16x16_add(
                .CLK(clk),
                .A(a),
                .B(b),
                .C(c[31:16]),
                .D(c[15:0]),
                .O(prod),
                .CO(carry_out)
        );
`endif
endmodule

module mult32x32_signed(
        input clk,
        input [31:0] a,
        input [31:0] b,
        input a_sign,
        input b_sign,
        output [63:0] prod,
        output carry_out
);
        // todo: figure out the sign bit extension to avoid the extra plusses
        wire a_neg = a[31] & a_sign;
        wire b_neg = b[31] & b_sign;
        wire [31:0] a_pos = a_neg ? ~a + 1'b1 : a;
        wire [31:0] b_pos = b_neg ? ~b + 1'b1: b;
        wire [15:0] al = a_pos[15: 0];
        wire [15:0] bl = b_pos[15: 0];
        wire [15:0] ah = a_pos[31:16];
        wire [15:0] bh = b_pos[31:16];

        wire [31:0] al_bl;
        wire [31:0] al_bh;
        wire [31:0] ah_bl;
        wire [31:0] ah_bh;
        wire al_bl_carry;
        wire al_bh_carry;
        wire ah_bl_carry;

        // if both inputs are negative or positive, the result is positive,
        // so only if a XOR b is negative do we produce a negative result
        wire neg_result = (a_neg ^ b_neg);

        wire [63:0] prod_u = { ah_bh[31:0], al_bh[15:0], al_bl[15:0] };

        // ideally we could do the MAC with the carry, but that seems to fail
        assign prod = neg_result ? ~prod_u + 1'b1 : prod_u;

        mult16x16 mult_al_bl(
                .clk(clk),
                .a(al),
                .b(bl),
                .c(0),
                .prod(al_bl),
                .carry_out(al_bl_carry)
        );

        mult16x16 mult_ah_bl(
                .clk(clk),
                .a(ah),
                .b(bl),
                .c({16'h0000, al_bl[31:16]}), // al_bl_carry can never be set
                .prod(ah_bl),
                .carry_out(ah_bl_carry) // ah_bl_carry should never be set
        );

        mult16x16 mult_al_bh(
                .clk(clk),
                .a(al),
                .b(bh),
                .c(ah_bl),
                .prod(al_bh),
                .carry_out(al_bh_carry) // al_bh_carry *can* be set
        );

        // ah_bl_carry will never be set, but the second term might
        mult16x16 mult_ah_bh(
                .clk(clk),
                .a(ah),
                .b(bh),
                .c({15'h0000, al_bh_carry, al_bh[31:16]}),
                .prod(ah_bh),
                .carry_out(carry_out)
        );

endmodule

maikmerten commented 2 years ago

@osresearch Thanks for posting your code, this was very helpful! Starting from there, I experimented a bit and for now arrived at this code, which extends the 32-bit operands to 33-bits. For signed operations, the operands are sign-extended, for unsigned operations they are zero-extended. The extended operands are then fed into a signed multiplication. Yosys successfully infers four DSP blocks for this with "synth_ice40 -dsp" and the solution doesn't have carry-chains for inverting the two-complement representation.

`default_nettype none

`include "cpu/aludefs.vh"

module spu32_cpu_muldsp(
        input I_clk,
        input I_en,
        input I_reset,
        input[3:0] I_op,
        input[31:0] I_s1,
        input[31:0] I_s2,
        output[63:0] O_result,
        output O_busy
    );

    // single-cycle multiplication with inferred DSP blocks
    assign O_busy = 0;

    // determine sign/zero-extension for operands
    reg s1_extension, s2_extension;
    always @(*) begin
        s1_extension = 1'b0;
        s2_extension = 1'b0;
        case(I_op)
            `ALUOP_MULH: begin
                s1_extension = I_s1[31]; // sign-extend s1
                s2_extension = I_s2[31]; // sign-extend s2
            end
            `ALUOP_MULHSU: begin
                s1_extension = I_s1[31]; // sign-extend s1
            end
        endcase
    end

    wire[63:0] mul_signed = $signed({s1_extension, I_s1}) * $signed({s2_extension, I_s2});
    assign O_result = mul_signed;

endmodule

This passes the multiplication test vectors from the RISC-V test suite, so I'm somewhat hopeful things may be correct (as they should be, unless I'm missing a gotcha regarding two-complement multiplication). Perhaps this works for you as well?

osresearch commented 2 years ago

I should have posted a followup... since mainline yosys with -dsp now infers multipliers and Claire merged my changes to enable a DSP multiply instructions for the picorv32, for that application it is no longer necessary to hand-code the hard blocks. https://github.com/YosysHQ/picorv32/pull/202

Dolu1990 commented 2 years ago

@osresearch Thanks ^^

SpinalHDL / VexRiscv

Add support for using the DSP blocks on the iCE40UP5K? #40