gatecat / nextpnr-xilinx

Experimental flows using nextpnr for Xilinx devices
ISC License
200 stars 37 forks source link

Placer fails with long CARRY4 chains #34

Open kazkojima opened 2 years ago

kazkojima commented 2 years ago

Hi,

A test case which is a 256-bit version of the carry stress test in SymbiFlow/f4pga-arch-defs

module top (
    input clk,
    output out
);
    wire [256-1:0] a;
    wire [256-1:0] b;
    wire [256-1:0] c;

    LFSR #(.POLY(5)) a_src(
        .clk(clk),
        .out(a)
        );

    LFSR #(.POLY(9)) b_src(
        .clk(clk),
        .out(b)
        );

    assign carry = 1;
    assign c = a + b + carry;
    assign out = c[256-1];
endmodule

module LFSR (
    input clk,
    output [256-1:0] out
    );
    parameter POLY = 1;

    reg [256-1:0] r = 1;
    assign out = r;
    wire f;
    assign f = ^(POLY & ~r);
    always @( posedge clk)
      r <= {r[256-2:0], ~f};
endmodule

which causes a hang up of HeAP placer. SA placer fails with

ERROR: failed to place chain starting at cell '$auto$alumacc.cc:485:replace_alu$1627.genblk1.slice[0].genblk1.carry4$split$muxcy0$PACKED_CARRY4$'
ERROR: Placing design failed.

It seems that all xc7 have no CARRY4 at grid_y which is a multiple of 26, as far as I've checked prjxray/database tilegrid.json files. gdb shows that that hang up happens in placer_heap.cc:legalise_placement_strict() looking for vertically contiguous chain of length > 25. The patch below works for me, though I can test it for one kintex-7 target only.

diff --git a/xilinx/pack_carry_xc7.cc b/xilinx/pack_carry_xc7.cc
index 85faa39d..a56f8407 100644
--- a/xilinx/pack_carry_xc7.cc
+++ b/xilinx/pack_carry_xc7.cc
@@ -240,7 +240,8 @@ void XC7Packer::pack_carries()
                 c4->constr_parent = root;
                 root->constr_children.push_back(c4);
                 c4->constr_x = 0;
-                c4->constr_y = -i / 4;
+       // Looks no CARRY4 on the tile of which grid_y is a multiple of 26. Skip them
+                c4->constr_y = -(i / 4 + i / (4*25));
                 c4->constr_abs_z = true;
                 c4->constr_z = BEL_CARRY4;
             }
@@ -345,7 +346,7 @@ void XC7Packer::pack_carries()
                 root->constr_children.push_back(s_lut);
                 s_lut->constr_parent = root;
                 s_lut->constr_x = 0;
-                s_lut->constr_y = -i / 4;
+                s_lut->constr_y = -(i / 4 + i / (4*25));
                 s_lut->constr_abs_z = true;
                 s_lut->constr_z = (z << 4 | BEL_6LUT);
             }
@@ -353,7 +354,7 @@ void XC7Packer::pack_carries()
                 root->constr_children.push_back(di_lut);
                 di_lut->constr_parent = root;
                 di_lut->constr_x = 0;
-                di_lut->constr_y = -i / 4;
+                di_lut->constr_y = -(i / 4 + i / (4*25));
                 di_lut->constr_abs_z = true;
                 di_lut->constr_z = (z << 4 | BEL_5LUT);
             }
programmerjake commented 2 years ago

We at Libre-SOC encountered this bug before, when trying to build microwatt, a workaround is to tell yosys to not generate CARRY4 at all with synth_xilinx -nocarry.

cc @toshywoshy

kazkojima commented 2 years ago

Thanks. That would certainly be a workaround. It would disable fast carry chains, which is a bit damaging for long bit-length additions, though. I ran into this issue with an x25519 calculation and -nocarry reduced the max clock freq. by ~1/3.

programmerjake commented 2 years ago

Luke (the person at Libre-SOC who originally encountered this bug) thinks this bug is best fixed fixed by making yosys split long carry4 chains. imho vpr and nextpnr-xilinx are where that would be handled since they know the details of how many carry4 blocks are available and should be able to insert the necessary routing between them. imho yosys is the wrong place to try to fix that.

additional context: https://libre-soc.org/irclog/%23libre-soc.2022-03-11.log.html#t2022-03-11T18:23:08

Contents of email Luke (lkcl) asked me to post ``` marzoul: fyi the same bug in symbiflow related to CARRY4, which i discussed with acomodi when this was the #symbiflow channel, is also present in nextpnr-xilinx gatecat ^ it does not show up when using RISC-V 32-bit cores because those never create carry-chains of more than 23-25 CARRY4 blocks only if you attempt a 64-bit core using e.g. a divide unit with 128-bit remainder/quotients e.g. in Power ISA 64-bit do you run smack into this problem toshywoshy, as discussed yesterday ^ the symptoms for symbiflow are that it bombs out with an error the symptoms for nextpnr-xilinx are that it completely locks up and fails to complete, going into an infinite loop as long as you stay below about 4x23-or-so (95) bit add, subtract or compare, you're "fine" the "workaround" is to use "-nocarry" to yosys "synth_xilinx" -nocarry do not use XORCY/MUXCY/CARRY4 cells in output netlist the proper solution is to fix the synth_xilinx techmap so that it splits anything above 95-or-thereabouts bits into separate adds/subs/cmps with an explicit carry-signal x[128], y[128] add(x, y) --> z[0:97] = add(x[0:96], y[0:96]) z[96:128] = add(z[96], x[96:128], y[96:128]) you get the general idea i'm sure ```
kazkojima commented 2 years ago

Ok if yosys can solve the problem during synthesis, though it seems more natural to resolve that during placement. I found that the split_carry_chain() in nextpnr/ecp5/pack.cc splits the carry chain into multiple legal chains based on the chip information. There appears to be a quite similar function in nextpnr/ice40/chains.cc.