YosysHQ / yosys

Yosys Open SYnthesis Suite
https://yosyshq.net/yosys/
ISC License
3.48k stars 888 forks source link

Feature request: Optimize behavioral-like logic/arith expressions into LUTs when number of inputs is low enough #3084

Open marzoul opened 2 years ago

marzoul commented 2 years ago

Description

Yosys (or is it abc ?) could better optimize logic and arith expressions on LUT-based FPGA architectures. With the following very low-complexity pass:

When it is very visible that an expression, or even a hierarchical level from RTL, has only a few input bits, then complex logic/arith expressions are evaluated for all combinations of inputs, thus producing direct LUT configs.

Pertinent thresholds would be:

The interest of this is the following:

Reproducer

I found the issue with a generic behavioral implementation of a popcount component. Here is the behavioral code (in VHDL, but the actual input language is irrelevant to the request):

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity popcount is
    port (
        data : in  std_logic_vector(4 downto 0);
        num  : out std_logic_vector(2 downto 0)
    );
end entity;

architecture synth of popcount is

    function func_popcount(A : std_logic_vector) return natural is
        variable c : natural;
    begin
        c := 0;
        for i in A'low to A'high loop
            if A(i) = '1' then
                c := c + 1;
            end if;
        end loop;
        return c;
    end function;

begin

    num <= std_logic_vector(to_unsigned(func_popcount(data), 3));

end architecture;

The interest of the above behavioral code, for users, is that it is generic. So this style is a good choice to create a parameterized component with parameters for the size of inputs and outputs.

The following structural code better exhibits the LUTs that an experienced user would expect to get for such a 5-bits popcount. Obviously this is very cumbersome to write, and error-prone, and the manual methodology does no scale to large designs with lots of similar components.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity popcount is
    port (
        data : in  std_logic_vector(4 downto 0);
        num  : out std_logic_vector(2 downto 0)
    );
end entity;

architecture synth of popcount is

begin

    num <=
        std_logic_vector(to_unsigned(0, 3)) when data = "00000" else
        std_logic_vector(to_unsigned(1, 3)) when data = "00001" else
        std_logic_vector(to_unsigned(1, 3)) when data = "00010" else
        std_logic_vector(to_unsigned(2, 3)) when data = "00011" else
        std_logic_vector(to_unsigned(1, 3)) when data = "00100" else
        std_logic_vector(to_unsigned(2, 3)) when data = "00101" else
        std_logic_vector(to_unsigned(2, 3)) when data = "00110" else
        std_logic_vector(to_unsigned(3, 3)) when data = "00111" else
        std_logic_vector(to_unsigned(1, 3)) when data = "01000" else
        std_logic_vector(to_unsigned(2, 3)) when data = "01001" else
        std_logic_vector(to_unsigned(2, 3)) when data = "01010" else
        std_logic_vector(to_unsigned(3, 3)) when data = "01011" else
        std_logic_vector(to_unsigned(2, 3)) when data = "01100" else
        std_logic_vector(to_unsigned(3, 3)) when data = "01101" else
        std_logic_vector(to_unsigned(3, 3)) when data = "01110" else
        std_logic_vector(to_unsigned(4, 3)) when data = "01111" else
        std_logic_vector(to_unsigned(1, 3)) when data = "10000" else
        std_logic_vector(to_unsigned(2, 3)) when data = "10001" else
        std_logic_vector(to_unsigned(2, 3)) when data = "10010" else
        std_logic_vector(to_unsigned(3, 3)) when data = "10011" else
        std_logic_vector(to_unsigned(2, 3)) when data = "10100" else
        std_logic_vector(to_unsigned(3, 3)) when data = "10101" else
        std_logic_vector(to_unsigned(3, 3)) when data = "10110" else
        std_logic_vector(to_unsigned(4, 3)) when data = "10111" else
        std_logic_vector(to_unsigned(2, 3)) when data = "11000" else
        std_logic_vector(to_unsigned(3, 3)) when data = "11001" else
        std_logic_vector(to_unsigned(3, 3)) when data = "11010" else
        std_logic_vector(to_unsigned(4, 3)) when data = "11011" else
        std_logic_vector(to_unsigned(3, 3)) when data = "11100" else
        std_logic_vector(to_unsigned(4, 3)) when data = "11101" else
        std_logic_vector(to_unsigned(5, 3)) when data = "11110" else
        std_logic_vector(to_unsigned(5, 3)) when data = "11111" else

        std_logic_vector(to_unsigned(0, 3));

end architecture;

To reproduce, use the following command that uses the ghdl plugin:

yosys -m ghdl -p 'ghdl popcount.vhd -e popcount ; synth_xilinx -top popcount -family xc7 ; write_json popcount.json'

Expected behavior

This 5-bits popcount fits into just 3 LUT5. With both the behavioral and structural implementations. (We do obtain this with the structural implementation)

Actual behavior

Yosys generates 17 LUTs (+ a few muxf) with the behavioral implementation. Where obviously we know that this fits into just 3 LUT5.

marzoul commented 2 years ago

Just noticed, there is even a wrong output value (at least one) in my structural implementation...

marzoul commented 2 years ago

Yosys (or is it abc ?) struggles on the intermediate expressions, which in this case contains wide conditional additions. Dedicated arithmetic optimization could handle this (see #395).

But, my request is not about a dedicated arith optim. My request is precisely about implementing an optim pass that is independent from the logic and/or arith expressions that are used within the local hierarchical level.

Ravenslofty commented 2 years ago

So, for my part I'm against this idea: it's way too early to perform this pass, especially in the presence of more complicated logic that doesn't fit in a LUT. Or perhaps it turns out later that various inputs to a case or equivalent are unreachable, and the LUT we thought needed to be however big, could actually be much smaller.

I do agree Yosys should optimise this properly, and maybe I'll take a look, but my experience with FPGA mapping is that the best time to decide something is never and the second best time is right before you write the netlist out.

marzoul commented 2 years ago

Hi, thank you for your comment.

I agree with you for optimizations of arbitrary expressions (regardless of hierarchical boundaries). This could have undesired side-effects of hiding sub-expression sharing opportunities, if done too early and too arbitrarily.

But these issues don't exist for the other variant of the process that I propose (and what I am mostly interested in!). Which makes sense when synthesis is done with hierarchy kept (for instance, with command synth_xilinx without -flatten). There, for any module (hierarchical boundary), the proposed conditional optimization pass, done after all others, cannot worsen the design. Let me describe it a bit more detailed so the cost and benefit is more visible. Per-module, perform this:

Somehow it would be extremely useful, if as users, we could trust that the tool would try this optim. So keeping hierarchy would make sense not only for synthesis time: it would also provide more control over the results. Because on a design of mine, which is full of many instances similar to the reproducer I give in the issue description, the generated size is by default 380k LUTs. The only way I found to obtain better, is to manually rewrite the expressions within small modules, in order to "force" detection of small number of LUTs. And I get 240k LUTs. This is really significant!

Ravenslofty commented 2 years ago

Yosys doesn't preserve enough information to decide if this would be an improvement; such a lossless flow eats up memory, which is especially important to conserve on large designs such as yours.

The key thing here is "cannot worsen the design", which means Yosys must save that information so it can compare what ABC does on that module to what your proposed optimisation can do. Without this information preserved, it's simply too late to perform your optimisation at any stage other than really early on, which runs into the problem I mentioned earlier.

Also worth mentioning is that synth_xilinx is the only synthesis script that works hierarchically by default; almost all placement tools require flattened netlists, so flattening by default is the right thing to do. So now, your optimisation needs to run before the design is flattened, or else the module boundaries will be lost; without those it is near-impossible to keep track of what ABC has done to the netlist, as it performs very heavy netlist rewriting.

marzoul commented 2 years ago

I think you answer is mixing different points which don't have to be mixed.

I see no reason to believe that necessary information is lost for the proposed optim. Per-hierachy boundary, there is the result of ABC somewhere. Which is a small netlist of LUTs / carry / muxf / etc (explicitly handling only basic combinatorial primitives of the target technology). Which netlist, under the conditions listed in my previous comment, can be transformed into a potentially-smaller netlist, in terms of pure LUTs exclusively.

There may be something important that I am missing. Would you be able to indicate precisely which piece of information is lost throughout the flow, that would make the proposed optim not doable? If that is actually the case (that something is lost), then we should switch to discussing how to keep it. Because the genericity, simplicity and the potential benefit of the proposed optim would probably make it worth it. Even more, if necessary, I am ready to accept that an option is added to synth_xilinx, so that the proposed optim is the only synth/opt operation performed on the candidate modules.

See, I am ready to consider semi-extreme trade-offs, if that is what it would take to make yosys able to convert the 17 LUTs of my example (+carry and muxfs!) into the expected 3 LUT5 ;-)

Ravenslofty commented 2 years ago

No, these points are entirely relevant if you look at the bigger picture. As I mentioned: almost all Yosys synthesis flows are flattened by default. Those hierarchy boundaries are "the thing that is lost". That makes something like this optimisation extremely niche in practice because it requires a hierarchical flow.

Another thing I should mention is that it's difficult to help when examples are given in VHDL, because that has to go through GHDL, which is out of tree, and then perhaps there's a difference between versions of GHDL plugins across developers. I rewrote the popcount example into still-fairly-behavioural Verilog to see if I could reproduce the issue:

module popcount(input [4:0] data, output [2:0] num);

assign num = data[0] + data[1] + data[2] + data[3] + data[4];

endmodule

and then ran synth_xilinx -noiopad to get:

=== popcount ===

   Number of wires:                  8
   Number of wire bits:             24
   Number of public wires:           2
   Number of public wire bits:       8
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                  4
     CARRY4                          1
     LUT3                            2
     LUT4                            1

   Estimated number of LCs:          3

which is the 3 LUT5s you were talking about (and a carry4).

For all I know, the answer might be that GHDL is producing really strange input netlists that Yosys can't untangle; you would have to run GHDL and then write_rtlil netlist.il for me to see.

marzoul commented 2 years ago

The issue is reproduced in verilog with the following implementation. The interesting point is the genericity because it is parameterized.

module popcount(data, num);
    parameter size_in = 5;
    parameter size_out = 3;

    input  [size_in-1:0] data;
    output [size_out-1:0] num;

    integer n;
    integer i;

    always @(*) begin
        n = 0;
        for (i=0; i<size_in; i=i+1) begin
            if (data[i] != 0) n = n + 1;
        end
    end

    assign num = n;

endmodule

The result is indeed 17 LUTs:

   Number of cells:                 26
     CARRY4                          3
     LUT1                            6
     LUT2                            3
     LUT4                            5
     LUT6                            3
     MUXF7                           4
     MUXF8                           2

Note that this result is fragile, and a very tiny change makes it synthesized into 3 LUT5 like you obtained (+carry4). (replace if (data[i] != 0) n = n + 1 by n = n + data[i]) But perhaps in that result the LUTs are connected more in series, adding unwanted delay (not checked)... Anyway, a different functionality could have been there instead of popcount, without this variant in possible implementation to mitigate the area overhead (which depends on very specific tool behavior).

marzoul commented 2 years ago

When I try to read Yosys code, I tend to think that the improvement I propose is actually a task for ABC. Because it is ABC that is actually converting the design into LUTs, and doing LUT packing and optimization. This would require that ABC is also able to absorb functionality of other intermediate primitives into LUTs, for instance carry, muxf, why not combinational DSPs, division operations, etc. Anybody could confirm ?