Closed zeusima closed 5 years ago
From IRC:
18:15 < daveshah> The LUT map ordering changed to accommodate abc9
18:16 < daveshah> This will cause the relut changes to no longer allow the LUT and carry to be packed together
@zeusima You can give https://pastebin.com/Uxc9HM5s a shot
The pastebin from smunaut seems to help. The results without relut seem to be slightly faster, but that might be specific to my design.
@smunaut Can you please file this as a PR?
@janrinze Note that post-routing timing results are unfortunately very noisy, so a small change in an individual run is generally not meaningful; only a statistical change across runs is.
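To make the point about noise concrete, here is a minimal sketch of comparing two sets of Fmax results rather than two single runs. It assumes you have collected several Fmax values per toolchain version (for example by re-running place and route with different seeds). The function names and the threshold `k` are illustrative assumptions, and this is a crude heuristic, not a proper statistical test.

```python
from statistics import mean, stdev


def summarize(fmax_mhz):
    """Return (mean, sample standard deviation) of Fmax values from several runs."""
    return mean(fmax_mhz), stdev(fmax_mhz)


def difference_exceeds_noise(before_mhz, after_mhz, k=2.0):
    """Crude check: is the shift in mean Fmax larger than k times the larger spread?

    k is an arbitrary illustrative threshold, not a value taken from any tool.
    """
    mean_before, sd_before = summarize(before_mhz)
    mean_after, sd_after = summarize(after_mhz)
    return abs(mean_after - mean_before) > k * max(sd_before, sd_after)
```

Each list needs at least two values, since a single run says nothing about the spread.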
Yup working on submitting now.
Please note that the current master of yosys no longer produces correct output for combinatorial loops on ice40. A testcase can be found in https://github.com/dpiegdon/ringoscillator, where multiple ring oscillators are used to generate test output (metastable, and oscillators with different delay lines). The bitstream generated with the current master leads to a stable output on all of these (i.e. no clock, no metastable output). If I revert the commit https://github.com/smunaut/yosys/commit/f28e38de9994151ea4e22608441dbc9e116d7b8c they work again.
In https://github.com/DurandA/verilog-buildingblocks/issues/1 @smunaut argues about the details. What I take from that is that maybe the optimizer (relut?) should be disabled until it has been replaced with a new version?
@dpiegdon What is "correct output" for a combinatorial loop? I don't think Yosys gives any guarantees on what they will synthesize into.
In any case, a ring oscillator must not be designed as a combinatorial loop but as a series of primitives marked with (*keep*) if you want it to work, and a ring oscillator in FPGA fabric is not in fact guaranteed to work well (since they are prone to mode locking and interference) or at all (since it does not pass timing).
I am entirely against disabling a useful optimization pass (relut) in order to support a useless pattern (fabric ring oscillators written as combinatorial loops).
Let's be sensible: I disagree that fabric ring oscillators are useless, but instead an atypical pattern used in some unique applications, and one we should have no problems supporting. Now I don't think we should revert an optimisation that gives an improvement for most use cases, but instead try and get to the bottom of why it's broken and fix it so we can have our cake and eat it.
Let's approach it as we would any other regression: can you minimise the breaking design into a small circuit? What is the result with and without the bisected commit?
I disagree that fabric ring oscillators are useless, but instead an atypical pattern used in some unique applications
Sorry, my comment was badly written. I meant that inference of ring oscillators is, IMO, useless. Ring oscillators made from primitives with (*keep*) should obviously keep working, and if that does not work then it is a bug in some pass.
It's a bit irrelevant anyway, #1290 removes relut and apparently works fine with his use case.
Mmm, digging into the logs posted on https://github.com/DurandA/verilog-buildingblocks/issues/1 seems to show there is indeed an issue with this commit. Although it does restore I0-I3 to their proper places, the INIT string stays mangled ... (and this is on a ring made up entirely of manually instantiated LUT4s)
It's a bit irrelevant anyway, #1290 removes relut and apparently works fine with his use case.
-relut ... (it removes ice40_unlut) and should have no impact on this.
EDIT: Sorry! I see that https://github.com/smunaut/yosys/commit/f28e38de9994151ea4e22608441dbc9e116d7b8c did change ice40_unlut, which would have corrupted unlut-ing of manually instantiated LUTs, which is the root cause of this, so very relevant in fact!
I will try and merge #1290 in the morning, but if someone wants to resolve the minor conflict and merge then please go ahead!
@eddiehung done.
I can confirm that with the current master all my testcases are working properly now. Thanks guys!!
Thanks everyone. For future more forgetful me and future generations, the post-mortem of this is thus:
The new abc9 techmapper introduced a little while ago is timing aware, and knows that not all LUT inputs have equal delay (some are more equal than others). It requires that the fastest input come first, with delays monotonically increasing. However, in many architectures, the slowest input is typically the one with the lowest index (SB_LUT4.I0). Since the original abc techmapping pass treats all inputs as equal, I changed the techmap pass to reverse the existing mapping of logical LUTs to physical LUTs: the first logical input goes to the last physical input.
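As a rough illustration of why a pin reordering also has to rewrite the LUT contents (and why a mismatched INIT ends up "mangled"), the sketch below permutes a truth table to follow a new input order. It is not the Yosys techmap code; the function names are hypothetical and the INIT value is assumed to be a plain integer whose bit `a` holds the output for input address `a`, with input 0 as the least significant address bit.

```python
def permute_lut_init(init, width, perm):
    """Re-express a LUT truth table after moving logical input i to physical pin perm[i]."""
    new_init = 0
    for phys_addr in range(1 << width):
        # Rebuild the logical address that this physical input pattern corresponds to.
        logical_addr = 0
        for i in range(width):
            bit = (phys_addr >> perm[i]) & 1
            logical_addr |= bit << i
        # Copy the output bit from the old table into its new position.
        if (init >> logical_addr) & 1:
            new_init |= 1 << phys_addr
    return new_init


def reverse_inputs(init, width):
    """Map the first logical input to the last physical input, and so on."""
    perm = [width - 1 - i for i in range(width)]
    return permute_lut_init(init, width, perm)
```

Reversing twice gives back the original table, so a pass that undoes the pin order without also undoing the INIT permutation (or the other way round) leaves the LUT computing a different function.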
The ice40 architecture is a little funky in that the carry chain mux is located on the input side of the LUT, rather than the output side. This means that extra care needs to be taken to preserve the input pin ordering of LUTs in full adders during techmapping. The way this was done is that adders would instantiate SB_LUT4 cells, and these LUT cells would be treated as blackbox modules. After techmapping, the -relut option would cause those SB_LUT4 physical cells to be transformed back into logical $lut cells so that opt_lut could then be used to, if possible, pack more logic into those adder structures.
Since I did not consider the -relut option when integrating abc9, https://github.com/smunaut/yosys/commit/f28e38de9994151ea4e22608441dbc9e116d7b8c was necessary to cope with my new reversed pin order.
The problem with this instantiate-SB_LUT4-for-adders approach, however, is that it lost the ability to differentiate between SB_LUT4s created from inferred adders and SB_LUT4s manually instantiated by the designer, and thus ice40_unlut was "unlut"-ing the manually instantiated LUTs incorrectly.
The solution, as part of a greater refactoring surrounding how we treat ice40 adders, was to not instantiate SB_LUT4s for inferred adders but to instantiate an internal cell instead, and to get rid of ice40_unlut. This way, we never touch any manually instantiated SB_LUT4, yet are still able to use opt_lut on inferred logic.
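The difference between the two approaches can be sketched with a toy netlist model. This is only an illustration: the `Cell` class, the two pass functions, and the cell type name `INTERNAL_ADDER_LUT` are hypothetical and are not the names used inside Yosys.

```python
from dataclasses import dataclass

PHYSICAL_LUT = "SB_LUT4"
LOGICAL_LUT = "$lut"
INTERNAL_ADDER = "INTERNAL_ADDER_LUT"  # hypothetical name for the internal cell


@dataclass
class Cell:
    name: str
    kind: str


def unlut_all(cells):
    """Old approach: every SB_LUT4 becomes a $lut, regardless of who created it."""
    return [Cell(c.name, LOGICAL_LUT if c.kind == PHYSICAL_LUT else c.kind) for c in cells]


def unwrap_internal(cells):
    """New approach: only the internal adder cell becomes a $lut."""
    return [Cell(c.name, LOGICAL_LUT if c.kind == INTERNAL_ADDER else c.kind) for c in cells]
```

With the old approach, a designer's hand-placed `SB_LUT4` and an adder's `SB_LUT4` are indistinguishable and both get rewritten. With a distinct internal cell type, the pass can leave the hand-placed one alone.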
Steps to reproduce the issue
Using commit dd8d264b:
ICESTORMLC: 1540/ 7680 20%
Info: Max frequency for clock 'clk$glb_clk': 70.88 MHz (PASS at 12.00 MHz)
Total path delay: 14.28 ns (70.02 MHz)
Using commit 463f7100:
ICESTORMLC: 1952/ 7680
Info: Max frequency for clock 'clk$glb_clk': 64.07 MHz (PASS at 12.00 MHz)
Total path delay: 21.73 ns (46.02 MHz)
Same behaviour is observed with dd8d264b when -relut is enabled. Changing the placer algorithm or ABC setting makes no difference. The forcing of -relut to be always enabled highlighted this issue.
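For a sense of scale, the two reports above can be compared with a few lines of arithmetic. The numbers are the ones quoted for commits dd8d264b and 463f7100; the helper name is just for illustration.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Values taken from the two reports quoted above (dd8d264b, then 463f7100).
lc_change = percent_change(1540, 1952)
fmax_change = percent_change(70.88, 64.07)

print(f"Logic cells:   {lc_change:+.1f}%")    # +26.8%
print(f"Max frequency: {fmax_change:+.1f}%")  # -9.6%
```

This is a single pair of runs, so the frequency figure is subject to the usual place-and-route noise. The logic cell count is fixed before routing and is less affected by it.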
Expected behavior
An improvement to the utilisation and/or performance of the design should've been observed.
Actual behavior
The design shows higher utilisation and lower performance when -relut is enabled. This is consistent across multiple designs.