Closed rowanG077 closed 3 months ago
TL;DR: add -nowidelut to get it to synthesise in 417 LUT4s, which is the number you expect here.
This is a classic case of delay-area tradeoff. There are some hints in the ABC9 output about what it's doing:
ABC: + &if -W 300 -v
ABC: K = 7. Memory (bytes): Truth = 0. Cut = 60. Obj = 140. Set = 636. CutMin = no
ABC: Node = 4426. Ch = 472. Total mem = 0.68 MB. Peak cut mem = 0.07 MB.
ABC: P: Del = 4041.00. Ar = 2030.0. Edge = 2363. Cut = 87308. T = 0.01 sec
ABC: P: Del = 4041.00. Ar = 1974.0. Edge = 2292. Cut = 86974. T = 0.01 sec
ABC: P: Del = 4041.00. Ar = 1742.0. Edge = 2290. Cut = 174418. T = 0.02 sec
ABC: F: Del = 4041.00. Ar = 1129.0. Edge = 1898. Cut = 87285. T = 0.01 sec
ABC: A: Del = 4041.00. Ar = 1036.0. Edge = 1717. Cut = 85301. T = 0.01 sec
ABC: A: Del = 4041.00. Ar = 1019.0. Edge = 1708. Cut = 85032. T = 0.01 sec
[snip]
ABC: + &ps -l
ABC: <abc-temp-dir>/input : i/o = 110/ 36 and = 3230 lev = 26 (14.42) mem = 0.04 MB box = 0 bb = 0
ABC: Mapping (K=7) : lut = 376 edge = 1708 lev = 8 (3.64) mem = 0.02 MB
ABC: LUT = 376 : 2=59 15.7 % 3=45 12.0 % 4=73 19.4 % 5=85 22.6 % 6=60 16.0 % 7=54 14.4 % Ave = 4.54
ABC9 will map first for minimum delay, then attempt to recover area. The lowest-delay solution it finds is 4.041ns, which needs a lot of large LUTs: 54 LUT7s, where each LUT7 is made up of 8 LUT4s (and ABC9 is aware of this; its best area is 1019 LUT4s).
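The area figure can be cross-checked against the LUT size histogram that &ps -l prints. This is a minimal sketch, not anything ABC itself runs: only the LUT7 cost of 8 LUT4s is stated above, and the LUT5 and LUT6 costs (2 and 4) are my assumption, chosen because they double per extra input and reproduce the reported total.

```python
# LUT size histogram from the "&ps -l" output above (LUT inputs -> count).
histogram = {2: 59, 3: 45, 4: 73, 5: 85, 6: 60, 7: 54}

# Cost of each LUT size in LUT4s. Only the LUT7 cost (8) is stated in the
# thread; the LUT5 and LUT6 costs are assumptions (doubling per extra input).
lut4_cost = {2: 1, 3: 1, 4: 1, 5: 2, 6: 4, 7: 8}

total_luts = sum(histogram.values())
area_in_lut4s = sum(count * lut4_cost[size] for size, count in histogram.items())

print(total_luts)     # 376, matching "lut = 376"
print(area_in_lut4s)  # 1019, matching the final "Ar = 1019.0"
```

The 54 LUT7s alone account for 432 of those 1019 LUT4s, which is why the wide LUTs dominate the area.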
Let's compare to the output with -nowidelut.
ABC: + &if -W 300 -v
ABC: K = 4. Memory (bytes): Truth = 0. Cut = 48. Obj = 128. Set = 528. CutMin = no
ABC: Node = 4426. Ch = 472. Total mem = 0.63 MB. Peak cut mem = 0.06 MB.
ABC: P: Del = 4748.00. Ar = 480.0. Edge = 1728. Cut = 48240. T = 0.01 sec
ABC: P: Del = 4718.00. Ar = 495.0. Edge = 1828. Cut = 46528. T = 0.01 sec
ABC: P: Del = 4718.00. Ar = 461.0. Edge = 1644. Cut = 50842. T = 0.01 sec
ABC: F: Del = 4718.00. Ar = 422.0. Edge = 1516. Cut = 31685. T = 0.00 sec
ABC: A: Del = 4718.00. Ar = 418.0. Edge = 1433. Cut = 33285. T = 0.01 sec
ABC: A: Del = 4718.00. Ar = 417.0. Edge = 1431. Cut = 32689. T = 0.01 sec
[snip]
ABC: + &ps -l
ABC: <abc-temp-dir>/input : i/o = 110/ 36 and = 2514 lev = 26 (14.28) mem = 0.03 MB box = 0 bb = 0
ABC: Mapping (K=4) : lut = 417 edge = 1431 lev = 10 (4.31) mem = 0.02 MB
ABC: LUT = 417 : 2=71 17.0 % 3=95 22.8 % 4=251 60.2 % Ave = 3.43
Here the solution is slower - 4.718ns - but uses significantly less area (417 LUT4s) because it is not considering the large LUTs that are enabled by default.
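To put the tradeoff in numbers, here is a small sketch that uses only the final Del/Ar values from the two logs above. It treats Del as picoseconds, in line with the 4.041ns and 4.718ns readings given in the text.

```python
# Final "A:" lines of the two &if runs above: delay in ps, area in LUT4s.
wide = {"delay_ps": 4041, "area_lut4": 1019}   # default mapping, K = 7
nowide = {"delay_ps": 4718, "area_lut4": 417}  # with -nowidelut, K = 4

# Relative cost in delay and relative saving in area of the K = 4 mapping.
delay_increase = nowide["delay_ps"] / wide["delay_ps"] - 1
area_reduction = 1 - nowide["area_lut4"] / wide["area_lut4"]

print(f"delay: +{delay_increase:.1%}")
print(f"area:  -{area_reduction:.1%}")
```

That works out to roughly 17% more delay in exchange for about 59% less area on this design.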
As for -abc2: it seems you've discovered a scalability problem in &fraig -x; it might be worth reporting to ABC, although don't get your hopes up that it'll ever be fixed.
This actually turned out to be a catalyst for YosysHQ/abc#30, which fixes an issue in the &mfs postprocessing pass. The downside of doing that is that &mfs is... slow on this, likely because XORs are pretty awful.
So now the relevant numbers look like this:
LUT7: 15 LUT4s fewer
Number of cells: 1707
L6MUX21 218
LUT4 1004
PFUMX 414
TRELLIS_FF 71
LUT4 (-nowidelut): 9 LUT4s fewer
Number of cells: 479
LUT4 408
TRELLIS_FF 71
Thanks for the very clear explanation! I guess the only question I have is whether there is some way to enable/disable -nowidelut on specific nets or maybe even a module? Disabling it wholesale is quite a sledgehammer if this is the only affected module.
There's no direct way to perform such a thing, because the LUT library passed to ABC9 is a global thing.
However, there are a few implicit ways to reduce area here:
- related to the above, there's a poorly-documented scratchpad setting called abc9.D, which allows you to set the critical path delay for the entire mapping to something slower than "best possible delay".
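As a rough illustration of how the abc9.D setting might be used, the sketch below generates a Yosys script that sets the scratchpad value before running synth_ecp5. Several things here are assumptions rather than anything confirmed in this thread: the file name crc32_validator.v is hypothetical, the value is assumed to be in picoseconds like the Del figures in the ABC log, and 4718 is simply the -nowidelut delay reused as an example target. Whether this recovers a similar area on this design has not been checked.

```python
def build_yosys_script(verilog_path: str, top: str, delay_target: int) -> str:
    """Return a Yosys script that sets abc9.D before running synth_ecp5.

    delay_target is assumed to be in picoseconds (an assumption, matching
    the units of the "Del" column in the ABC log).
    """
    commands = [
        f"read_verilog {verilog_path}",
        # Relax the delay target so ABC9 is not forced to the fastest mapping.
        f"scratchpad -set abc9.D {delay_target}",
        f"synth_ecp5 -top {top}",
        "stat",
    ]
    return "\n".join(commands) + "\n"


# Hypothetical file name; the top module name is the one from the issue.
script = build_yosys_script("crc32_validator.v", "crc32_validator", 4718)
print(script)
```

The printed text can be saved to a file and run with yosys -s, and the stat output compared against the cell counts above.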
Is this an actual API Yosys provides or just something that happens to work today?
It's an API that ABC provides and the Yosys side exposes, so if you're asking if it can be relied on, yes.
By the way, since it appears you wrote this code for liteeth, would you be okay with me including this in the Yosys-internal benchmark suite?
As a benchmark that benefits from &mfs but initially exposes an ABC bug, it might be a useful smoke test to have in case the issue reappears.
Further, the heavy use of XOR makes this near-worst-case for an AND-Inverter Graph representation like ABC's, and that contributes to a noticeable runtime from &mfs, which could be tracked.
I'm fine with that. It is based on the existing CRC validator in liteeth, so it's not fully my work. Pinging @enjoy-digital.
@rowanG077: No problem on my end.
Version
Yosys 0.39+124 (git sha1 d73f71e813d, g++ 12.2.0 -fPIC -Os)
On which OS did this happen?
Linux
Reproduction Steps
I was very surprised to see an extremely high logic usage for a CRC32 validator generated via litex. I have cut it down to the essentials. Synthesize the following module using synth_ecp5 -top crc32_validator
Expected Behavior
I expect the circuit to synthesize to around 500 LUT4:
I'm definitely not a Verilog expert, so maybe some constructs are doing different things than I expect. But still, over 1000 LUT4s seems way off.
Actual Behavior
Yosys synthesizes it to a circuit with the following resource usage:
I also tried the -abc2 flag to try to optimize it further. While this helps a bit, it's still way too large. In addition, adding -abc2 results in a very long synthesis time. It's stuck on:
Interestingly, just the XOR tree does synthesize well:
result: