Open Blebowski opened 6 months ago
Many modern compilers implement common subexpression elimination, which effectively makes this a non-issue. I suspect most common simulator and synthesis tools will do this (but unfortunately it is difficult to know for sure).
Yeah I agree with @jamesrbailey. Common sub-expression elimination is an extremely common optimization that I know is used extensively in all the modern HDL compilers I have used. AMD's Vivado calls it LUT combining, Design Compiler implements this in their opt stage, and similar for other popular toolchains.
Unless there is a benchmark you can share that clearly demonstrates a meaningful performance improvement, I would prefer to not modify this as it would add unnecessary complexity to the logic generator templating system.
Hi, the idea comes from experience with Vivado in 2021 (Pro edition). I was working on a block that had a VHDL clocked process decoding state of an FSM and controlling FSM outputs. The FSM was controlling a counter. In many states the counter was incremented:
counter_q <= counter_q + 1;
The "counter_q + 1" was scattered accross the FSM on maybe 30-40 places.
The only thing I did, was that I introduced a "counter increment" combo signal as a parallel assignment:
counter_increment <= counter_q + 1;
and replaced all the occurences of the "counter_q + 1" within the FSM with "counter_increment".
Without changing anything else (constraints, Vivado version, any synthesis options / strategy), I got around 700 LUTs less. The design in total was like 80 K LUTs so a very stupid change saved under 1 % of LUTs.
Therefore I always try to write the RTL in such way that it does not duplicate any comparison / arithmetic / AND,OR,XOR ... logic unless it is intended (e.g. some sort of redundancies). I think it is a good approach to try to write the RTL "as-if" schematic, and not rely on various optimizations.
Sure the example above was most likely caused by the complexity of the FSM, where the CSE most likely could not locate the common logic due to really enormous decoder of that FSM, but still.
I will try to benchmark this in DC some dummy generated register map that has big read-back stage and let you know.
Hi @amykyta3,
sorry for long time without reply. We were chasing tape-out, so I got to it only now.
We took one of our blocks, and we replaced ORDT generated register map with PeakRDL as trial. Then with couple of hand-optimizations in the spirit of above written, we tried to compare synthesis area before and after. SDCs, technology library stayed the same, so the only condition that changed were the "RTL Coding style" signals.
The RDL source and before / after version of the generated RTL:
dma_before.sv.txt dma_after.sv.txt dma_reg_map.rdl.txt
The synthesis via commercial ASIC synthesizer does differ slightly (first number is total area in um^2):
Before:
i_REG_MAP/genblk1_i_DMA 6249.3199 46.8 1645.8400 4452.2799 0.0000 dma_I_s_apb_apb3_intf__
After:
i_REG_MAP/genblk1_i_DMA 6207.0399 46.4 1603.2800 4452.5599 0.0000 dma_I_s_apb_apb3_intf__
the diff on this block is very small, about 1% of total area, so not very effective. Still, if the effort to do implement this would be negligible, I would go for it. Would you agree ?
I can download the Modelsim version that you were referring to, and run the regression suite too, so that I don't patches that break stuff.
When I applied these "hand-fixes" and also used async reset I get to :
i_REG_MAP/genblk1_i_DMA 6100.6400 46.0 1005.4800 4938.9200 0.0000 dma_I_s_apb_apb3_intf__
while ORDT AHB register map gives us something like:
i_REG_MAP/genblk1_i_DMA_PIO 5801.3200 46.7 3.6400 0.0000 0.0000 dma_pio
The 6800
vs 6100
is the most "apples to apples" comparison we could get.
We use PeakRDL 0.11.0 and reg-block that is packaged with it. The IFC is APB3.
The 300 gate difference is most likely OK for us, its just curiosity where does it come from.
PS: I was sending you a LinkedIn Invite. Could we have a chat about peakRDL ?
I am trying to run peakRDL regblok on a block that we are designing. Currently, we are using ORDT, but since the ORDT is not mainteined anymore, we might be looking for alternatives.
I see that the readback logic is not very optimal, e.g.
I see that on the each stage there is the same AND condition. The problem is that this creates the same logic multiple times during elaboration, and then also during initial technology mapping. The logic is then only optimized out due to resource sharing algorithms during the synthesis. I think this solution is sub-optimal. In my experience, the more you rely on resource sharing or synthesis optimizations to do their job, the longer the synthesis takes. If we chose peakRDL for our next ASIC project, and designed 20 blocks with it (1000s of flip flops in 100s of registers), such simple thing might be actually dozens of extra minutes of synthesis run-time.
I think instead the "common" logic should be abstracted to a temporary signal, e.g.:
This-way directly the RTL reflects the structure of logic as one would draw it on paper.
It might seem a small thing, but for large design this really results in better turn-around times.