More optimized readback stage RTL generation

Blebowski commented 6 months ago

I am trying to run peakRDL regblok on a block that we are designing. Currently, we are using ORDT, but since the ORDT is not mainteined anymore, we might be looking for alternatives.

I see that the readback logic is not very optimal, e.g.

 assign readback_array[3][0:0] = (decoded_reg_strb.r_precharge_en && !decoded_req_is_wr) ? field_storage.r_precharge_en.f_scb_en.value : '0;
    assign readback_array[3][1:1] = (decoded_reg_strb.r_precharge_en && !decoded_req_is_wr) ? field_storage.r_precharge_en.f_cpb_en.value : '0;
    assign readback_array[3][2:2] = (decoded_reg_strb.r_precharge_en && !decoded_req_is_wr) ? field_storage.r_precharge_en.f_macandd_en.value : '0;
    assign readback_array[3][3:3] = (decoded_reg_strb.r_precharge_en && !decoded_req_is_wr) ? field_storage.r_precharge_en.f_spect_en.value : '0;
    assign readback_array[3][4:4] = (decoded_reg_strb.r_precharge_en && !decoded_req_is_wr) ? field_storage.r_precharge_en.f_kdb_en.value : '0;
    assign readback_array[3][5:5] = (decoded_reg_strb.r_precharge_en && !decoded_req_is_wr) ? field_storage.r_precharge_en.f_fss_en.value : '0;
    assign readback_array[3][6:6] = (decoded_reg_strb.r_precharge_en && !decoded_req_is_wr) ? field_storage.r_precharge_en.f_otp_en.value : '0;
    assign readback_array[3][7:7] = (decoded_reg_strb.r_precharge_en && !decoded_req_is_wr) ? field_storage.r_precharge_en.f_mbist_en.value : '0;

I see that on the each stage there is the same AND condition. The problem is that this creates the same logic multiple times during elaboration, and then also during initial technology mapping. The logic is then only optimized out due to resource sharing algorithms during the synthesis. I think this solution is sub-optimal. In my experience, the more you rely on resource sharing or synthesis optimizations to do their job, the longer the synthesis takes. If we chose peakRDL for our next ASIC project, and designed 20 blocks with it (1000s of flip flops in 100s of registers), such simple thing might be actually dozens of extra minutes of synthesis run-time.

I think instead the "common" logic should be abstracted to a temporary signal, e.g.:

logic r_precharge_en_rd_en = decoded_reg_strb.r_precharge_en && !decoded_req_is_wr;
assign readback_array[3][0:0] = (r_precharge_en_rd_en) ? field_storage.r_precharge_en.f_scb_en.value : '0;
assign readback_array[3][1:1] = (r_precharge_en_rd_en) ? field_storage.r_precharge_en.f_cpb_en.value : '0;
assign readback_array[3][2:2] = (r_precharge_en_rd_en) ? field_storage.r_precharge_en.f_macandd_en.value : '0;
assign readback_array[3][3:3] = (r_precharge_en_rd_en) ? field_storage.r_precharge_en.f_spect_en.value : '0;
assign readback_array[3][4:4] = (r_precharge_en_rd_en) ? field_storage.r_precharge_en.f_kdb_en.value : '0;
assign readback_array[3][5:5] = (r_precharge_en_rd_en) ? field_storage.r_precharge_en.f_fss_en.value : '0;
assign readback_array[3][6:6] = (r_precharge_en_rd_en) ? field_storage.r_precharge_en.f_otp_en.value : '0;
assign readback_array[3][7:7] = (r_precharge_en_rd_en) ? field_storage.r_precharge_en.f_mbist_en.value : '0;

This-way directly the RTL reflects the structure of logic as one would draw it on paper.

It might seem a small thing, but for large design this really results in better turn-around times.

jamesrbailey commented 6 months ago

Many modern compilers implement common subexpression elimination, which effectively makes this a non-issue. I suspect most common simulator and synthesis tools will do this (but unfortunately it is difficult to know for sure).

amykyta3 commented 6 months ago

Yeah I agree with @jamesrbailey. Common sub-expression elimination is an extremely common optimization that I know is used extensively in all the modern HDL compilers I have used. AMD's Vivado calls it LUT combining, Design Compiler implements this in their opt stage, and similar for other popular toolchains.

Unless there is a benchmark you can share that clearly demonstrates a meaningful performance improvement, I would prefer to not modify this as it would add unnecessary complexity to the logic generator templating system.

Blebowski commented 6 months ago

Hi, the idea comes from experience with Vivado in 2021 (Pro edition). I was working on a block that had a VHDL clocked process decoding state of an FSM and controlling FSM outputs. The FSM was controlling a counter. In many states the counter was incremented:

counter_q <= counter_q + 1;

The "counter_q + 1" was scattered accross the FSM on maybe 30-40 places.

The only thing I did, was that I introduced a "counter increment" combo signal as a parallel assignment:

counter_increment <= counter_q + 1;

and replaced all the occurences of the "counter_q + 1" within the FSM with "counter_increment".

Without changing anything else (constraints, Vivado version, any synthesis options / strategy), I got around 700 LUTs less. The design in total was like 80 K LUTs so a very stupid change saved under 1 % of LUTs.

Therefore I always try to write the RTL in such way that it does not duplicate any comparison / arithmetic / AND,OR,XOR ... logic unless it is intended (e.g. some sort of redundancies). I think it is a good approach to try to write the RTL "as-if" schematic, and not rely on various optimizations.

Sure the example above was most likely caused by the complexity of the FSM, where the CSE most likely could not locate the common logic due to really enormous decoder of that FSM, but still.

I will try to benchmark this in DC some dummy generated register map that has big read-back stage and let you know.

Blebowski commented 1 month ago

Hi @amykyta3,

sorry for long time without reply. We were chasing tape-out, so I got to it only now.

We took one of our blocks, and we replaced ORDT generated register map with PeakRDL as trial. Then with couple of hand-optimizations in the spirit of above written, we tried to compare synthesis area before and after. SDCs, technology library stayed the same, so the only condition that changed were the "RTL Coding style" signals.

The RDL source and before / after version of the generated RTL:

dma_before.sv.txt dma_after.sv.txt dma_reg_map.rdl.txt

The synthesis via commercial ASIC synthesizer does differ slightly (first number is total area in um^2):

Before:

i_REG_MAP/genblk1_i_DMA            6249.3199     46.8  1645.8400  4452.2799  0.0000  dma_I_s_apb_apb3_intf__

After:

i_REG_MAP/genblk1_i_DMA            6207.0399     46.4  1603.2800  4452.5599  0.0000  dma_I_s_apb_apb3_intf__

the diff on this block is very small, about 1% of total area, so not very effective. Still, if the effort to do implement this would be negligible, I would go for it. Would you agree ?

I can download the Modelsim version that you were referring to, and run the regression suite too, so that I don't patches that break stuff.

When I applied these "hand-fixes" and also used async reset I get to :

i_REG_MAP/genblk1_i_DMA            6100.6400     46.0  1005.4800  4938.9200  0.0000  dma_I_s_apb_apb3_intf__

while ORDT AHB register map gives us something like:

i_REG_MAP/genblk1_i_DMA_PIO        5801.3200     46.7     3.6400     0.0000  0.0000  dma_pio

The 6800 vs 6100 is the most "apples to apples" comparison we could get.

We use PeakRDL 0.11.0 and reg-block that is packaged with it. The IFC is APB3.

The 300 gate difference is most likely OK for us, its just curiosity where does it come from.

PS: I was sending you a LinkedIn Invite. Could we have a chat about peakRDL ?

SystemRDL / PeakRDL-regblock

More optimized readback stage RTL generation #103