The-OpenROAD-Project / OpenROAD

OpenROAD's unified application implementing an RTL-to-GDS Flow. Documentation at https://openroad.readthedocs.io/en/latest/
https://theopenroadproject.org/
BSD 3-Clause "New" or "Revised" License

Improve CTS for MegaBoom #5195

Open oharboe opened 3 months ago

oharboe commented 3 months ago

Description

MegaBoom has a BoomTile macro (essentially CPU + L1) with a single clock input and a clock tree that is very deep relative to the clock period. The BoomTile connects to the system buses via a TileLink interface.

The BoomTile was taped out in 28nm.

https://github.com/The-OpenROAD-Project/megaboom

image

To reproduce:

Clock clock
1594.44 source latency dcache/data/array_0_0_ext/R0_clk ^
-1278.74 target latency dcache/data/io_resp_0_0[69]$_DFF_P_/CLK ^
  10.00 clock uncertainty
 -12.67 CRPR
--------------
 313.02 setup skew

In similar test cases that I can't share, the clock network insertion latency of the macros is higher, and the skew after CTS is correspondingly higher.

The image below is from a test case that I can share. There are 4 macros with about 1000 ps of clock insertion latency, and the skew is on the order of 1000 ps:

image

repair_timing after CTS in such cases doesn't move the needle.

Suggested Solution

Improve CTS

Additional Context

No response

maliberty commented 3 months ago

It seems like the skew is more of an issue than the insertion delay.

oharboe commented 3 months ago

It seems like the skew is more of an issue than the insertion delay.

Yes, but comparing the design I can't share with megaboom (a sample of 2 :grimacing:), it looks like the skew increases with the clock insertion latency of the macros in the design.

precisionmoon commented 3 months ago

We're looking into this. This is the clock tree immediately after CTS, before repair_clock_nets. The clock period is 6500 ps and there are ~240K FF sinks and 123 macro pin sinks. The delay adjustment for the macro tree seems off. We'll provide an update by Friday.

image

precisionmoon commented 3 months ago

I can reproduce the latency of ~1600 ps and the skew of ~310 ps with the test case.

Clock clock
1594.44 source latency dcache/data/array_0_0_ext/R0_clk ^
-1278.74 target latency dcache/data/io_resp_0_0[69]$_DFF_P_/CLK ^
  10.00 clock uncertainty
 -12.67 CRPR
--------------
 313.02 setup skew

From the clock tree viewer, I had the impression that the delay adjustment for the macro sink tree was excessive, so I tried disabling all delay buffer insertion with the -delay_buffer_derate 0.0 option. This reduced the max latency but worsened the skew:

Clock clock
1453.76 source latency frontend/bpd/banked_predictors_1/tage/t_1/io_f3_resp_2_valid$_DFF_P_/CLK ^
-1020.67 target latency frontend/bpd/banked_predictors_0/bim/data_ext/R0_clk ^
  10.00 clock uncertainty
 -12.67 CRPR
--------------
 430.42 setup skew

If only half of the delay buffers are added (-delay_buffer_derate 0.5), the skew is lower than in the 0.0 run, but the max latency doesn't change:

Clock clock
1453.76 source latency frontend/bpd/banked_predictors_1/tage/t_1/io_f3_resp_2_valid$_DFF_P_/CLK ^
-1087.05 target latency frontend/bpd/banked_predictors_0/bim/data_ext/R0_clk ^
  10.00 clock uncertainty
 -12.67 CRPR
--------------
 364.04 setup skew
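
As a quick cross-check, the three reports above can be recomputed from their own terms. The sketch below assumes that setup skew is simply source latency minus target latency, plus clock uncertainty, minus CRPR, which is how the report lines appear to sum. The dictionary layout and function names are illustrative and are not part of OpenROAD.

```python
# Values copied from the three skew reports above (all in ps).
# Keys name the -delay_buffer_derate setting; "default" is the run with no override.
REPORTS = {
    "default": {"source": 1594.44, "target": 1278.74,
                "uncertainty": 10.00, "crpr": 12.67, "reported": 313.02},
    "0.0": {"source": 1453.76, "target": 1020.67,
            "uncertainty": 10.00, "crpr": 12.67, "reported": 430.42},
    "0.5": {"source": 1453.76, "target": 1087.05,
            "uncertainty": 10.00, "crpr": 12.67, "reported": 364.04},
}


def setup_skew(source, target, uncertainty, crpr):
    """Assumed formula: source - target + uncertainty - CRPR (all in ps)."""
    return source - target + uncertainty - crpr


def summarize(reports):
    """Return (name, max latency, recomputed skew, reported skew) per run."""
    rows = []
    for name, r in reports.items():
        skew = setup_skew(r["source"], r["target"], r["uncertainty"], r["crpr"])
        rows.append((name, r["source"], round(skew, 2), r["reported"]))
    return rows


if __name__ == "__main__":
    for name, latency, skew, reported in summarize(REPORTS):
        print(f"{name:>8}  max latency {latency:8.2f}  "
              f"recomputed skew {skew:7.2f}  reported {reported:7.2f}")
```

The recomputed values agree with the reported ones to within about 0.01 ps, which is consistent with the printed terms being rounded to two decimals.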

In summary, the default tool setting seems to provide a reasonably balanced latency and skew. To improve the clock latency, you can try using only LVT cells for clock buffers and adding even stronger buffers beyond 24X.

As for the clock skew, fixing the max transition violations in CTS may improve it. The tool is already inserting 13070 dummy loads to balance load caps.

Thanks.

oharboe commented 3 months ago

Silly question: why does clock latency matter?

Naively, from a performance point of view, for our design, it does not affect fMax. Skew matters for fMax.

Our situation is that of megaboom or designs/asap7/mock-cpu. There is a CPU core clock and a bus clock running at the same frequency, but there is an asynchronous clock crossing between the core clock and system bus clock domains, so the clock latency of the core clock is invisible from the outside. No core pins are timed relative to the core clock; all pins are relative to the system bus clock.

If there were a knob going from 0 to 1, where 0 means optimize skew only and 1 means optimize latency only, then this knob would be at 0.5 by default now. We would crank it to 0.
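
To make the knob idea concrete, here is a toy sketch. It is purely hypothetical: OpenROAD has no such option as far as this thread shows, and the linear cost function and the names `cost` and `best_run` are made up for illustration. The numbers are the max latency and setup skew from the three runs reported earlier in this thread.

```python
# Hypothetical illustration only; this is not an OpenROAD feature.
# Latency and skew (ps) are taken from the three reports earlier in the thread.
RUNS = {
    "default": {"latency": 1594.44, "skew": 313.02},
    "derate 0.5": {"latency": 1453.76, "skew": 364.04},
    "derate 0.0": {"latency": 1453.76, "skew": 430.42},
}


def cost(run, knob):
    """Assumed linear blend: knob=0 scores skew only, knob=1 scores latency only."""
    if not 0.0 <= knob <= 1.0:
        raise ValueError("knob must be between 0 and 1")
    return (1.0 - knob) * run["skew"] + knob * run["latency"]


def best_run(runs, knob):
    """Name of the run with the lowest blended cost (first one wins on a tie)."""
    return min(runs, key=lambda name: cost(runs[name], knob))


if __name__ == "__main__":
    for knob in (0.0, 1.0):
        print(f"knob={knob}: {best_run(RUNS, knob)}")
```

With the knob at 0 this picks the default run, since it has the lowest skew of the three. A real implementation would need to normalize the two terms, because raw latency and skew in ps are not directly comparable.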

QuantamHD commented 3 months ago

@oharboe I'm not an expert in this area, but a longer clock tree usually implies more structure, which means that you have more structures susceptible to on-chip variation (OCV).

The liberty files aren't true to life, and in fact are essentially just sampled from one particular distribution.

oharboe commented 3 months ago

Makes sense: longer latency, more clock uncertainty. Clock uncertainty effectively contributes to skew, from a performance (fMax) point of view.
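
A rough way to see this is to apply flat OCV derates to two clock paths of equal nominal latency. The sketch below is a simplification under stated assumptions: the 1.05 late and 0.95 early derate factors are invented for illustration and do not come from the MegaBoom setup, and `common_ps` stands in for the shared part of the clock path that CRPR would remove.

```python
def ocv_skew_ps(latency_ps, common_ps=0.0, late_derate=1.05, early_derate=0.95):
    """Extra skew (ps) between two nominally balanced clock paths.

    One path is scaled by late_derate and the other by early_derate, and
    only the non-common portion of the path contributes. The derate values
    are assumptions for illustration, not values from any real library.
    """
    if common_ps < 0.0 or common_ps > latency_ps:
        raise ValueError("common_ps must be between 0 and latency_ps")
    non_common_ps = latency_ps - common_ps
    return non_common_ps * (late_derate - early_derate)


if __name__ == "__main__":
    # Max latencies from the reports earlier in the thread, with no shared path.
    for latency in (1453.76, 1594.44):
        print(f"latency {latency:8.2f} ps -> OCV skew {ocv_skew_ps(latency):6.2f} ps")
```

Under this model the OCV contribution grows linearly with the non-common latency, so a perfectly balanced tree still pays more skew as it gets deeper.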