The-OpenROAD-Project / OpenROAD

OpenROAD's unified application implementing an RTL-to-GDS Flow. Documentation at https://openroad.readthedocs.io/en/latest/
https://theopenroadproject.org/
BSD 3-Clause "New" or "Revised" License

Account for macro clock network latency in CTS optimization #3759

Closed oharboe closed 7 months ago

oharboe commented 1 year ago

Description

Otherwise, you can end up with a lot of skew.

Suggested Solution

@maliberty You had some ideas?

Additional Context

No response

maliberty commented 1 year ago

There are two parts here

1) When building a .lib abstract for a block, we should record the clock tree delay. The exact property is to be provided by My unless someone knows it offhand.

2) When building the clock tree at the top level, the latency inside the macro should be taken into account.

maliberty commented 1 year ago

From https://www.eng.biu.ac.il/temanad/files/2017/02/Lecture-8-CTS.pdf image

This shows the idea, though we don't want to use this command; we want to automate it through .lib.

maliberty commented 1 year ago

For the related .lib attributes:

max_clock_tree_path
Used in timing groups under a clock pin. Defines the maximum clock tree path constraint.
min_clock_tree_path
Used in timing groups under a clock pin. Defines the minimum clock tree path constraint.

maliberty commented 1 year ago

@louiic to write a more detailed spec

oharboe commented 1 year ago

@maliberty @tspyrou Is this more detailed spec in place?

maliberty commented 1 year ago

My suggested that the necessary data can be carried in Liberty. We are working on enhancing sta to generate the necessary data. After that we can start on CTS.

precisionmoon commented 8 months ago

We plan to enhance insertion delay support as follows:

1) Enhance H-Tree to pull macros with insertion delays ahead of FF leaves
2) Enhance OpenSTA to include insertion delay in report_clock_skew
3) Enhance clock tree viewer to include insertion delays (optional)

#1 and #2 are targeted for the end of February. If needed, #3 can be done in mid-March.

Thanks.

Cho

precisionmoon commented 8 months ago

@oharboe, one way to improve skew in the presence of FFs and macro cells (or macro cells with different insertion delays) is to balance latency by inserting additional buffers. If there is a path from one clock buffer to a FF and another to a macro cell, some additional buffers can be added along the path from the clock buffer to the FF to match the macro cell insertion delay. This can produce better clock skew at the expense of area and power. Is this an acceptable solution?
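
To make the trade-off concrete, here is a minimal Python sketch of the idea, not OpenROAD's implementation. It assumes that a sink's effective latency is the clock arrival at its pin plus any insertion delay inside the macro, and that skew is the spread of those latencies. The `Sink` class, the helper names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sink:
    """Hypothetical clock sink: a register or a macro clock pin."""
    name: str
    pin_arrival_ps: float            # clock arrival at the sink's clock pin
    insertion_delay_ps: float = 0.0  # clock delay inside a macro; 0 for a register

    @property
    def latency_ps(self) -> float:
        # Effective latency seen by the flip-flops behind this sink.
        return self.pin_arrival_ps + self.insertion_delay_ps


def skew_ps(sinks):
    """Skew as the spread between the latest and earliest effective latency."""
    latencies = [s.latency_ps for s in sinks]
    return max(latencies) - min(latencies)


def pad_registers(sinks, pad_ps):
    """Model extra buffers on register paths by delaying their pin arrival."""
    return [
        Sink(s.name, s.pin_arrival_ps + pad_ps, s.insertion_delay_ps)
        if s.insertion_delay_ps == 0.0
        else s
        for s in sinks
    ]


# Illustrative values only: two registers and one macro with 70 ps inside it.
sinks = [
    Sink("ff_a", 100.0),
    Sink("ff_b", 105.0),
    Sink("macro_0", 100.0, insertion_delay_ps=70.0),
]
skew_before = skew_ps(sinks)
skew_after = skew_ps(pad_registers(sinks, 70.0))
print(skew_before, skew_after)
```

In this toy case the pin arrivals look balanced, but the macro's internal delay produces the skew; padding the register paths removes most of it, at the area and power cost mentioned above.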

precisionmoon commented 8 months ago

Here's an intermediate update on the progress.

Part 3: the clock tree viewer enhancement is now complete. Macro sinks and register sinks are now colored differently, and insertion delays are included in the macro cell pin arrival.

Part 1: the core HTree enhancement is about 60% complete. Even though the latency adjustment work is incomplete, asap7/mock-array shows promising results. Here's the current clock tree without the new enhancement. Red sinks are registers and dark cyan sinks are macros.

image

The clock tree changes as follows with the new enhancement (without the -balance_levels option).

image

| sTNS / skew | CTS Pre repair | CTS Post repair | CTS Final | GR | Final | Final power |
| -- | -- | -- | -- | -- | -- | -- |
| No ins delay, no bal levels | -48K / 54.90 | -13.3K / 71.01 | -21.4K / 73.56 | 0.0 / 64.77 | 0.0 / 70.13 | 3.87 mW |
| Ins delay, no bal levels | -107K / -27.55 | -3.6K / -26.39 | -3.6K / -41.57 | -104.07 / -49.97 | 0.0 / -51.58 | 3.59 mW |

Even in the current incomplete state, setup TNS is 6X better after CTS final step and total power is 7% better after DR. The remaining work involves adding buffers to balance macro cell path latencies and register path latencies. This will be done by looking at actual path delays, not logic levels.

The project is on track for end of February delivery.

oharboe commented 8 months ago

@precisionmoon It would be nice to have the clock insertion point for the macro AND the clock insertion latency within the macro indicated in the CTS viewer. The clock insertion latency within the macro could perhaps be indicated as a flip-flop after the macro clock insertion point?

The flip-flops of the current design and the flip-flops of the macros should then all line up with a minimum of skew (assuming we try to optimize for minimum skew; the optimal skew is more complicated...).

precisionmoon commented 8 months ago

@oharboe, yes we can enhance the clock tree viewer to highlight the insertion delay. For now, we'd like to focus on the core task of improving the clock skew. We'll revisit clock tree viewer later. Is this OK?

oharboe commented 8 months ago

@precisionmoon I ran make DESIGN_CONFIG=designs/asap7/mock-array/config.mk. I can see changes in the rendering of the clock tree, but I don't see the separation of macros and flip-flops at the root of the tree.

image

precisionmoon commented 8 months ago

@oharboe, did you enable -insertion_delay option? This is not enabled by default yet.

oharboe commented 8 months ago

> @oharboe, did you enable -insertion_delay option? This is not enabled by default yet.

No. How do I enable it?

precisionmoon commented 8 months ago

You modify flow/scripts/cts.tcl.

On Mon, Feb 5, 2024 at 11:12 AM Øyvind Harboe wrote:

> No. How do I enable it?

oharboe commented 8 months ago

@precisionmoon So I get this image below :+1:

What is this image telling me?

I think it is telling me, from looking at the

image

image

image

precisionmoon commented 8 months ago

@oharboe,

> Looking at an Element, the clock insertion latency is around 442.233ps. However, the clock insertion latency within the Element is in addition to this.

No, the arrival at macro pin of 442.233 ps already includes insertion delay of ~70 ps for the Element. This is part of my recent GUI change.

    cell ("Element") {
      interface_timing : true;
      pin("clock") {
        direction : input;
        clock : true;
        capacitance : 3.6238;
        timing() {
          timing_sense : positive_unate;
          timing_type : min_clock_tree_path;
          cell_rise(scalar) {
            values("70.88061");
          }
          cell_fall(scalar) {
            values("72.71306");
          }
        }
      }

Now, I'm trying to balance the latencies by inserting buffers at the common point. This is more efficient than adding buffers along individual latency paths. Does this make sense?
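
For readers who want to check what insertion delay a generated .lib records, the following Python sketch pulls the scalar rise and fall values out of a `min_clock_tree_path` timing group. It is a simple regex illustration over the excerpt above, not a Liberty parser and not how OpenSTA reads the file; the function name is hypothetical, and it assumes scalar tables and that the rise/fall entries follow the `timing_type` line.

```python
import re

# Excerpt of the "Element" cell from the comment above (cell group left open
# exactly as in the thread, since only part of the cell was quoted).
LIBERTY_EXCERPT = '''
cell ("Element") {
  interface_timing : true;
  pin("clock") {
    direction : input;
    clock : true;
    capacitance : 3.6238;
    timing() {
      timing_sense : positive_unate;
      timing_type : min_clock_tree_path;
      cell_rise(scalar) { values("70.88061"); }
      cell_fall(scalar) { values("72.71306"); }
    }
  }
'''


def clock_tree_path_delays(liberty_text, timing_type="min_clock_tree_path"):
    """Return scalar cell_rise/cell_fall values that follow the given timing_type.

    Assumption: the tables are scalar and appear after the timing_type line;
    a real flow should use a proper Liberty reader instead.
    """
    marker = re.search(
        r"timing_type\s*:\s*" + re.escape(timing_type) + r"\s*;", liberty_text
    )
    if marker is None:
        return {}
    tail = liberty_text[marker.end():]
    delays = {}
    for name in ("cell_rise", "cell_fall"):
        match = re.search(
            name + r'\s*\(\s*scalar\s*\)\s*\{\s*values\s*\(\s*"([^"]+)"\s*\)',
            tail,
        )
        if match:
            delays[name] = float(match.group(1))
    return delays


delays = clock_tree_path_delays(LIBERTY_EXCERPT)
print(delays)
```

The values come back in the library's time unit, which the excerpt does not show; the thread treats them as roughly 70 ps.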

oharboe commented 8 months ago

Yes. As we discussed, a future GUI improvement would be a visual cue/indication of min_clock_tree_path.

Looking forward to further improvements.

rovinski commented 8 months ago

Maybe one way to represent it is to stretch the cyan rectangle vertically? The top of the rectangle would be when the clock arrives at the port and then the bottom would be after the insertion delay. Like so

image

Although the registers don't do that and I don't think they could scale to that. Another option is to maybe drop a whisker like so?

image

precisionmoon commented 8 months ago

The delay buffer insertion feature was checked in today as part of https://github.com/The-OpenROAD-Project/OpenROAD/pull/4607. Delay buffers are added so that the average latency of all macro cell paths matches that of the register cell paths. For example, if the difference between macro cell latencies and register cell latencies is 50 ps and each delay buffer adds 10 ps, we need 5 delay buffers to balance the latencies.

The -delay_buffer_derate option has been added to control the number of delay buffers. The default is 1.0, meaning all the intended delay buffers will be added; 0.0 means no delay buffers are added.

Here are some preliminary results on asap7/mock-array.


| sTNS / skew | CTS Pre repair | CTS Post repair | CTS Final | GR | Final | Final power |
| -- | -- | -- | -- | -- | -- | -- |
| Base, no insertion delay | -48K / 54.90 | -13.3K / 71.01 | -21.4K / 73.56 | 0.0 / 64.77 | 0.0 / 70.13 | 3.87 mW |
| Ins delay, 0 delay buffers (-insertion_delay -delay_buffer_derate 0.0) | -107K / -27.55 | -3.6K / -26.39 | -3.6K / -41.57 | -104.07 / -49.97 | 0.0 / -51.58 | 3.59 mW |
| Ins delay, 1 delay buffer (-insertion_delay -delay_buffer_derate 0.2) | -106K / -20.51 | -3.29K / -21.57 | -3.2K / -36.47 | -40.14 / -45.99 | 0.0 / -47.81 | 3.61 mW |
| Ins delay, 2 delay buffers (-insertion_delay -delay_buffer_derate 0.5) | -107K / -27.67 | -3.7K / -28.08 | -3.7K / -43.55 | -178.66 / -52.72 | 0.0 / -52.53 | 3.65 mW |
| Ins delay, 5 delay buffers (-insertion_delay -delay_buffer_derate 1.0) | -110K / -65.63 | -6.2K / -66.36 | -6.2K / -82.42 | -2338.12 / -92.09 | -1382.6 / -88.94 | 3.79 mW |

Clock tree with 5 delay buffers looks as follows: image

Clock tree with 1 delay buffer looks as follows: image

This looks similar to the clock tree without any delay buffers because post CTS repair step adds some additional buffers.

Some more work is needed to tune the algorithm to produce the best skew and timing QoR.
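
The buffer-count arithmetic from this comment can be sketched in a few lines of Python. This is an illustration only, not the code in the PR: the function name and latency lists are hypothetical, and rounding down after applying the derate is an assumption chosen because it reproduces the buffer counts quoted in the results (0.2 gives 1 buffer, 0.5 gives 2, 1.0 gives 5).

```python
import math
from statistics import mean


def delay_buffer_count(macro_latencies_ps, register_latencies_ps,
                       buffer_delay_ps, derate=1.0):
    """Estimate how many delay buffers balance the two average latencies.

    Assumption: the derated count is rounded down; the actual rule in the
    tool may differ.
    """
    gap_ps = abs(mean(macro_latencies_ps) - mean(register_latencies_ps))
    intended = gap_ps / buffer_delay_ps
    # Small epsilon guards against floating-point results just below an integer.
    return math.floor(intended * derate + 1e-9)


# Hypothetical latencies giving the 50 ps gap and 10 ps buffers of the example.
macro_paths = [150.0, 150.0]
register_paths = [100.0, 100.0]

counts = {
    derate: delay_buffer_count(macro_paths, register_paths, 10.0, derate)
    for derate in (0.0, 0.2, 0.5, 1.0)
}
print(counts)
```

With the default derate of 1.0 this gives the 5 buffers from the example, and lower derates scale that number down.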

precisionmoon commented 7 months ago

We enhanced the feature to handle macro trees with a single sink. Latency adjustment now happens in such scenarios as well.

image

The feature is also enabled by default now, so -insertion_delay is no longer needed. To disable the feature, use the -no_insertion_delay option. Since the PR https://github.com/The-OpenROAD-Project/OpenROAD/pull/4678 has been merged, we're closing this issue.