Timing driven global placement doesn't do a great job on critical multi pin nets

antonblanchard commented 2 years ago

I'm trying to improve the fmax of a multiplier. The critical path looks like:

Startpoint: _1590_ (falling edge-triggered flip-flop clocked by clk')
Endpoint: _2267_ (falling edge-triggered flip-flop clocked by clk')
Path Group: clk
Path Type: max

Fanout     Cap    Slew   Delay    Time   Description
-----------------------------------------------------------------------------
                  0.00    0.00    0.00   clock clk' (fall edge)
                          0.00    0.00   clock network delay (ideal)
                  0.00    0.00    0.00 v _1590_/CLK (DFFLQNx1_ASAP7_75t_R)
                 12.55   33.80   33.80 v _1590_/QN (DFFLQNx1_ASAP7_75t_R)
     1    0.69                           _0765_ (net)
                 12.55    0.00   33.80 v _1553_/A (INVx1_ASAP7_75t_R)
                221.02   87.29  121.10 ^ _1553_/Y (INVx1_ASAP7_75t_R)
    14   34.01                           pp_row50_0 (net)
                227.27   20.34  141.44 ^ dadda_fa_0_38_0/A (FAx1_ASAP7_75t_R)
                 50.77   66.63  208.07 v dadda_fa_0_38_0/SN (FAx1_ASAP7_75t_R)
     1    0.79                           sn$86 (net)
                 50.77    0.01  208.08 v U$$1311/A (INVx1_ASAP7_75t_R)
                 25.47   20.48  228.56 ^ U$$1311/Y (INVx1_ASAP7_75t_R)
     1    2.24                           s$333 (net)
                 25.47    0.03  228.59 ^ dadda_fa_1_38_3/B (FAx1_ASAP7_75t_R)
                 23.07   37.21  265.80 v dadda_fa_1_38_3/SN (FAx1_ASAP7_75t_R)
     1    0.77                           sn$331 (net)
                 23.07    0.01  265.81 v U$$1487/A (INVx1_ASAP7_75t_R)
                 16.86   13.42  279.23 ^ U$$1487/Y (INVx1_ASAP7_75t_R)
     1    1.84                           s$749 (net)
                 16.86    0.07  279.30 ^ dadda_fa_2_38_2/CI (FAx1_ASAP7_75t_R)
                 30.16   32.53  311.83 ^ dadda_fa_2_38_2/SN (FAx1_ASAP7_75t_R)
     1    0.74                           sn$746 (net)
                 30.16    0.01  311.84 ^ U$$1721/A (INVx1_ASAP7_75t_R)
                 15.94   12.70  324.54 v U$$1721/Y (INVx1_ASAP7_75t_R)
     1    1.76                           s$1213 (net)
                 15.94    0.05  324.59 v dadda_fa_3_38_1/CI (FAx1_ASAP7_75t_R)
                 29.64   19.21  343.79 ^ dadda_fa_3_38_1/CON (FAx1_ASAP7_75t_R)
     1    0.77                           con$1209 (net)
                 29.64    0.01  343.80 ^ U$$1944/A (INVx1_ASAP7_75t_R)
                 18.06   14.23  358.03 v U$$1944/Y (INVx1_ASAP7_75t_R)
     1    2.23                           c$1563 (net)
                 18.06    0.02  358.05 v dadda_fa_4_39_0/B (FAx1_ASAP7_75t_R)
                 30.12   21.81  379.86 ^ dadda_fa_4_39_0/CON (FAx1_ASAP7_75t_R)
     1    0.82                           con$1560 (net)
                 23.32   14.61  394.48 v dadda_fa_4_39_0/SN (FAx1_ASAP7_75t_R)
     1    0.79                           sn$1561 (net)
                 23.32    0.01  394.49 v U$$2101/A (INVx1_ASAP7_75t_R)
                 19.70   15.04  409.54 ^ U$$2101/Y (INVx1_ASAP7_75t_R)
     1    2.34                           s$1851 (net)
                 19.71    0.27  409.80 ^ dadda_fa_5_39_0/CI (FAx1_ASAP7_75t_R)
                 31.58   34.20  444.00 ^ dadda_fa_5_39_0/SN (FAx1_ASAP7_75t_R)
     1    0.80                           sn$1848 (net)
                 31.58    0.02  444.02 ^ U$$2223/A (INVx1_ASAP7_75t_R)
                 10.22    8.46  452.47 v U$$2223/Y (INVx1_ASAP7_75t_R)
     1    0.70                           s$2670 (net)
                 10.22    0.01  452.48 v _2267_/D (DFFLQNx1_ASAP7_75t_R)
                                452.48   data arrival time

                  0.00  300.00  300.00   clock clk' (fall edge)
                          0.00  300.00   clock network delay (ideal)
                          0.00  300.00   clock reconvergence pessimism
                                300.00 v _2267_/CLK (DFFLQNx1_ASAP7_75t_R)
                         -9.87  290.13   library setup time
                                290.13   data required time
-----------------------------------------------------------------------------
                                290.13   data required time
                               -452.48   data arrival time
-----------------------------------------------------------------------------
                               -162.36   slack (VIOLATED)

Timing driven global placement is doing a good job in general (significantly better fmax than with it disabled). Looking at debug data out of global placement, we did identify this path as the worst and apply the highest weighting to all nets. All the nets in this critical path other than the high fan out pp_row50_0 are very short as we'd hope.

So what went wrong with pp_row50_0? Even though pp_row50_0 had the maximum timing driven weighting adjustment (x1.9), it still ends up being very large. This is pp_row50_0:

baseline

As an experiment, I scaled the timing driven weights by the number of pins on the net, with the hope of applying more "pressure" to those nets to reduce their size:

diff --git a/src/gpl/src/timingBase.cpp b/src/gpl/src/timingBase.cpp
index 0e5ccb056..201bdb154 100644
--- a/src/gpl/src/timingBase.cpp
+++ b/src/gpl/src/timingBase.cpp
@@ -178,7 +178,7 @@ TimingBase::updateGNetWeights(float overflow) {
         // weight(min_slack) = net_weight_max_
         // weight(max_slack) = 1
         float weight = 1 + (net_weight_max_ - 1)
-          * (slack_max - net_slack) / (slack_max - slack_min);
+          * (slack_max - net_slack) / (slack_max - slack_min) * (gNet->gPins().size()-1);
         gNet->setTimingWeight(weight);
         weighted_net_count++;
       }

This helped a lot:

net-degree-weighting

And the STA output after global placement shows a significant improvement (~25ps), most of which came from improvements in pp_row50_0:

==========================================================================
global place report_checks -path_delay max
--------------------------------------------------------------------------
Startpoint: _1590_ (falling edge-triggered flip-flop clocked by clk')
Endpoint: _2265_ (falling edge-triggered flip-flop clocked by clk')
Path Group: clk
Path Type: max

Fanout     Cap    Slew   Delay    Time   Description
-----------------------------------------------------------------------------
                  0.00    0.00    0.00   clock clk' (fall edge)
                          0.00    0.00   clock network delay (ideal)
                  0.00    0.00    0.00 v _1590_/CLK (DFFLQNx1_ASAP7_75t_R)
                 12.56   33.81   33.81 v _1590_/QN (DFFLQNx1_ASAP7_75t_R)
     1    0.69                           _0765_ (net)
                 12.56    0.00   33.81 v _1553_/A (INVx1_ASAP7_75t_R)
                158.64   70.92  104.73 ^ _1553_/Y (INVx1_ASAP7_75t_R)
    14   25.16                           pp_row50_0 (net)
                158.64    0.60  105.33 ^ dadda_fa_0_36_0/A (FAx1_ASAP7_75t_R)
                 44.35   61.34  166.67 v dadda_fa_0_36_0/SN (FAx1_ASAP7_75t_R)
     1    0.86                           sn$72 (net)
                 44.35    0.03  166.69 v U$$1297/A (INVx1_ASAP7_75t_R)
                 23.03   18.56  185.25 ^ U$$1297/Y (INVx1_ASAP7_75t_R)
     1    2.04                           s$305 (net)
                 23.03    0.01  185.27 ^ dadda_fa_1_36_3/A (FAx1_ASAP7_75t_R)
                 32.53   43.39  228.66 v dadda_fa_1_36_3/SN (FAx1_ASAP7_75t_R)
     1    1.20                           sn$304 (net)
                 32.53    0.09  228.75 v U$$1471/A (INVx1_ASAP7_75t_R)
                 19.05   15.33  244.08 ^ U$$1471/Y (INVx1_ASAP7_75t_R)
     1    1.81                           s$719 (net)
                 19.05    0.06  244.15 ^ dadda_fa_2_36_2/CI (FAx1_ASAP7_75t_R)
                 31.41   34.15  278.30 ^ dadda_fa_2_36_2/SN (FAx1_ASAP7_75t_R)
     1    0.81                           sn$716 (net)
                 31.41    0.02  278.32 ^ U$$1709/A (INVx1_ASAP7_75t_R)
                 16.12   12.85  291.17 v U$$1709/Y (INVx1_ASAP7_75t_R)
     1    1.75                           s$1193 (net)
                 16.12    0.04  291.21 v dadda_fa_3_36_1/CI (FAx1_ASAP7_75t_R)
                 29.53   19.17  310.38 ^ dadda_fa_3_36_1/CON (FAx1_ASAP7_75t_R)
     1    0.76                           con$1189 (net)
                 29.53    0.01  310.39 ^ U$$1936/A (INVx1_ASAP7_75t_R)
                 18.49   14.50  324.89 v U$$1936/Y (INVx1_ASAP7_75t_R)
     1    2.32                           c$1553 (net)
                 18.49    0.06  324.95 v dadda_fa_4_37_0/B (FAx1_ASAP7_75t_R)
                 38.06   45.07  370.02 v dadda_fa_4_37_0/SN (FAx1_ASAP7_75t_R)
     1    1.44                           sn$1551 (net)
                 38.06    0.15  370.17 v U$$2097/A (INVx1_ASAP7_75t_R)
                 19.61   16.03  386.20 ^ U$$2097/Y (INVx1_ASAP7_75t_R)
     1    1.69                           s$1841 (net)
                 19.61    0.02  386.22 ^ dadda_fa_5_37_0/CI (FAx1_ASAP7_75t_R)
                 31.45   34.14  420.36 ^ dadda_fa_5_37_0/SN (FAx1_ASAP7_75t_R)
     1    0.80                           sn$1838 (net)
                 31.45    0.02  420.37 ^ U$$2219/A (INVx1_ASAP7_75t_R)
                  9.92    8.23  428.60 v U$$2219/Y (INVx1_ASAP7_75t_R)
     1    0.65                           s$2666 (net)
                  9.92    0.00  428.60 v _2265_/D (DFFLQNx1_ASAP7_75t_R)
                                428.60   data arrival time

                  0.00  300.00  300.00   clock clk' (fall edge)
                          0.00  300.00   clock network delay (ideal)
                          0.00  300.00   clock reconvergence pessimism
                                300.00 v _2265_/CLK (DFFLQNx1_ASAP7_75t_R)
                         -9.84  290.16   library setup time
                                290.16   data required time
-----------------------------------------------------------------------------
                                290.16   data required time
                               -428.60   data arrival time
-----------------------------------------------------------------------------
                               -138.44   slack (VIOLATED)

Any thoughts? While this highlights the issue, I'm not confident I'm fixing it in the right spot. For example, could global placement have an issue with multi pin nets in general (and not just in the timing driven path)?

Test case (I'm running it using ORFS): 32bit_4cycle_asap7_multiplier.tar.gz

QuantamHD commented 2 years ago

@antonblanchard can you run this same experiment with just the initial placement? It specifically does not include high fanout nets because it supposedly makes it harder to converge.

Maybe we could address that too. Also it looks like because this net is high fanout you're losing a lot to capacitance. Can you run a round of resizing on both and see if there's still an improvement?

antonblanchard commented 2 years ago

@antonblanchard can you run this same experiment with just the initial placement? It specifically does not include high fanout nets because it supposedly makes it harder to converge.

@QuantamHD did you mean disabling initial placement (-skip_initial_place)? I just tried that and it made things slightly worse. It looks like the default limit for initial placement is a fan out of 200 which this design is comfortably under.

Maybe we could address that too. Also it looks like because this net is high fanout you're losing a lot to capacitance. Can you run a round of resizing on both and see if there's still an improvement?

The somewhat interesting thing about this design is that the resizer is unable to do much to help (that's why I've ended up back in global placement looking for opportunities). It does add some fan out buffering, which helps but we still have the capacitance of a long wire to combat. Cell upsizing would normally improve this but it fails to here, I think because ASAP7 only has one size of full and half adder. It does try to resize the inverters in the path, but ultimately the delay calculations are all worse for those options.

If we had larger full and half adders I'm guessing we'd be able to upsize the cells in this entire path and somewhat overcome the extra capacitance on the high fanout net.

Having said all that, I do need to go back and see if there are improvements we can make to the resizer.

maliberty commented 2 years ago

Initial placement skips large nets as they tend not to be helpful rather than for convergence reason. The clock tree hasn't been inserted at this point so pulling all the registers together isn't useful. Likewise with other global nets like resets.

QuantamHD commented 2 years ago

@maliberty It feels like we could use the multicycle info / SDC info to ignore those nets in particular, but it sounds like it still might be helpful.

QuantamHD commented 2 years ago

Maybe we need to integrate the resizer into placement. Seems like that would make sense at least from a gate sizing perspective.

Perhaps adding a SA pass in the Placer to see if we could get any wins

antonblanchard commented 2 years ago

Here's an animated gif of net pp_row50_0 as we progress through global placement. Timing driven mode was disabled.

Things start off good, with all pins in the net remaining close together as the net moves left. At some point it goes bad, and the net slowly grows in size to become very large:

pp_row50_0

I don't understand how we create wire estimates for a multi pin net, but perhaps we have a bug there.

maliberty commented 2 years ago

We use an approximation of HPWL for the net length (the approximation is smoothly differentiable which HPWL is not). See 2.3 of https://cseweb.ucsd.edu/~jlu/papers/eplace-todaes14/paper.pdf - we use the WA metric.

maliberty commented 2 years ago

All the fanouts are ha/fa cells. That means they have 4/5 signal pins each of which is exerting a pull on the instance, not just this net. Contracting this net may mean lengthening many others. I suspect this is the main issue. The 1.9X cap means the critical net is limited in how hard it can pull. You might experiment with using a higher limit and see what happens overall.

All instances in the middle don't contribute to the hpwl and will have little incentive to move until they become near the bbox edge.

QuantamHD commented 2 years ago

@antonblanchard one other thing you might try is logic duplication. Which you should be able to try with just a few odb calls.

Try splitting one of the high fanout ha/fa's into 1-N duplicates cells driven by the same inputs. And break the net into N equal partitions. This is another optimization that should be in the resizer tool box that could help in this situation

maliberty commented 2 years ago

Note that multiple pins on this instance are critical and have high weight:

maliberty commented 2 years ago

Maybe we need to integrate the resizer into placement. Seems like that would make sense at least from a gate sizing perspective.

Timing driven replace calls the resizer in 'virtual' mode where we do the resize, estimate the timing, and then undo the resize. We want the timing to reflect the resizer but don't want to lock in the results too early.

maliberty commented 2 years ago

I suspect you would benefit from some io placement constraints. They are quite scattered.

QuantamHD commented 2 years ago

@maliberty Ah neat! Didn't know that, or maybe forgot. I think you or Tom mentioned that to me

maliberty commented 2 years ago

There is an argument for really resizing once you get near the end to incorporate the area impact of resizing. Its an open area for investigation.

antonblanchard commented 2 years ago

All the fanouts are ha/fa cells. That means they have 4/5 signal pins each of which is exerting a pull on the instance, not just this net. Contracting this net may mean lengthening many others. I suspect this is the main issue. The 1.9X cap means the critical net is limited in how hard it can pull. You might experiment with using a higher limit and see what happens overall.

I added an option to modify the multiplier in https://github.com/The-OpenROAD-Project/OpenROAD/pull/2301 . Testing various values shows that a much higher value (15) helps a lot. Note that this graph includes a multiplier of 0, which effectively disables timing mode.

graph

maliberty commented 2 years ago

@mgwoo how was 1.9 picked as the max net weight for TD?

antonblanchard commented 2 years ago

There is an argument for really resizing once you get near the end to incorporate the area impact of resizing. Its an open area for investigation.

I had a related thought - that it might be useful to apply the fan out modifications earlier (even before we start global placement). They don't need any placement parasitics to be calculated.

maliberty commented 2 years ago

There is an argument for really resizing once you get near the end to incorporate the area impact of resizing. Its an open area for investigation.

I had a related thought - that it might be useful to apply the fan out modifications earlier (even before we start global placement). They don't need any placement parasitics to be calculated.

My feeling is that they should only be used by synthesis and ignored in OR. They are just a rough guide and once you have placement you should start using the cap/slew checks which are more electrically meaningful.

maliberty commented 2 years ago

It leads to silliness like asking if an antenna diode counts as a fanout or not...

antonblanchard commented 2 years ago

My feeling is that they should only be used by synthesis and ignored in OR. They are just a rough guide and once you have placement you should start using the cap/slew checks which are more electrically meaningful.

ABC does do fan out modifications, but I'm not sure how we could easily keep them and strip all the other buffer modifications it does. If we had something like repair_design -fanout, we could do it right after remove_buffers and achieve a similar result.

maliberty commented 2 years ago

We don't keep the abc buffering in general so I'm not sure why it is more important here. I don't think we need to repair fanout at that early point and are better off doing repairs when we have something more realistic to base them off of.

QuantamHD commented 2 years ago

I'm generally not in favor of adding more complexity to the flow. It feels like a fanout fix should be a part of the gpl optimization loop if it thinks it would help

QuantamHD commented 2 years ago

Found an article from David Pan describing a scheme to dynamically weight nets on page 6 under Sensitivity Based Net Weighting https://www.cerc.utexas.edu/utda/publications/book_tdp.pdf

Perhaps this would yield better results than the static value

antonblanchard commented 2 years ago

There is an argument for really resizing once you get near the end to incorporate the area impact of resizing. Its an open area for investigation.

I had a quick look at this. As an experiment, I created a simple placement test:

global_placement ...

estimate_parasitics -placement

buffer_ports

repair_design

And modified it with another incremental global placement pass, and a repair_design pass in case that created violations:

global_placement ...

estimate_parasitics -placement

buffer_ports

repair_design

global_placement \
    -init_density_penalty 0.1 \
    -incremental \
    ...

repair_design

Critical path slack improved in all non trivial ASAP7 test cases:

        baseline  modified
aes     -292.34   -203.30
ethmac  -221.82   -134.33
ibex    -164.26   -121.51
jpeg    -157.23    -64.35
sha3    -370.17   -223.34

gcd and uart are trivial, but it's good to see that things didn't get worse:

gcd     -187.36   -184.09
uart    -119.17   -118.14

I presume we'd want to build this all into a single global placement pass, but this at least confirms there are some gains to be had. Also no idea what -init_density_penalty is, so I grabbed a value from one of the test cases.

QuantamHD commented 2 years ago

@arlpetergadfort didn't you mention to me at one point that the incremental placement could be much improved by no doing the initial placement again?

gadfort commented 2 years ago

@QuantamHD yeah, atleast something like that, in my experience incremental global placement with OpenROAD results in a repeat of an initial global placement (ironically takes longer because it's spending time undoing it's original placement). Ideally, incremental global placement would work without need for extra parameters that can be difficult to determine. I should clarify that it's not redoing the IP step, but just that all instances more or less end up moving back to their initial position and back out again.

The-OpenROAD-Project / OpenROAD

Timing driven global placement doesn't do a great job on critical multi pin nets #2297