The-OpenROAD-Project / OpenROAD

OpenROAD's unified application implementing an RTL-to-GDS Flow. Documentation at https://openroad.readthedocs.io/en/latest/
https://theopenroadproject.org/
BSD 3-Clause "New" or "Revised" License

Question: ispd24 mempool_group 3M std cells require 85 cpu hours to complete pin access, where ? #5044

Closed stefanottili closed 1 week ago

stefanottili commented 2 weeks ago

Description

85 cpu hours to create pin access for 3M placed instances sounds like a lot. I'm trying to get a feeling for what the runtime is spent on, but currently the individual steps outlined in the paper don't report their runtime. Who would know? Any low-hanging fruit to speed up this step?

https://vlsicad.ucsd.edu/Publications/Journals/j133.pdf

A. Data preparation
...
3) Region Query: Region query is the data structure for fast shape queries. ...
4) LUT Generation:
5) Pin access analysis: For each pin, we generate at least K access points
(K = 3 in our implementation) using the pin access analysis methodology from [19].
An access point is an x-y coordinate on a metal layer where the detailed router ends routing.
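The access-point idea quoted above can be sketched roughly as picking on-track points inside each pin shape and keeping the best K. A toy illustration only, not the actual pin-access code: the function name, track pitches, and coordinates are all invented.

```python
# Illustrative sketch (not OpenROAD code): generate candidate access
# points for one pin shape by intersecting it with routing-track grids.

def gen_access_points(pin_rect, x_tracks, y_tracks, k=3):
    """Return up to k (x, y) on-track points inside the pin rectangle."""
    xlo, ylo, xhi, yhi = pin_rect
    points = [(x, y)
              for x in x_tracks if xlo <= x <= xhi
              for y in y_tracks if ylo <= y <= yhi]
    # A real implementation would rank candidates by DRC cleanliness and
    # via accessibility; here we simply keep the first k on-track hits.
    return points[:k]

x_tracks = range(0, 1000, 190)   # hypothetical vertical-track pitch
y_tracks = range(0, 1000, 140)   # hypothetical horizontal-track pitch
print(gen_access_points((100, 100, 500, 400), x_tracks, y_tracks))
```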

Is this "for each pin" for each pin of every cell, or for each pin of every instance?

#scanned instances     = 3099210
#unique  instances     = 555
#stdCellGenAp          = 16776
#stdCellValidPlanarAp  = 0
#stdCellValidViaAp     = 11479
#stdCellPinNoAp        = 0
#stdCellPinCnt         = 10978298
#instTermValidViaApCnt = 0
#macroGenAp            = 19262
#macroValidPlanarAp    = 19262
#macroValidViaAp       = 0
#macroNoAp             = 0
[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 84:59:19, elapsed time = 03:09:21, memory = 37294.33 (MB), peak = 39299.85 (MB)
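As a sanity check on numbers like these, the ratio of cpu time to elapsed time in the DRT-0267 line gives the effective parallelism of the run. A small sketch; the parsing helper and its name are mine, not part of OpenROAD:

```python
# Parse a DRT-0267 timing line and report how well the run parallelized
# (cpu time / elapsed time ~= number of concurrently busy threads).
import re

def parallel_efficiency(log_line):
    cpu, elapsed = re.findall(r"(\d+):(\d+):(\d+)", log_line)[:2]
    to_sec = lambda h, m, s: int(h) * 3600 + int(m) * 60 + int(s)
    return to_sec(*cpu) / to_sec(*elapsed)

line = ("[INFO DRT-0267] cpu time = 84:59:19, elapsed time = 03:09:21, "
        "memory = 37294.33 (MB), peak = 39299.85 (MB)")
print(f"{parallel_efficiency(line):.1f}")  # roughly 27 concurrent threads
```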

Suggested Solution

No response

Additional Context

No response

maliberty commented 2 weeks ago

There are various phases that try to be efficient but I would guess the last one where specific groups of abutting blocks are examined will take the longest. It wouldn't be hard to add some timers. Do you have this test case setup in ORFS?

stefanottili commented 2 weeks ago

I just ran

openroad -gui -threads max ispd24.or

Glancing at the code, I see there is a lot more info when switching on verbose for drt; I'll try that next, and run some of the smaller testcases.

read_lef lef/Nangate.lef.gz
read_def def/bsg_chip.def.gz

set_routing_layers -signal metal2-metal8 -clock metal6-metal7

global_route -verbose
detailed_route
maliberty commented 2 weeks ago

Where do you obtain the test case?

stefanottili commented 2 weeks ago

One link leads to the next … https://ispd.cc/ispd2024/index.php https://liangrj2014.github.io/ISPD24_contest https://drive.google.com/drive/folders/1ocChoQupNxlLBH2hgqkPwR0-7D3ocTwm

maliberty commented 2 weeks ago

When I load this I don't see 3M placed instances:

[INFO ODB-0131]     Created 714938 components and 4540386 component-terminals.
[INFO ODB-0133]     Created 768239 nets and 2717737 connections.
maliberty commented 2 weeks ago

nvm I see your script is not for this test case. It would be a lot easier if you had just packaged a test case.

stefanottili commented 2 weeks ago

I'm sorry about that; bsg_chip is the next smaller of the testcases. It shows the same behavior, but finishes pin access in 8 cpu hours with 9.5 GB of memory.

It seems the "pin groups" processing is the step that takes a lot of runtime and memory. bsg_chip has 1/4 the components of mempool_group, but only 1/10th the cpu time: 8 hours vs. 84 hours.

[INFO ODB-0131]     Created 714938 components and 4540386 component-terminals.
[INFO ODB-0133]     Created 768239 nets and 2717737 connections.
[INFO ODB-0134] Finished DEF file: def/bsg_chip.def.gz
..
[INFO DRT-0084]   Complete 668651 groups.
...
[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 08:31:10, elapsed time = 00:24:01, memory = 9448.54 (MB), peak = 9971.07 (MB)
...
[INFO DRT-0199]   Number of violations = 0. (after 10 iterations)
...
[INFO DRT-0267] cpu time = 10:08:17, elapsed time = 00:29:32, memory = 22309.75 (MB), peak = 23984.83 (MB)
   3441163 def/ariane133_68.def.gz  COMPONENTS  120202 ;
   3496189 def/ariane133_51.def.gz  COMPONENTS  121794 ;
   3700625 def/mempool_tile.def.gz  COMPONENTS  128515 ;
   4652916 def/nvdla.def.gz         COMPONENTS  166393 ;
  19112975 def/bsg_chip.def.gz      COMPONENTS  714938 ;
  96402592 def/mempool_group.def.gz COMPONENTS 3099210 ;
 281212654 def/cluster.def.gz       COMPONENTS 9876330 ;
maliberty commented 1 week ago

How many threads are you using when you observe this runtime?

stefanottili commented 1 week ago

I’ve used openroad -threads max on a 16-core/32-thread 7950X. This utilizes all 32 threads, 100% cpu usage when running “pin groups”. Ubuntu on WSL with 56 GB mem + 14 GB swap.

stefanottili commented 1 week ago

Just one more datapoint for bsg_chip, running on an M1 (4+4) MacBook, 16 GB, macOS Sonoma.

[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 06:50:42, elapsed time = 01:04:41, memory = 2266.61 (MB), peak = 6531.78 (MB)

[INFO DRT-0198] Complete detail routing.
[INFO DRT-0267] cpu time = 16:22:52, elapsed time = 02:20:42, memory = 6144.62 (MB), peak = 6811.89 (MB)

Please note that on Mac, openroad's memory reporting is way off; top and Activity Monitor reported usage of 16 GB.

maliberty commented 1 week ago

I ran vtune and I did see one surprise in the profile. I'm testing a fix now and measuring the benefit.

maliberty commented 1 week ago

With the merge of #5050 please try again and report what you see. I saw some benefit but the cpu times I got don't match yours (perhaps different hw?).

maliberty commented 1 week ago

I have a second smaller one to look at.

stefanottili commented 1 week ago

Yay, quite some improvement.

Going from a loop over all named pins to looking them up via an index clearly helped a lot.

Having said that, I'm pretty sure that something much better can be done algorithmically. It shouldn't take cpu hours to compute pin access points for 700K stdcells ...

7950X: 32 threads -> 4x faster cpu/elapsed
M1: 8 threads -> 2x faster cpu/elapsed.

7950X: < [INFO DRT-0267] cpu time = 08:31:10, elapsed time = 00:24:01, memory = 9448.54 (MB), peak = 9971.07 (MB)
7950X: > [INFO DRT-0267] cpu time = 02:17:07, elapsed time = 00:06:04, memory = 9413.39 (MB), peak = 9970.04 (MB)

M1: < [INFO DRT-0267] cpu time = 06:50:42, elapsed time = 01:04:41, memory = 2266.61 (MB), peak = 6531.78 (MB)
M1: > [INFO DRT-0267] cpu time = 03:43:15, elapsed time = 00:33:14, memory = 4834.28 (MB), peak = 7545.33 (MB)

M1: top/Activity Monitor show 14 GB memory usage. No idea why the two runs' reporting differs so much.
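The kind of fix described above can be illustrated with a toy sketch: a linear scan that string-compares every pin name versus a dict index built once. Data and function names here are invented for the example.

```python
# O(n)-per-lookup name scan vs. O(1) dict lookup after a one-time index build.
pins = [f"pin_{i}" for i in range(100_000)]

def find_linear(name):
    # Scans and string-compares every entry on every query.
    for i, p in enumerate(pins):
        if p == name:
            return i
    return -1

index = {p: i for i, p in enumerate(pins)}   # built once, O(n)

def find_indexed(name):
    # Hash lookup, constant time per query.
    return index.get(name, -1)

assert find_linear("pin_99999") == find_indexed("pin_99999") == 99999
```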

stefanottili commented 1 week ago

3M stdcells, 7950X, 32 threads: 43 cpu hours and 1.5 elapsed hours for "pin access analysis" is still way too slow.

I'm still trying to wrap my head around how this algorithm ends up with 1339001 groups ...

[INFO DRT-0078]   Complete 15285 pins.
...
[INFO DRT-0081]   Complete 410 unique inst patterns.
...
[INFO DRT-0084]   Complete 1339001 groups.
#scanned instances     = 3099210
#unique  instances     = 555
#stdCellGenAp          = 16776
#stdCellValidPlanarAp  = 0
#stdCellValidViaAp     = 11479
#stdCellPinNoAp        = 0
#stdCellPinCnt         = 10978298
#instTermValidViaApCnt = 0
#macroGenAp            = 19262
#macroValidPlanarAp    = 19262
#macroValidViaAp       = 0
#macroNoAp             = 0
[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 43:41:24, elapsed time = 01:33:30, memory = 37107.12 (MB), peak = 39166.98 (MB)
stefanottili commented 1 week ago

Comparing ispd19_test6 with ispd24 bsg_chip (the latter using today's pin access fix):

180k instances, 00:02:33 cpu time, ispd19 test6, ?? tech
715k instances, 03:43:15 cpu time, ispd24 bsg_chip, Nangate45 tech

Why would a design with 4x the instances require 95x more time to compute pin access? Some exponential runtime behavior in some part of the algorithm? Maybe the OBS geometry in the fakerams? Or just the different technology rules, or the MACRO PIN/OBS?

[INFO DRT-0084]   Complete 172607 groups.
#scanned instances     = 179881
#unique  instances     = 138
#stdCellGenAp          = 4656
#stdCellValidPlanarAp  = 0
#stdCellValidViaAp     = 3312
#stdCellPinNoAp        = 0
#stdCellPinCnt         = 790550
#instTermValidViaApCnt = 0
#macroGenAp            = 141404
#macroValidPlanarAp    = 141404
#macroValidViaAp       = 141392
#macroNoAp             = 0
[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 00:02:33, elapsed time = 00:00:21, memory = 4391.97 (MB), peak = 4788.41 (MB)
[INFO DRT-0084]   Complete 668651 groups.
#scanned instances     = 714938
#unique  instances     = 428
#stdCellGenAp          = 11179
#stdCellValidPlanarAp  = 0
#stdCellValidViaAp     = 7726
#stdCellPinNoAp        = 0
#stdCellPinCnt         = 2674937
#instTermValidViaApCnt = 0
#macroGenAp            = 41972
#macroValidPlanarAp    = 41972
#macroValidViaAp       = 0
#macroNoAp             = 0
[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 03:43:15, elapsed time = 00:33:14, memory = 4834.28 (MB), peak = 7545.33 (MB)
maliberty commented 1 week ago

#5051 is the next speedup (testing overnight).

stefanottili commented 1 week ago

One more datapoint: the pin access time of the mega boom groute testcase is just 47 sec for 1.7M std cells in asap7. Is it correct to assume it's that fast because the utilization is so low that hardly any std cells touch each other, and thus there are no clusters?

#scanned instances     = 1781981
#unique  instances     = 410
#stdCellGenAp          = 14360
#stdCellValidPlanarAp  = 120
#stdCellValidViaAp     = 11902
#stdCellPinNoAp        = 0
#stdCellPinCnt         = 5787310
#instTermValidViaApCnt = 0
#macroGenAp            = 21821
#macroValidPlanarAp    = 21821
#macroValidViaAp       = 0
#macroNoAp             = 0
[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 00:00:47, elapsed time = 00:00:10, memory = 5558.44 (MB), peak = 8123.77 (MB)
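The guess above can be illustrated with a toy row model: only instances whose spans abut would need to be analyzed together, so a sparse placement produces almost no pairs. The helper and all coordinates are invented for illustration.

```python
# Count abutting neighbor pairs in one placement row.
def count_abutting_pairs(row):
    """row: list of (x_lo, x_hi) instance spans, assumed non-overlapping."""
    row = sorted(row)
    # Adjacent spans form a pair only when one ends exactly where the next begins.
    return sum(1 for a, b in zip(row, row[1:]) if a[1] == b[0])

dense  = [(0, 10), (10, 25), (25, 30), (30, 42)]   # cells packed tightly
sparse = [(0, 10), (50, 65), (120, 125)]           # lots of whitespace
print(count_abutting_pairs(dense), count_abutting_pairs(sparse))  # 3 0
```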
maliberty commented 1 week ago

With #5051 I don't see any more low-hanging fruit. The runtime is spent in the drc engine, as expected. On a 32-cpu machine I see:

[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 05:30:19, elapsed time = 00:10:33, memory = 18461.00 (MB), peak = 20493.52 (MB)

The actual routing will take many elapsed hours so improving this further doesn't seem worthwhile.

maliberty commented 1 week ago

Is it correct to assume it's that fast because the utilization is so low that hardly any stdcell touch each other and thus there are no clusters ?

Most likely but I haven't looked. Your test case has unrealistically high placement density (>90%).
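Placement density as used here is just total std-cell area over core area; a minimal sketch with invented numbers:

```python
# Placement density = sum of cell areas / core area (all values invented).
def placement_density(cell_areas_um2, core_area_um2):
    return sum(cell_areas_um2) / core_area_um2

cells = [1.33, 2.66, 1.33, 5.32] * 100     # hypothetical cell areas, um^2
print(f"{placement_density(cells, 1200.0):.0%}")  # -> 89%
```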

stefanottili commented 1 week ago

These are not "my test cases"; these are the ispd24 global route contest test cases: "real world netlists, mapped to nangate45, placed by $$$ eda tool, no powergrid".

I agree that without a power grid they have unrealistically high placement density. But they can be routed, so they're valid test cases.

Your two rounds of fixes (1. using an index instead of a for loop that compares all names as strings, and 2. using a cache) took the elapsed time from three hours to 1.5 hours and down to 5 min. Great.

Pin access is using an awful lot of memory for 3M stdcells.

Here are the runtimes for mempool_group.def, which triggered this issue. Case closed.

7950X 32 threads, 56GB memory

[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 84:59:19, elapsed time = 03:09:21, memory = 37294.33 (MB), peak = 39299.85 (MB)
[INFO DRT-0267] cpu time = 43:41:24, elapsed time = 01:33:30, memory = 37107.12 (MB), peak = 39166.98 (MB) index
[INFO DRT-0267] cpu time = 03:00:23, elapsed time = 00:05:45, memory = 37104.12 (MB), peak = 39166.70 (MB) cache
maliberty commented 1 week ago

You're welcome