Closed by stefanottili 1 week ago
There are various phases that try to be efficient but I would guess the last one where specific groups of abutting blocks are examined will take the longest. It wouldn't be hard to add some timers. Do you have this test case setup in ORFS?
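Such timers could be sketched like this (a minimal illustration; the phase names below are hypothetical, not OpenROAD's actual internals):

```python
import time
from contextlib import contextmanager

@contextmanager
def phase_timer(name, report):
    """Accumulate wall-clock time spent in a named phase."""
    start = time.perf_counter()
    try:
        yield
    finally:
        report[name] = report.get(name, 0.0) + time.perf_counter() - start

report = {}
with phase_timer("unique_inst_patterns", report):
    time.sleep(0.01)  # stand-in for the real work
with phase_timer("abutting_groups", report):
    time.sleep(0.02)
for name, secs in sorted(report.items()):
    print(f"{name}: {secs:.3f}s")
```

Wrapping each pin-access phase this way would show directly whether the abutting-group step dominates.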
I just ran
openroad -gui -threads max ispd24.or
Glancing at the code, I see there is a lot more info when verbose is switched on for drt; I'll try that next, and run some of the smaller test cases.
read_lef lef/Nangate.lef.gz
read_def def/bsg_chip.def.gz
set_routing_layers -signal metal2-metal8 -clock metal6-metal7
global_route -verbose
detailed_route
Where do you obtain the test case?
When I load this I don't see 3M placed instances:
[INFO ODB-0131] Created 714938 components and 4540386 component-terminals.
[INFO ODB-0133] Created 768239 nets and 2717737 connections.
Never mind, I see your script is not for this test case. It would be a lot easier if you had just packaged a test case.
I'm sorry about that; bsg_chip is the next smaller of the test cases. It shows the same behavior, but finishes pin access in 8 cpu hours with 9.5GB of memory.
It seems the "pin groups" processing is the step that takes most of the runtime and memory. bsg_chip has 1/4 the components of mempool_group, but only 1/10th the cpu time: 8 hours vs 84 hours.
[INFO ODB-0131] Created 714938 components and 4540386 component-terminals.
[INFO ODB-0133] Created 768239 nets and 2717737 connections.
[INFO ODB-0134] Finished DEF file: def/bsg_chip.def.gz
..
[INFO DRT-0084] Complete 668651 groups.
...
[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 08:31:10, elapsed time = 00:24:01, memory = 9448.54 (MB), peak = 9971.07 (MB)
...
[INFO DRT-0199] Number of violations = 0. (after 10 iterations)
...
[INFO DRT-0267] cpu time = 10:08:17, elapsed time = 00:29:32, memory = 22309.75 (MB), peak = 23984.83 (MB)
3441163 def/ariane133_68.def.gz COMPONENTS 120202 ;
3496189 def/ariane133_51.def.gz COMPONENTS 121794 ;
3700625 def/mempool_tile.def.gz COMPONENTS 128515 ;
4652916 def/nvdla.def.gz COMPONENTS 166393 ;
19112975 def/bsg_chip.def.gz COMPONENTS 714938 ;
96402592 def/mempool_group.def.gz COMPONENTS 3099210 ;
281212654 def/cluster.def.gz COMPONENTS 9876330 ;
How many threads are you using when you observe this runtime?
I've used openroad -threads max on a 16-core/32-thread 7950X. This utilizes all 32 threads, 100% cpu usage, when running "pin groups". Ubuntu on WSL with 56GB memory + 14GB swap.
Just one more data point: bsg_chip running on an M1 (4+4) MacBook with 16GB, macOS Sonoma.
[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 06:50:42, elapsed time = 01:04:41, memory = 2266.61 (MB), peak = 6531.78 (MB)
[INFO DRT-0198] Complete detail routing.
[INFO DRT-0267] cpu time = 16:22:52, elapsed time = 02:20:42, memory = 6144.62 (MB), peak = 6811.89 (MB)
Please note that on Mac, openroad's memory reporting is way off; top and Activity Monitor reported 16GB of usage.
I ran vtune and I did see one surprise in the profile. I'm testing a fix now and measuring the benefit.
With the merge of #5050, please try again and report what you see. I saw some benefit, but the cpu times I got don't match yours (perhaps different hardware?).
I have a second smaller one to look at.
Yay, quite some improvement.
Going from a loop over all named pins to look them up via an index clearly helped a lot.
Having said that, I'm pretty sure that something much better can be done algorithmically. It shouldn't take cpu hours to compute pin access points for 700K stdcells ...
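The effect of that kind of fix can be sketched generically (an illustration of the idea only, not the actual #5050 code):

```python
import timeit

# Hypothetical pin names; the point is the lookup strategy, not the data.
pins = [f"u{i}/A" for i in range(50000)]
target = pins[-1]

def find_linear(name):
    # old style: compare against every pin name, O(n) string comparisons
    for p in pins:
        if p == name:
            return p

index = {p: p for p in pins}  # built once; O(1) lookups afterwards

slow = timeit.timeit(lambda: find_linear(target), number=20)
fast = timeit.timeit(lambda: index[target], number=20)
assert find_linear(target) == index[target]
print(f"linear scan is about {slow / fast:.0f}x slower")
```

Done once per instance over millions of instances, that per-lookup difference multiplies into the cpu-hours seen above.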
7950X: 32 threads -> 4x faster cpu/elapsed
M1: 8 threads -> 2x faster cpu/elapsed.
7950X:
< [INFO DRT-0267] cpu time = 08:31:10, elapsed time = 00:24:01, memory = 9448.54 (MB), peak = 9971.07 (MB)
> [INFO DRT-0267] cpu time = 02:17:07, elapsed time = 00:06:04, memory = 9413.39 (MB), peak = 9970.04 (MB)
M1:
< [INFO DRT-0267] cpu time = 06:50:42, elapsed time = 01:04:41, memory = 2266.61 (MB), peak = 6531.78 (MB)
> [INFO DRT-0267] cpu time = 03:43:15, elapsed time = 00:33:14, memory = 4834.28 (MB), peak = 7545.33 (MB)
M1: top/Activity Monitor show 14GB memory usage. No idea why the two reports differ so much.
3M stdcells, 7950X, 32 threads: 43 cpu hours and 1.5 elapsed hours for "pin access analysis" is still way too slow.
I'm still trying to wrap my head around how this algorithm ends up with 1339001 groups ...
[INFO DRT-0078] Complete 15285 pins.
...
[INFO DRT-0081] Complete 410 unique inst patterns.
...
[INFO DRT-0084] Complete 1339001 groups.
#scanned instances = 3099210
#unique instances = 555
#stdCellGenAp = 16776
#stdCellValidPlanarAp = 0
#stdCellValidViaAp = 11479
#stdCellPinNoAp = 0
#stdCellPinCnt = 10978298
#instTermValidViaApCnt = 0
#macroGenAp = 19262
#macroValidPlanarAp = 19262
#macroValidViaAp = 0
#macroNoAp = 0
[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 43:41:24, elapsed time = 01:33:30, memory = 37107.12 (MB), peak = 39166.98 (MB)
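One assumption about where the group count comes from: if abutting instances in a row are merged into clusters, the number of groups depends on how many gaps the placement leaves. A toy sweep over one row (hypothetical; not drt's actual data structures):

```python
def group_abutting(intervals):
    """Merge touching or overlapping (x_lo, x_hi) cell extents into groups."""
    groups = []
    for lo, hi in sorted(intervals):
        if groups and lo <= groups[-1][1]:   # abuts or overlaps the last group
            groups[-1][1] = max(groups[-1][1], hi)
        else:
            groups.append([lo, hi])
    return groups

# A dense row collapses into one big group ...
assert group_abutting([(0, 2), (2, 4), (4, 6)]) == [[0, 6]]
# ... while a sparse row yields one group per cell.
assert group_abutting([(0, 2), (5, 7), (10, 12)]) == [[0, 2], [5, 7], [10, 12]]
```

Under that assumption, 1339001 groups for 3099210 instances would mean the average cluster holds only two to three cells, but the expensive cases are the long runs of abutting cells.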
Comparing ispd19_test6 with ispd24 bsg_chip (the latter using today's pin access fix):
180k instances, 00:02:33 cpu time: ispd19_test6, ?? tech
715k instances, 03:43:15 cpu time: ispd24 bsg_chip, Nangate45 tech
Why would a design with 4x the instances require 95x more time to compute pin access? Some exponential runtime behavior in part of the algorithm? Maybe the OBS geometry in the fakerams? Or just the different technology rules, or the MACRO PIN/OBS?
[INFO DRT-0084] Complete 172607 groups.
#scanned instances = 179881
#unique instances = 138
#stdCellGenAp = 4656
#stdCellValidPlanarAp = 0
#stdCellValidViaAp = 3312
#stdCellPinNoAp = 0
#stdCellPinCnt = 790550
#instTermValidViaApCnt = 0
#macroGenAp = 141404
#macroValidPlanarAp = 141404
#macroValidViaAp = 141392
#macroNoAp = 0
[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 00:02:33, elapsed time = 00:00:21, memory = 4391.97 (MB), peak = 4788.41 (MB)
[INFO DRT-0084] Complete 668651 groups.
#scanned instances = 714938
#unique instances = 428
#stdCellGenAp = 11179
#stdCellValidPlanarAp = 0
#stdCellValidViaAp = 7726
#stdCellPinNoAp = 0
#stdCellPinCnt = 2674937
#instTermValidViaApCnt = 0
#macroGenAp = 41972
#macroValidPlanarAp = 41972
#macroValidViaAp = 0
#macroNoAp = 0
[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 03:43:15, elapsed time = 00:33:14, memory = 4834.28 (MB), peak = 7545.33 (MB)
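From the two logged cpu times one can estimate the implied scaling exponent (only suggestive, since the technologies and rule sets differ):

```python
import math

# cpu times from the two runs above
t_ispd19 = 2 * 60 + 33            # 00:02:33 for 179881 instances
t_bsg = 3 * 3600 + 43 * 60 + 15   # 03:43:15 for 714938 instances

n_ratio = 714938 / 179881         # ~4x the instances
t_ratio = t_bsg / t_ispd19        # ~88x the cpu time
k = math.log(t_ratio) / math.log(n_ratio)
print(f"time ratio {t_ratio:.0f}x, implied exponent ~{k:.1f}")
```

An exponent well above 1 would point at something superlinear, e.g. cluster sizes growing with density, rather than a fixed per-instance cost.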
One more data point: the pin access time of the mega boom groute test case is just 47 sec for 1.7M std cells on asap7. Is it correct to assume it's that fast because the utilization is so low that hardly any stdcells touch each other, and thus there are no clusters?
#scanned instances = 1781981
#unique instances = 410
#stdCellGenAp = 14360
#stdCellValidPlanarAp = 120
#stdCellValidViaAp = 11902
#stdCellPinNoAp = 0
#stdCellPinCnt = 5787310
#instTermValidViaApCnt = 0
#macroGenAp = 21821
#macroValidPlanarAp = 21821
#macroValidViaAp = 0
#macroNoAp = 0
[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 00:00:47, elapsed time = 00:00:10, memory = 5558.44 (MB), peak = 8123.77 (MB)
With #5051 I don't see any more low-hanging fruit. The runtime is spent in the drc engine, as expected. On a 32-cpu machine I see
[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 05:30:19, elapsed time = 00:10:33, memory = 18461.00 (MB), peak = 20493.52 (MB)
The actual routing will take many elapsed hours so improving this further doesn't seem worthwhile.
Is it correct to assume it's that fast because the utilization is so low that hardly any stdcells touch each other, and thus there are no clusters?
Most likely but I haven't looked. Your test case has unrealistically high placement density (>90%).
These are not "my test cases"; these are the ispd24 global route contest test cases: "real world netlists, mapped to nangate45, placed by $$$ eda tool, no powergrid".
I agree that without a power grid they have unrealistically high placement density. But they can be routed, so they're valid test cases.
Your two rounds of fixes (1. use an index instead of a for loop comparing all names as strings, and 2. use a cache) took the elapsed time from three hours, to 1.5 hours, down to 5 min. Great.
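The second fix can be sketched as memoization keyed on the unique instance pattern (hypothetical names; the real change lives in the drt C++ code):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def access_points_for(master, orient):
    # stand-in for the expensive per-pattern DRC-driven analysis
    return f"APs({master},{orient})"

# Millions of instances reduce to a few hundred unique (master, orient)
# patterns, so the expensive work runs once per pattern, not per instance.
instances = [("INVX1", "N"), ("INVX1", "N"), ("NAND2X1", "FS"), ("INVX1", "N")]
for master, orient in instances:
    access_points_for(master, orient)
info = access_points_for.cache_info()
print(info.hits, info.misses)  # → 2 2
```

With only 555 unique instances reported for mempool_group, the cache hit rate on 3M instances is essentially 100%.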
Pin access is using an awful lot of memory for 3M stdcells.
Here are the runtimes for mempool_group.def, which triggered this issue. Case closed.
7950X 32 threads, 56GB memory
[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 84:59:19, elapsed time = 03:09:21, memory = 37294.33 (MB), peak = 39299.85 (MB)
[INFO DRT-0267] cpu time = 43:41:24, elapsed time = 01:33:30, memory = 37107.12 (MB), peak = 39166.98 (MB) index
[INFO DRT-0267] cpu time = 03:00:23, elapsed time = 00:05:45, memory = 37104.12 (MB), peak = 39166.70 (MB) cache
You're welcome
Description
85 cpu hours to create pin access for 3M placed instances sounds like a lot. I'm trying to get a feeling for what the runtime is spent on, but currently the individual steps outlined in the paper don't report their runtime. Who would know? Any low-hanging fruit to speed up this step?
https://vlsicad.ucsd.edu/Publications/Journals/j133.pdf
Is this "for each pin" for each pin of every cell, or for each pin of every instance?
Suggested Solution
No response
Additional Context
No response