The-OpenROAD-Project / OpenROAD

OpenROAD's unified application implementing an RTL-to-GDS Flow. Documentation at https://openroad.readthedocs.io/en/latest/
https://theopenroadproject.org/
BSD 3-Clause "New" or "Revised" License

ispd19_test6 detail route runtime increased from 20min to 50min (2022->2023) #3535

Closed stefanottili closed 1 year ago

stefanottili commented 1 year ago

Describe the bug

This is a follow-up to https://github.com/The-OpenROAD-Project/OpenROAD/issues/1888. That issue was closed because ispd19_test6 now routes without any violations; in May 2022 it left one macro pin violation.

When I checked that it routes without error, I noticed that the runtime went from 30min in May 2022 to 50min in June 2023.

May 2022: [INFO DRT-0267] cpu time = 03:30:03, elapsed time = 00:29:38, memory = 6991.02 (MB), peak = 7285.50 (MB)
June 2023: [INFO DRT-0267] cpu time = 05:53:41, elapsed time = 00:50:15, memory = 7384.67 (MB), peak = 8466.27 (MB)

Same guides, very similar wire length.

In 2022, reading route guides was an option of detailed_route; in 2023 it is a separate command. The only difference I noticed is that the GCELLGRID is different, even though none of the inputs changed.

May2022: [INFO DRT-0176] GCELLGRID X -100 DO 905 STEP 3000 ;
June2023: [INFO DRT-0176] GCELLGRID X 0 DO 905 STEP 3000 ;

Expected Behavior

It might be worth looking into what caused the runtime to go from 30min to 50min.

log_read_guide_Jun2023.gz
log_read_guide_May2022.gz

Environment

OpenROAD built locally on Jun 26, 2023; M1 16GB, macOS

To Reproduce

read_lef ispd19_test6/ispd19_test6.input.lef.gz
read_def ispd19_test6/ispd19_test6.input.def.gz
set_thread_count 8
read_guides ispd19_test6/ispd19_test6.guide
detailed_route -output_drc 5_route_drc.rpt -verbose 1

Relevant log output

No response

Screenshots

No response

Additional Context

No response

stefanottili commented 1 year ago

2022: 20min route guides
2022: 30min global route

2023: 50min route guides
2023: 47min global route, using:
set_routing_layers -signal Metal2-Metal9 -clock Metal2-Metal9
set_global_routing_layer_adjustment Metal2-Metal9 0.5

maliberty commented 1 year ago

I'll take a look, but it is less compelling since the old result was worse than the new one (1 DRV vs 0).

stefanottili commented 1 year ago

I completely understand the "correct routing first" mantra.

But... a 2.5x runtime increase to fix a single macro pin access? That's way too large an increase not to call it a regression.

Whoever is working on the detail router should at least be aware of what the additional runtime is spent on. After all, it's the same technology and the same route guides that went from 20min to 50min.

maliberty commented 1 year ago

With a one year window and a moderately long running test case, it will take a lot to isolate this to any particular change.

It is possible that the old results were actually more incorrect if run through a signoff DRC engine rather than relying on the drt-reported errors (we find and fix missed violations occasionally).

maliberty commented 1 year ago

I ran just iter0 and looked at the DRCs. In one area in the old build (22b748a8b9add36813675bcbbcf482b6d264499e) it looks reasonable: (screenshot)

while in the new build (3b67f4e01529f6add7eccb20e952fa1140ff972e) there are a lot of weird violations and jogs: (screenshot)

@osamahammad21 any ideas?

QuantamHD commented 1 year ago

Should we grab a Linux perf profile of both and then diff them?

maliberty commented 1 year ago

I think it is more algorithmic. I notice the oddities are periodic, and the period nearly matches the worker size, which makes me think it is a boundary condition.

This diff could be related

May2022: [INFO DRT-0176] GCELLGRID X -100 DO 905 STEP 3000 ;
June2023: [INFO DRT-0176] GCELLGRID X 0 DO 905 STEP 3000 ;
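For reference, a small illustration (my own sketch, not OpenROAD code) of why this matters: the DEF statement `GCELLGRID X <origin> DO <count> STEP <step>` implies grid lines at origin + i*step in database units, so shifting the origin moves every gcell boundary, changing how fixed-size routing windows line up with the grid.

```python
# Illustrative only: compute the gcell boundaries implied by the two
# GCELLGRID lines above and show that the origin change shifts every
# boundary, so same-size detail-route workers cut on different lines.

def gcell_boundaries(origin, count, step):
    """Grid lines for 'GCELLGRID X <origin> DO <count> STEP <step>' (DEF dbu)."""
    return [origin + i * step for i in range(count)]

may_2022 = gcell_boundaries(-100, 905, 3000)   # GCELLGRID X -100 DO 905 STEP 3000
june_2023 = gcell_boundaries(0, 905, 3000)     # GCELLGRID X 0 DO 905 STEP 3000

# Every boundary moves by 100 dbu, so routing windows that used to end
# exactly on a gcell line may now straddle it.
shifts = {b2 - b1 for b1, b2 in zip(may_2022, june_2023)}
print(shifts)  # {100}
```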

maliberty commented 1 year ago

The worker is 10.5um: (screenshot)

maliberty commented 1 year ago

net100630 is a good example of the problem: (screenshot)

maliberty commented 1 year ago

The problem originates in GlobalRouter::updateDbCongestionFromGuides(), which we only call when guides are read from a file. This affects the ispd benchmarks but not ORFS flows. That said, we'll fix it.

maliberty commented 1 year ago

Thanks for noticing this issue and keeping such old results to compare against.

stefanottili commented 1 year ago

a) I'm not sure why this issue got closed; see the attached log from today, Jun 28, after #3545.

May 2022: 20min
Jun 2023: 50min
Jun 28th 2023: 61min

[INFO DRT-0267] cpu time = 03:30:03, elapsed time = 00:29:38, memory = 6991.02 (MB), peak = 7285.50 (MB)
[INFO DRT-0267] cpu time = 05:53:41, elapsed time = 00:50:15, memory = 7384.67 (MB), peak = 8466.27 (MB)
[INFO DRT-0267] cpu time = 03:47:36, elapsed time = 01:01:44, memory = 7581.58 (MB), peak = 10308.50 (MB)

Interestingly, the cpu time is only 15min larger than in May 2022, at the cost of 3GB more memory, but the elapsed time as of today is 2x (same M1 16GB, nothing else running on it).

b) Do you track runtime in CI?

This runtime increase for an ispd19 test case that runs as part of your CI shouldn't come as a surprise, right?
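Even a crude per-run check would catch a 2.5x slowdown. A hypothetical sketch (the function names and the 1.5x threshold are my own, not an existing OpenROAD script) that parses the final DRT-0267 summary line from two logs and flags a gross elapsed-time regression:

```python
import re

# Matches the run-summary lines drt prints, e.g.:
# [INFO DRT-0267] cpu time = 03:30:03, elapsed time = 00:29:38, memory = ...
DRT_RE = re.compile(r"\[INFO DRT-0267\] cpu time = (\d+):(\d+):(\d+), "
                    r"elapsed time = (\d+):(\d+):(\d+)")

def elapsed_seconds(log_text):
    """Elapsed seconds of the last DRT-0267 line (the whole-run summary)."""
    last = DRT_RE.findall(log_text)[-1]
    h, m, s = (int(x) for x in last[3:6])
    return h * 3600 + m * 60 + s

def regressed(old_log, new_log, threshold=1.5):
    """True if the new run is more than `threshold` times slower."""
    return elapsed_seconds(new_log) > threshold * elapsed_seconds(old_log)

may_2022 = "[INFO DRT-0267] cpu time = 03:30:03, elapsed time = 00:29:38, memory = 6991.02 (MB), peak = 7285.50 (MB)"
jun_2023 = "[INFO DRT-0267] cpu time = 05:53:41, elapsed time = 00:50:15, memory = 7384.67 (MB), peak = 8466.27 (MB)"
print(regressed(may_2022, jun_2023))  # True: ~50min vs ~30min
```

As maliberty notes below, wall-clock times on shared CI machines are noisy, so a loose threshold like this only catches gross regressions, not small drifts.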

c) How do these runtimes compare against CI or your developer machines?

log_read_guide_Jun28_2023.gz

maliberty commented 1 year ago

It ran in time comparable to the old result for me. I no longer see the GCELLGRID difference. Is that resolved in your run?

Runtime CIs tend to be very noisy due to varying machine loads. We don't have one, but perhaps we need one to catch gross errors.

I run on a 40-cpu machine and see:

2023:

[INFO DRT-0267] cpu time = 05:20:09, elapsed time = 00:15:21, memory = 6519.86 (MB), peak = 13292.98 (MB)

2022:

[INFO DRT-0267] cpu time = 04:36:15, elapsed time = 00:14:27, memory = 7435.48 (MB), peak = 11682.37 (MB)

stefanottili commented 1 year ago

[INFO DRT-0176] GCELLGRID X -100 DO 905 STEP 3000 ; 

Same GCELLGRID for May 2022 and Jun 28, 2023, so I'm running with the fix for read_guides. The May 2022 runtime was 20min for read_guide (and 30min for global_route).

Q: What was your 40-cpu runtime before #3545?

I ran again; now I'm getting 70min vs 61min this morning. See the attached log. There is quite some variation in elapsed time during the 1st-3rd iterations, while the cpu time seems close. I didn't expect that kind of variation on a single-user machine, especially since a diff shows that the number of violations is identical between both runs.

But 61min or 70min runtimes are still a lot longer than the 30min in 2022 (same M1 16GB, updated macOS). Since May 2022 vs Jun 28, 2023 shows similar runtimes for you, I guess we'll call it a day.

gunzip -c log_read_guide_Jun28_2023.gz | grep "cpu tim"
[INFO DRT-0267] cpu time = 00:03:20, elapsed time = 00:00:28, memory = 2870.88 (MB), peak = 3251.95 (MB)
[INFO DRT-0267] cpu time = 00:04:02, elapsed time = 00:00:50, memory = 4577.19 (MB), peak = 7858.73 (MB)
[INFO DRT-0267] cpu time = 00:00:01, elapsed time = 00:00:01, memory = 5164.72 (MB), peak = 7858.73 (MB)
[INFO DRT-0267] cpu time = 01:32:06, elapsed time = 00:25:31, memory = 8547.78 (MB), peak = 10308.50 (MB)
[INFO DRT-0267] cpu time = 01:13:28, elapsed time = 00:12:04, memory = 8093.03 (MB), peak = 10308.50 (MB)
[INFO DRT-0267] cpu time = 00:46:28, elapsed time = 00:21:40, memory = 8508.78 (MB), peak = 10308.50 (MB)
[INFO DRT-0267] cpu time = 00:15:04, elapsed time = 00:02:16, memory = 7538.39 (MB), peak = 10308.50 (MB)
[INFO DRT-0267] cpu time = 00:00:24, elapsed time = 00:00:07, memory = 7581.31 (MB), peak = 10308.50 (MB)
[INFO DRT-0267] cpu time = 00:00:01, elapsed time = 00:00:01, memory = 7581.45 (MB), peak = 10308.50 (MB)
[INFO DRT-0267] cpu time = 00:00:01, elapsed time = 00:00:00, memory = 7581.52 (MB), peak = 10308.50 (MB)
[INFO DRT-0267] cpu time = 00:00:00, elapsed time = 00:00:00, memory = 7581.56 (MB), peak = 10308.50 (MB)
[INFO DRT-0267] cpu time = 03:47:36, elapsed time = 01:01:44, memory = 7581.58 (MB), peak = 10308.50 (MB)

gunzip -c log_read_guide_Jun28_2023_second_run.gz | grep "cpu tim"
[INFO DRT-0267] cpu time = 00:03:03, elapsed time = 00:00:25, memory = 2871.16 (MB), peak = 3246.22 (MB)
[INFO DRT-0267] cpu time = 00:03:41, elapsed time = 00:00:45, memory = 4717.86 (MB), peak = 7948.14 (MB)
[INFO DRT-0267] cpu time = 00:00:01, elapsed time = 00:00:01, memory = 5391.06 (MB), peak = 7948.14 (MB)
[INFO DRT-0267] cpu time = 01:25:52, elapsed time = 00:18:12, memory = 8039.05 (MB), peak = 8938.23 (MB)
[INFO DRT-0267] cpu time = 01:09:07, elapsed time = 00:42:37, memory = 7382.05 (MB), peak = 8983.64 (MB)
[INFO DRT-0267] cpu time = 00:48:51, elapsed time = 00:06:51, memory = 8021.25 (MB), peak = 8983.64 (MB)
[INFO DRT-0267] cpu time = 00:15:44, elapsed time = 00:02:21, memory = 7642.38 (MB), peak = 8983.64 (MB)
[INFO DRT-0267] cpu time = 00:00:24, elapsed time = 00:00:07, memory = 7704.36 (MB), peak = 8983.64 (MB)
[INFO DRT-0267] cpu time = 00:00:01, elapsed time = 00:00:01, memory = 7704.55 (MB), peak = 8983.64 (MB)
[INFO DRT-0267] cpu time = 00:00:01, elapsed time = 00:00:00, memory = 7704.59 (MB), peak = 8983.64 (MB)
[INFO DRT-0267] cpu time = 00:00:00, elapsed time = 00:00:00, memory = 7704.62 (MB), peak = 8983.64 (MB)
[INFO DRT-0267] cpu time = 03:40:04, elapsed time = 01:10:13, memory = 7704.64 (MB), peak = 8983.64 (MB)

log_read_guide_Jun28_2023_second_run.gz

QuantamHD commented 1 year ago

@stefanottili This might sound like a joke, but it isn't: can you put your MacBook in your fridge/freezer and see if that has any effect on the runtime? Your MacBook might be thermal throttling, and OpenROAD may be doing something today that triggers that.

maliberty commented 1 year ago

I don't have the log, but it was more in line with your >2x slowdown.

stefanottili commented 1 year ago

[INFO DRT-0267] cpu time = 04:20:38, elapsed time = 00:37:03, memory = 7934.11 (MB), peak = 9004.53 (MB)

Wow, I didn't expect this. No need for the fridge (even though the machine got "hand warm").

After a shutdown/reboot, I got 37min elapsed time, despite 4h20 vs 3h40 cpu time. The number of violations per iteration is identical. So in conclusion: don't trust any runtimes on an M1 16GB MacBook Air running macOS Ventura without a reboot!

log_read_guide_Jun28_2023_third_run.gz