The-OpenROAD-Project / OpenROAD

OpenROAD's unified application implementing an RTL-to-GDS Flow. Documentation at https://openroad.readthedocs.io/en/latest/
https://theopenroadproject.org/
BSD 3-Clause "New" or "Revised" License

Long running OR commands should report their runtime #5003

Open stefanottili opened 3 weeks ago

stefanottili commented 3 weeks ago

Description

Every “long running” command should report how long it took to run. The goal is to make it easy to compare log files from different runs, e.g. `vimdiff dir[12]/log`, or `grep runtime dir2/log > dir2/runtime` followed by `diff dir[12]/runtime`.

Suggested Solution

Currently there seems to be a way to get ORFS to report runtime and memory usage, but it relies on running the openroad executable for each and every step, and on external scripting; it is not built into openroad itself. It should be possible to have one OR Tcl script run the whole flow and still get this information in the log.

Log                            Elapsed seconds Peak Memory/KB
1_1_yosys                                  243         971412
2_1_floorplan                               27         510688
2_2_floorplan_io                             4         363812
2_3_floorplan_tdms                           4         362360
2_4_floorplan_macro                          4         367364
2_5_floorplan_tapcell                        4         325804
2_6_floorplan_pdn                            6         381020
3_1_place_gp_skip_io                        38         461672
3_2_place_iop                                4         372996
3_3_place_gp                               613        1048224
3_4_place_resized                           94         602152
3_5_place_dp                                78         657092
4_1_cts                                    152         747132
5_1_grt                                   6137        1298604
5_2_fillcell                                 5         494576
5_3_route                                  543       11470100
6_1_merge                                   15         815860
6_report                                   198        2751824
Total                                     8169       11470100
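
To make the suggestion concrete, here is a minimal C++ sketch (hypothetical helper, not existing OpenROAD code) of how a single process could emit one such line per step, using std::chrono and POSIX getrusage():

```cpp
#include <sys/resource.h>

#include <chrono>
#include <cstdio>
#include <string>

// Hypothetical RAII reporter: prints "step  elapsed-seconds  peak-KB"
// when it goes out of scope at the end of a flow step.
class StepReport {
 public:
  explicit StepReport(std::string name)
      : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}

  ~StepReport() {
    const auto elapsed = std::chrono::duration_cast<std::chrono::seconds>(
                             std::chrono::steady_clock::now() - start_)
                             .count();
    rusage ru{};
    getrusage(RUSAGE_SELF, &ru);
    // Caveat: ru_maxrss (KB on Linux) is the process-lifetime peak, so in a
    // single-process flow each line shows the running peak, not a per-step one.
    std::printf("%-30s %15lld %14ld\n", name_.c_str(),
                static_cast<long long>(elapsed), ru.ru_maxrss);
  }

 private:
  std::string name_;
  std::chrono::steady_clock::time_point start_;
};

// Usage: { StepReport r("2_1_floorplan"); /* ... run the step ... */ }
```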

Additional Context

No response

oharboe commented 2 weeks ago

OpenROAD-flow-scripts is a flow that uses OpenROAD. It is based on invoking OpenROAD many times, reading in the previous .odb file, writing out an updated one for each stage, and reporting times and memory usage.

If I understand correctly, you describe a flow where a single .tcl script is used and the OpenROAD binary is invoked once.

OpenROAD-flow-scripts used to have a "single .tcl script" version of the flow, but it was abandoned because it wasn't being used or maintained, and there were no tests for it.

There are many flows other than OpenROAD-flow-scripts that exist on top of the OpenROAD tool. To name a few: Hammer, Zero ASIC SiliconCompiler, bazel_hdl.

None of these flows, to my knowledge, is run by a single .tcl script. The OpenROAD binary is invoked multiple times.

For these flows, OpenROAD is not the only tool whose memory usage and running times have to be tracked.

Hence, today's solution, where the measurement of memory usage and running times is outside the scope of the OpenROAD binary, seems to work well. There are some notable exceptions: memory consumption and running time are tracked and printed for each step within detailed routing.

So in short, absent a more specific description of the proposed changes, I think that today's solution is pretty good. It has evolved to the way it is for good reasons.

stefanottili commented 2 weeks ago

I have log files from years ago, from running the ISPD routing testcases using OpenROAD global and detailed route in a single OpenROAD Tcl script, with no need to save and reload .odb databases in between. It would have been nice then, and even nicer now, to have runtime info in the logs alongside which commands were run; both have changed over time.

Having multiple flows each implement their own runtime logging sounds to me like an argument for putting it into the tool itself. I can't think of a P&R tool I've used over the decades that didn't log its runtimes.

maliberty commented 2 weeks ago

I think it is a nice-to-have, though not a high priority. To be clear, this is command-level timing rather than flow-step-level timing, so it is a bit more fine-grained. In general OR doesn't have a unified approach to commands, so adding this will likely happen when there is some motivating need to make commands more uniform. Dependencies like sta further complicate the implementation.

Detailed routing does already report its runtime so that should be present in your old logs.

stefanottili commented 2 weeks ago

It's definitely not "high priority", but very much "nice to have". Not once have I thought "gee, I wish this tool wouldn't report its runtime or memory usage" ...

Making it easy to compare runs between versions/machines/implementations by diff'ing/grep'ing/scripting is useful. The key is to have something consistent and easy to grep for, e.g. `egrep -e CMD -e TIME openroad.log`: not the progress reports/internal details, only the major, time-consuming steps.

There is a fine balance between "show some progress/info" and "keep the log file short and concise".

maliberty commented 2 weeks ago

We could either report whenever a command takes more than X time, or just decide which commands are interesting enough to always report. The former will make the output somewhat non-deterministic around the threshold; the latter may report small times on small designs. I favor the latter for determinism, in which case we should make a list.

rovinski commented 2 weeks ago

A few comments:

stefanottili commented 2 weeks ago

Every “long running” command should report how long it took to run it.

In my use-case this info would have been useful to me.

So it sounds like our preferences differ and there is the question of the resources necessary to implement this request.

If the long-running commands (placement, resize, global route, antenna, detailed route) reported their time (in C++), this would in no way prevent other flows from ignoring this info and logging it some other way, as today: best of both worlds ... Considering the existing log file clutter, these few additional lines wouldn't add much.

maliberty commented 2 weeks ago

Would

global_place
rtl_macro_placer 
repair_design
repair_timing
global_route
detailed_route

suffice? I suspect those are the big ticket items and wouldn't add too much verbosity.
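
As a sketch of what that could look like (helper names and the `[TIME]` tag are hypothetical, not existing OpenROAD output), an allowlist keeps the reporting deterministic and gives a consistent grep target:

```cpp
#include <chrono>
#include <cstdio>
#include <functional>
#include <set>
#include <string>

// Fixed allowlist: always report these, never anything else, so the
// output is deterministic regardless of how fast a given design runs.
static const std::set<std::string> kTimedCommands = {
    "global_place", "rtl_macro_placer", "repair_design",
    "repair_timing", "global_route",    "detailed_route"};

void runCommand(const std::string& name, const std::function<void()>& body) {
  const auto start = std::chrono::steady_clock::now();
  body();
  if (kTimedCommands.count(name)) {
    const auto secs = std::chrono::duration_cast<std::chrono::seconds>(
                          std::chrono::steady_clock::now() - start)
                          .count();
    // One consistent, grep-able line per big-ticket command.
    std::printf("[TIME] %s: %lld s\n", name.c_str(),
                static_cast<long long>(secs));
  }
}
```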

oharboe commented 2 weeks ago

repair_timing specifically can take surprisingly long, so having extra detail there would be useful.

Also, report timing can be very slow.

CTS is a good example: I have seen cases where CTS takes 10 minutes and report/fix timing takes 24 hours.

QuantamHD commented 2 weeks ago

Adding time to logs makes them non-hermetic, so this should be an opt-in thing.

oharboe commented 2 weeks ago

Adding time to logs makes them non-hermetic, so this should be an opt-in thing.

This is already the case for detailed routing. How do you deal with that (in bazel_hdl, presumably)?

maliberty commented 2 weeks ago

repair_timing does give detail with -verbose. This would just be adding an overall time message.

stefanottili commented 2 weeks ago

Here's an example of why runtime and memory consumption reporting from openroad is useful. Today's openroad on a machine with 32 threads, 56 GB memory + 14 GB swap, WSL2 Ubuntu:

Now I know not to try to detail_route the 3M stdcell ispd24 mempool_group testcase again.

read_def is fast. The placement density display is instantaneous. rudy takes 35 sec before the cursor switches from pointer to busy. global_route -verbose didn't report runtime/memory. detailed_route reported two runtime/memory datapoints before being killed.

OpenROAD v2.0-13573-g9e909da77
Features included (+) or not (-): +Charts +GPU +GUI +Python
...
Design:                   mempool_group
Die area:                 ( 0 0 ) ( 6767040 6762560 )
Number of track patterns: 20
...
[WARNING DRT-0419] No routing tracks pass through the center of Term gen_tiles\[4\].i_tile/gen_banks\[6\].mem_bank/genblk1.sram_instance/rd_out[0]
[WARNING DRT-0418] Term gen_tiles\[4\].i_tile/gen_banks\[6\].mem_bank/genblk1.sram_instance/rd_out[1] has no pins on routing grid
...
#scanned instances     = 3099210
#unique  instances     = 555
#stdCellGenAp          = 16776
#stdCellValidPlanarAp  = 0
#stdCellValidViaAp     = 11479
#stdCellPinNoAp        = 0
#stdCellPinCnt         = 10978298
#instTermValidViaApCnt = 0
#macroGenAp            = 19262
#macroValidPlanarAp    = 19262
#macroValidViaAp       = 0
#macroNoAp             = 0
[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 84:59:19, elapsed time = 03:09:21, memory = 37294.33 (MB), peak = 39299.85 (MB)
....
Number of guides:     28748042

[INFO DRT-0169] Post process guides.
[INFO DRT-0176] GCELLGRID X 0 DO 1611 STEP 4200 ;
[INFO DRT-0177] GCELLGRID Y 0 DO 1610 STEP 4200 ;
...
[INFO DRT-0181] Start track assignment.
[INFO DRT-0184] Done with 8693270 vertical wires in 33 frboxes and 14344011 horizontal wires in 33 frboxes.
[INFO DRT-0186] Done with 5901687 vertical wires in 33 frboxes and 4252651 horizontal wires in 33 frboxes.
[INFO DRT-0182] Complete track assignment.
[INFO DRT-0267] cpu time = 01:01:40, elapsed time = 00:11:57, memory = 54316.77 (MB), peak = 55276.92 (MB)
[INFO DRT-0187] Start routing data preparation.
Killed

maliberty commented 2 weeks ago

Not a great example as the command didn't complete so you wouldn't get any reporting at the end. Reporting usage during progress is not what was requested here.

rovinski commented 2 weeks ago

I think what DRT needs is some automatic adjustment of the number of threads before it starts running, in case it detects that system memory is too low. I know of some proprietary tools which do this based on system load (% CPU), but because DRT memory scales pretty linearly with the number of threads, it should hopefully be easy to do here.

oharboe commented 2 weeks ago

@rovinski Interesting... I think it would be a good idea to make this improvement to detailed routing, even if it would not affect the maximum memory usage in all cases.

Example from megaboom; here the maximum memory usage for global and detailed routing is about the same. Note that megaboom runs with various things disabled, which will underestimate memory usage, though by how much I don't know.

Features are disabled mainly to rein in running times to something somewhat manageable.

Log                            Elapsed seconds Peak Memory/MB
1_1_yosys                                 4043           4528
2_1_floorplan                              137           7208
2_2_floorplan_io                            13           3090
2_4_floorplan_macro                        711           7799
2_5_floorplan_tapcell                      529           2144
2_6_floorplan_pdn                          311          10416
3_1_place_gp_skip_io                        21           3365
3_2_place_iop                               28           4540
3_3_place_gp                              6015          14747
3_4_place_resized                          565           8064
3_5_place_dp                              1222          12899
4_1_cts                                    449          13227
5_1_grt                                   5246          59714
5_2_fillcell                                76          13239
5_3_route                                18100          60493
6_1_merge                                  388          29685
6_report                                  6442          40855
generate_abstract                          874          30833
Total                                    45170          60493

maliberty commented 2 weeks ago

It sounds quite fragile. The amount of free memory may change quite dynamically (particularly on a multi-user system). It also depends on how expensive swapping is (and how large the working set is). It's also hard to know exactly how much memory drt will use in advance.

rovinski commented 2 weeks ago

I was thinking more conservatively of (threads * memory/thread) < total system memory. Not even including free memory. Beyond that, you are guaranteed to dip into swap and risk significant slowdown. In an even more conservative case, at least a warning could be issued.

Is it hard to know how much memory it uses? When I was experimenting a while ago, it seemed pretty deterministic based on the technology, design size, and number of threads.
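
A conservative sketch of that check (hypothetical helper, not DRT code); per the discussion above, the hard part is coming up with the memory-per-thread estimate in the first place:

```cpp
#include <unistd.h>

#include <algorithm>
#include <cstdio>

// Cap the thread count so threads * memory/thread stays below total RAM,
// per the inequality above. mem_per_thread_bytes must be estimated from
// technology/design size, which is the contested part.
int capThreadsByMemory(int requested_threads, long mem_per_thread_bytes) {
  const long total_bytes =
      sysconf(_SC_PHYS_PAGES) * sysconf(_SC_PAGE_SIZE);  // total system memory
  const int max_threads =
      static_cast<int>(std::max(1L, total_bytes / mem_per_thread_bytes));
  if (requested_threads > max_threads) {
    // The "even more conservative" option: warn instead of (or before) capping.
    std::printf("[WARNING] %d threads x %ld bytes/thread exceeds %ld bytes of "
                "RAM; capping to %d threads\n",
                requested_threads, mem_per_thread_bytes, total_bytes,
                max_threads);
  }
  return std::min(requested_threads, max_threads);
}
```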

maliberty commented 2 weeks ago

If I overestimate the usage and unnecessarily throttle the run I get a bug report. If I underestimate the memory and the run is slowed down or killed I get a bug report. I have to hit some goldilocks window every time. If you feel you can do that then please submit a PR (and plan to support it).

oharboe commented 2 weeks ago

The user can control this via NUM_CORES now, right?

What about memory usage for the rest of the flow?

rovinski commented 2 weeks ago

If I underestimate the memory and the run is slowed down or killed I get a bug report.

But that's currently the case anyway 😆. If introduced, hopefully it would happen less.

maliberty commented 2 weeks ago

No such promise is made today and therefore nothing needs "fixing".

oharboe commented 2 weeks ago

If I underestimate the memory and the run is slowed down or killed I get a bug report.

But that's currently the case anyway 😆. If introduced, hopefully it would happen less.

I think today there is a well-defined behavior where the user has to make an annoying choice that has deterministic effects: NUM_CORES.

From a bug/feature-request point of view, I think this is the least bad baseline. I can see how any change to this baseline (default behavior) practically (from a maintainer's point of view) has to be only better and never worse.

In bazel-orfs, I can easily specify NUM_CORES per stage (floorplan, place, route, etc.). The same is possible in ORFS, but it requires writing a script that runs one make command per stage (not a big deal for a flow that is memory constrained, which probably also runs for a very long time).

Perhaps the final summary report could include "marginal memory cost/core"?

The marginal memory cost per core refers to the additional memory consumption incurred by adding one more processing core to a system.

That would require some extra logging by e.g. detailed routing, based upon knowledge of its memory consumption.

The user can then choose to modify NUM_CORES for the problematic stage.

oharboe commented 2 weeks ago

Come to think of it: if the marginal memory cost per CPU can't be estimated and reported, then it isn't possible to implement a policy to select the "best" number of CPUs as constrained by available memory.

If the marginal speedup per core can be estimated (assuming roughly linear, for simplification) and the marginal memory cost can be estimated and reported, then the user can decide on a NUM_CORES tradeoff manually. Currently a maximum-speed policy is in place (use all cores), and a minimum memory usage could be estimated in a column. If the peak memory usage isn't reduced (grt doesn't use parallelism, does it?), then the user would know that nothing is gained as such by reducing parallelism.
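
Roughly, assuming near-linear speedup and memory that grows linearly with thread count (the assumptions questioned below):

$$T(n) \approx \frac{T_1}{n}, \qquad M(n) \approx M_0 + n \cdot m_{core}$$

so the largest useful NUM_CORES under a memory budget $M_{avail}$ would be $n^* = \min(n_{max}, \lfloor (M_{avail} - M_0) / m_{core} \rfloor)$, where $M_0$ (thread-independent memory) and $m_{core}$ (marginal memory per core) are exactly the quantities that would have to be measured and reported.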

oharboe commented 1 week ago

In fact, it would be easy to run an experiment: do a bunch of runs with NUM_CORES set to 1 through the number of cores, and plot run times and memory usage for each stage... That should tell a lot...

maliberty commented 1 week ago

This is all predicated on the idea that there is some fixed cost per thread and that it remains constant across iterations. I'm not clear that either is true, but am open to seeing your results.

This issue is getting taken over by an unrelated request. If people want to pursue it then it should be a separate issue.