Open stefanottili opened 6 months ago
OpenROAD-flow-scripts is a flow that uses OpenROAD. It invokes OpenROAD many times, reading in the previous .odb file and writing out an updated one for each stage, and reports times and memory usage for each invocation.
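For illustration, each ORFS stage is roughly a small .tcl script of the following shape (simplified; the real ORFS scripts, commands, and file names differ), so time and memory can be measured around each OpenROAD invocation from the outside:

```tcl
# Simplified sketch of one ORFS-style stage script (file names illustrative):
# the wrapper that launches this OpenROAD invocation measures its time and
# peak memory, then the next stage reloads the written database.
read_db  results/3_3_place_gp.odb
detailed_placement
write_db results/3_5_place_dp.odb
```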
If I understand correctly, you describe a flow where a single .tcl script is used and the OpenROAD binary is invoked once.
OpenROAD-flow-scripts used to have a "single .tcl script" version of the flow, but it was abandoned because it wasn't being used or maintained, and there were no tests for it.
There are many flows other than OpenROAD-flow-scripts that exist on top of the OpenROAD tool. To name a few: Hammer, Zero ASIC SiliconCompiler, bazel_hdl.
None of these flows, to my knowledge, is run by a single .tcl script. The OpenROAD binary is invoked multiple times.
For these flows, OpenROAD is not the only tool whose memory and runtimes have to be tracked.
Hence, today's solution, where the measurement of memory usage and running times is outside the scope of the OpenROAD binary, seems to work well. There is a notable exception: memory consumption and running time are tracked and printed for each step within detailed routing.
So in short, absent a more specific description of the desired changes, I think today's solution is pretty good. It has evolved to the way it is for good reasons.
I have log files created years ago running ispd routing testcases, using OpenROAD global and detailed route in a single OpenROAD .tcl script, with no need to save and reload .odb databases in between. It would have been nice then, and even nicer now, to have runtime info in the logs alongside the information about which commands were run; both have changed over time.
Having multiple flows implement their own runtime logging sounds to me like an argument for putting it into the tool itself. I can't think of a P&R tool I've used over the decades that wouldn't log its runtimes.
I think it is a nice-to-have, though not a high priority. To be clear, this is command-level timing rather than flow-step-level timing, so it is a bit more fine-grained. In general OR doesn't have a unified approach to commands, so adding this will likely happen when there is some motivating need to make commands more uniform. Dependencies like sta further complicate the implementation.
Detailed routing does already report its runtime so that should be present in your old logs.
It's definitely not "high priority", but very much "nice to have". Not once have I thought "gee, I wish this tool wouldn't report its runtime or memory usage" ...
Making it easy to compare runs between versions/machines/implementations by diff'ing/grep'ing/scripting is useful. Key is to have something consistent and easy to grep for, e.g. "egrep -e CMD -e TIME openroad.log". Not for the "progress report/internal details", but only for the major, time-consuming steps.
There is a fine balance between "show some progress/info" and "keep the log file short and concise".
We could either report whenever a command takes more than X time, or just decide which commands are interesting to always report. The former will make the output somewhat non-deterministic around the threshold; the latter may report small times on small designs. I favor the latter for determinism, in which case we should make a list.
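As a minimal sketch (not OpenROAD code; command names are illustrative), the two policies differ only in the predicate:

```tcl
# Fixed-list policy (deterministic): always report a known set of
# long-running commands, regardless of measured time.
proc should_report {cmd elapsed_s} {
    set always_report {global_placement global_route detailed_route}
    return [expr {[lsearch -exact $always_report $cmd] >= 0}]
    # A threshold policy would instead be:
    #   return [expr {$elapsed_s > 10.0}]
    # which is non-deterministic for commands near the cutoff.
}
```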
A few comments:
Every “long running” command should report how long it took to run it.
In my use-case this info would have been useful to me.
So it sounds like our preferences differ and there is the question of the resources necessary to implement this request.
If long-running commands (placement, resize, global route, antenna, detailed route) reported time (in C++), this would in no way prevent other flows from ignoring this info and logging it some other way, as they do today: best of both worlds ... Considering the existing log file clutter, these few additional lines don't add much.
Would
global_place
rtl_macro_placer
repair_design
repair_timing
global_route
detailed_route
suffice? I suspect those are the big ticket items and wouldn't add too much verbosity.
repair_timing specifically can take surprisingly long, so having extra detail there would be useful.
Also report timing can be very slow.
CTS is a good example. I have seen cases where CTS takes 10 minutes and report/fix timing takes 24 hours.
Adding time to logs makes them non hermetic so this should be an opt in thing.
This is already the case for detailed routing. How do you deal with that (in bazel_hdl, presumably?)?
repair_timing does give detail with -verbose. This would just be adding an overall time message.
Here's an example of why runtime and memory consumption reporting from OpenROAD is useful. Today's OpenROAD on a machine with 32 threads, 56 GB memory + 14 GB swap, WSL2 Ubuntu:
Now I know not to try to detail_route the 3M stdcell ispd24 mempool_group testcase again.
read_def is fast. Placement density displays instantaneously. rudy takes 35 sec before the cursor switches from pointer to busy. global_route -verbose didn't report runtime/memory. detailed_route reported two runtime/memory datapoints before being killed.
OpenROAD v2.0-13573-g9e909da77
Features included (+) or not (-): +Charts +GPU +GUI +Python
...
Design: mempool_group
Die area: ( 0 0 ) ( 6767040 6762560 )
Number of track patterns: 20
...
[WARNING DRT-0419] No routing tracks pass through the center of Term gen_tiles[4].i_tile/gen_banks[6].mem_bank/genblk1.sram_instance/rd_out[0]
[WARNING DRT-0418] Term gen_tiles[4].i_tile/gen_banks[6].mem_bank/genblk1.sram_instance/rd_out[1] has no pins on routing grid
...
#scanned instances = 3099210
#unique instances = 555
#stdCellGenAp = 16776
#stdCellValidPlanarAp = 0
#stdCellValidViaAp = 11479
#stdCellPinNoAp = 0
#stdCellPinCnt = 10978298
#instTermValidViaApCnt = 0
#macroGenAp = 19262
#macroValidPlanarAp = 19262
#macroValidViaAp = 0
#macroNoAp = 0
[INFO DRT-0166] Complete pin access.
[INFO DRT-0267] cpu time = 84:59:19, elapsed time = 03:09:21, memory = 37294.33 (MB), peak = 39299.85 (MB)
....
Number of guides: 28748042
[INFO DRT-0169] Post process guides.
[INFO DRT-0176] GCELLGRID X 0 DO 1611 STEP 4200 ;
[INFO DRT-0177] GCELLGRID Y 0 DO 1610 STEP 4200 ;
...
[INFO DRT-0181] Start track assignment.
[INFO DRT-0184] Done with 8693270 vertical wires in 33 frboxes and 14344011 horizontal wires in 33 frboxes.
[INFO DRT-0186] Done with 5901687 vertical wires in 33 frboxes and 4252651 horizontal wires in 33 frboxes.
[INFO DRT-0182] Complete track assignment.
[INFO DRT-0267] cpu time = 01:01:40, elapsed time = 00:11:57, memory = 54316.77 (MB), peak = 55276.92 (MB)
[INFO DRT-0187] Start routing data preparation.
Killed
Not a great example as the command didn't complete so you wouldn't get any reporting at the end. Reporting usage during progress is not what was requested here.
I think what DRT needs is some automatic adjustment to the number of threads before it starts running in case it detects that the system memory is too low. I know of some proprietary tools which do this based on system load (% CPU) but because DRT memory scales pretty linearly with threads, it should hopefully be easy to do here.
@rovinski Interesting... I think it would be a good idea to make this improvement to detailed routing, even if it would not affect the maximum memory usage in all cases.
Example from megaboom; here the maximum memory usage for global and detailed routing is about the same. Note that megaboom runs with various things disabled, which will underestimate memory usage, though by how much I don't know.
Features are disabled mainly to rein in running times to something somewhat manageable.
| Log | Elapsed (s) | Peak memory (MB) |
|---|---:|---:|
| 1_1_yosys | 4043 | 4528 |
| 2_1_floorplan | 137 | 7208 |
| 2_2_floorplan_io | 13 | 3090 |
| 2_4_floorplan_macro | 711 | 7799 |
| 2_5_floorplan_tapcell | 529 | 2144 |
| 2_6_floorplan_pdn | 311 | 10416 |
| 3_1_place_gp_skip_io | 21 | 3365 |
| 3_2_place_iop | 28 | 4540 |
| 3_3_place_gp | 6015 | 14747 |
| 3_4_place_resized | 565 | 8064 |
| 3_5_place_dp | 1222 | 12899 |
| 4_1_cts | 449 | 13227 |
| 5_1_grt | 5246 | 59714 |
| 5_2_fillcell | 76 | 13239 |
| 5_3_route | 18100 | 60493 |
| 6_1_merge | 388 | 29685 |
| 6_report | 6442 | 40855 |
| generate_abstract | 874 | 30833 |
| Total | 45170 | 60493 |
It sounds quite fragile. The amount of free memory may change quite dynamically (particularly on a multi-user system). It also depends on how expensive swapping is (and how large the working set is). It's also hard to know exactly how much memory drt will use in advance.
I was thinking more conservatively of (threads * memory/thread) < total system memory. Not even including free memory. Beyond that, you are guaranteed to dip into swap and risk significant slowdown. In an even more conservative case, at least a warning could be issued.
Is it hard to know how much memory it uses? When I was experimenting a while ago, it seemed pretty deterministic based on the technology, design size, and number of threads.
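As a minimal sketch of that conservative check (assuming Linux /proc/meminfo and a per-thread memory estimate supplied from outside, e.g. from a previous run of the same design; this is not existing OpenROAD behavior):

```tcl
# Cap the requested thread count so that threads * est_mb_per_thread
# stays below total system RAM; warn when capping.
proc cap_threads {requested est_mb_per_thread} {
    set fh [open /proc/meminfo r]
    set txt [read $fh]
    close $fh
    if {![regexp {MemTotal:\s+(\d+) kB} $txt -> total_kb]} {
        return $requested   ;# unknown platform: leave the request alone
    }
    set total_mb [expr {$total_kb / 1024}]
    set max_threads [expr {max(1, int($total_mb / double($est_mb_per_thread)))}]
    if {$requested > $max_threads} {
        puts "\[WARNING\] $requested threads x $est_mb_per_thread MB exceeds $total_mb MB RAM, capping to $max_threads"
        return $max_threads
    }
    return $requested
}
```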
If I overestimate the usage and unnecessarily throttle the run I get a bug report. If I underestimate the memory and the run is slowed down or killed I get a bug report. I have to hit some goldilocks window every time. If you feel you can do that then please submit a PR (and plan to support it).
The user can control this via NUM_CORES now, right?
What about memory usage for the rest of the flow?
If I underestimate the memory and the run is slowed down or killed I get a bug report.
But that's currently the case anyway 😆. If introduced, hopefully it would be less.
No such promise is made today and therefore nothing needs "fixing".
I think today there is a well defined behavior where the user has to make an annoying choice that has deterministic effects: NUM_CORES.
From a bug/feature-request point of view, I think this is the least bad baseline. I can see how any change to this baseline (default behavior) practically (from a maintainer's point of view) has to be only better and never worse.
In bazel-orfs, I can easily specify NUM_CORES per stage (floorplan, place, route, etc.). The same is possible in ORFS, but requires writing a script that runs one make command per stage, as sketched below (not a big deal for a flow that is memory constrained, which probably also runs for a very long time).
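Such a per-stage driver could look roughly like this (written in Tcl for consistency with the other sketches; a shell loop would do the same, and the stage targets follow ORFS conventions but aren't verified against current ORFS):

```tcl
# Hypothetical per-stage driver: run each ORFS stage with its own
# NUM_CORES, so a memory-hungry stage can be throttled independently.
foreach {stage cores} {floorplan 16  place 16  cts 16  route 8} {
    exec make $stage NUM_CORES=$cores >@ stdout 2>@ stderr
}
```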
Perhaps the final summary report could include "marginal memory cost/core"?
The marginal memory cost per core refers to the additional memory consumption incurred by adding one more processing core to a run.
That would require some extra logging by e.g. detailed routing based upon knowledge of the memory consumption.
The user can then choose to modify NUM_CORES for the problematic stage.
Come to think of it: if the marginal memory cost per CPU can't be estimated and reported, then it isn't possible to implement a policy to select the "best" number of CPUs as constrained by available memory.
If the marginal speedup per core can be estimated (assuming it is roughly linear, for simplification) and the marginal memory cost can be estimated and reported, then it is possible for the user to decide on a NUM_CORES tradeoff manually. Currently a maximum-speed policy is in place (use all cores), and a minimum memory usage could be estimated in a column. If the peak memory usage isn't reduced (grt doesn't use parallelism, does it?), then the user would know that nothing is gained by reducing parallelism.
In fact, it would be easy to run an experiment: do a set of runs with NUM_CORES set to 1 through the number of cores and plot run times and memory usage for each stage... That should tell a lot.
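Assuming the roughly-linear model holds, the marginal cost could then be read off any two of those runs; a hypothetical helper:

```tcl
# Estimate the marginal memory cost per core from two measured runs,
# assuming peak(n) ~ base + n * marginal (roughly linear in cores).
proc marginal_mem_per_core {cores1 peak1_mb cores2 peak2_mb} {
    expr {double($peak2_mb - $peak1_mb) / ($cores2 - $cores1)}
}
# The intercept then estimates the minimum ("base") memory:
#   base ~ peak1_mb - cores1 * marginal
```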
This is all predicated on the idea that there is some fixed cost per thread and that it remains a constant across iterations. I'm not clear either is true but am open to see your results.
This issue is getting taken over by an unrelated request. If people want to pursue it then it should be a separate issue.
Hello, I would like to work on this issue; however, from the conversation it's not clear what the final request is. To summarize what I understood: for the following commands, runtime and memory usage should be logged from within the commands themselves, and the rest of the conversation is considered a separate request.
global_place
rtl_macro_placer
repair_design
repair_timing
global_route
detailed_route
Could you please confirm if this is the case?
I think that would be a good start. People can request more commands if they see long runtimes.
I'll be pretty forceful here and say this should be an option you need to enable, and not default behavior.
Why? drt already does this, how are extra commands different?
Hi all, I have started working on this one.
Description
Every “long running” command should report how long it took to run it. The goal would be to be able to easily compare log files from different runs, e.g. "vimdiff dir[12]/log", or "grep runtime dir2/log > dir2/runtime" followed by "diff dir[12]/runtime".
Suggested Solution
Currently there seems to be a way to get ORFS to report runtime and memory usage, but this relies on running the openroad executable for each and every step, and it isn't "built into" openroad; it relies on external scripting. It should be possible to have one OR .tcl script run "the whole flow" and still get this information in the log.
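Until something like this is built in, a workaround in a single-script flow could wrap each long-running command so the log has one consistent, grep-able TIME line per command. This is only a sketch: it assumes a Linux /proc filesystem for the peak-memory figure, and the wrapped command names are illustrative.

```tcl
# Peak resident memory (high-water mark) of this OpenROAD process, in MB.
proc peak_mem_mb {} {
    set fh [open /proc/self/status r]
    set txt [read $fh]
    close $fh
    if {[regexp {VmHWM:\s+(\d+) kB} $txt -> kb]} {
        return [expr {$kb / 1024}]
    }
    return -1
}

# Run a command and print one grep-able TIME line for it.
proc timed {args} {
    set t0 [clock milliseconds]
    set result [uplevel 1 $args]
    set dt [expr {([clock milliseconds] - $t0) / 1000.0}]
    puts [format "TIME %-20s %8.1f s  peak %d MB" [lindex $args 0] $dt [peak_mem_mb]]
    return $result
}

# In the single-script flow:
timed global_placement
timed detailed_route
```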
Additional Context
No response