YosysHQ / yosys

Yosys Open SYnthesis Suite
https://yosyshq.net/yosys/
ISC License
3.3k stars 860 forks source link

cost: add keep_hierarchy pass with max_cost argument #4344

Open widlarizer opened 2 months ago

widlarizer commented 2 months ago

Modules being flattened improves QoR in practice. It also makes the yosys runtime take much longer.

This PR creates cost.cc with linear cost models for almost all internal cell types to estimate the size of a module after techmapping. To get most of these numbers, I used a modified version of the test_cell command, see emil/gather-cell-size.

This PR also adds the keep_hierarchy pass which marks all selected modules with that attribute and has an optional -max_cost integer argument which sets a maximum estimated cost threshold.

6. Executing keep_hierarchy pass.
Marking top.cpu (module too big: 8880 > 100).
Marking top.cpu.alu (module too big: 2068 > 100).
Marking top.cpu.brancher (module too big: 515 > 100).
Marking top.cpu.datamem (module too big: 1235 > 100).
Marking top.cpu.imm_decoder (module too big: 226 > 100).
Marking top.cpu.regfile (module too big: 4631 > 100).
<suppressed ~635 debug messages>

7. Executing FLATTEN pass (flatten design).
Keeping top.cpu.regfile (found keep_hierarchy attribute).
Keeping top.cpu.imm_decoder (found keep_hierarchy attribute).
Keeping top.cpu.datamem (found keep_hierarchy attribute).
Keeping top.cpu.brancher (found keep_hierarchy attribute).
Keeping top.cpu.alu (found keep_hierarchy attribute).
Deleting now unused module top.cpu.controller.
Deleting now unused module top.cpu.writeback.
<suppressed ~2 debug messages>

Effects on runtime and QoR with OpenROAD: TBD

whitequark commented 2 months ago

This PR creates cost.cc with linear cost models for almost all internal cell types to estimate the size of a module after techmapping.

I'm really interested in your methodology here--can you explain it in detail?

widlarizer commented 2 months ago
widlarizer commented 2 months ago

I'm really interested in your methodology here--can you explain it in detail?

@whitequark Sure: I used test_cell to generate random sizes of cells and to techmap them. Then I parse the dump and stats to get the gate count and input/output widths. I then stared at plots like this. I looked for upper bounds in gate count with regards to output port width, sum of all port widths, and the largest input port width. This worked better than I expected. For cells unsupported by test_cell I looked at their techmap or semantics. There's a TODO comment for ones where I'm not sure if I should expect them at the point right before flattening, namely, FSM and memory.

image

widlarizer commented 2 months ago

An executable that calculates the costs according to it

This is pretty intense for the current sole use case which is as a heuristic flattening only modules that aren't huge. For what it's worth, if these were all set to sum*1 except for mul+div, it probably would achieve similar behavior (I can try this out now that I got openroad benchmarks running locally). I can't anticipate other possible uses so I used realistic coefficients which I arrived at by staring at those plots. But it's not supposed to capture perfectly the simplest possible upper bound. A simple executable may even fail to get a reasonable coefficient randomly due to constants when something changes in techmap down the line

widlarizer commented 1 month ago

Sorry for the spam @zachjs, I accidentally committed changes in ast that I only used to play around

widlarizer commented 1 month ago

@whitequark test_cell is now capable of checking whether the cost is a correct post techmap gate count upper bound. This means the coefficients aren't generated programmatically, but at least they are verified, at least for cells covered by test_cell functionality. Use case example: test_cell -noeval -nosat -check_cost -bloat 4 all. I have also split it away from existing default_gate_cost beyond a simple check. Let me know what you think.

Open questions:

whitequark commented 1 month ago

test_cell is now capable of checking whether the cost is a correct post techmap gate count upper bound. This means the coefficients aren't generated programmatically, but at least they are verified, at least for cells covered by test_cell functionality.

That is much better! I'll take a closer look a bit later.

povik commented 1 month ago

find a cost model for $lut

I propose we supply a conservative upper bound based on the width of the $lut cell alone, that is, we shouldn't bother with inspecting the LUT parameter. Usually $lut are not a cell entered by the user but are a product of lut-mapping passes only, so there isn't much of a use case where someone would be applying the cost model to it. For that reason let's go with what's easiest but correct.

widlarizer commented 1 month ago

Current status and intended usage. I think I'll have to leave it as is for now. I think it's good enough as a heuristic and should move on to more pressing topics

test_cell -noeval -nosat -bloat 4 -check_cost \$not \$pos \$neg \$and \$or \$xor \$xnor \$reduce_and \$reduce_or \$reduce_xor \$reduce_xnor \$reduce_bool \$shl \$shr \$sshl \$sshr \$shift \$shiftx \$lt \$le \$eq \$ne \$ge \$gt \$add \$sub \$logic_not \$logic_and \$logic_or
Warning: Cell type $shl failed in 3.0% cases with worst offender being by 12 (120.0%)
Warning: Cell type $sshl failed in 2.0% cases with worst offender being by 11 (110.0%)

test_cell -noeval -nosat -bloat 1 -check_cost \$mux \$bmux \$demux
Warning: Cell type $demux failed in 8.0% cases with worst offender being by 30 (23.4%)

test_cell -noeval -nosat -check_cost -bloat 2 \$mul \$div \$mod \$divfloor \$modfloor 
Warning: Cell type $div failed in 1.0% cases with worst offender being by 66 (6.7%)
Warning: Cell type $modfloor failed in 3.0% cases with worst offender being by 433 (33.8%)

test_cell -noeval -nosat -bloat 2 -check_cost \$alu \$lcu \$fa
Warning: Cell type $alu failed in 20.0% cases with worst offender being by 13 (10.2%)

test_cell -noeval -nosat -noopt -bloat 2 -check_cost \$lut \$sop
phsauter commented 3 weeks ago

I don't know how far you want to go with this but you could also compare your approach against theoretical approaches to calculate the cost of arithmetic operations. For example something like Zimmermanns thesis on page 54 in the pdf (page 92 of the booklet) gives you the sizes of different adder architectures in a few different ways (equation, gate-representation and area in a specific technology). Currently Yosys uses Brent-Kung (PPA-BK) for its adders.

It is also important to note that if you set demanding timing goals and give ABC an aggressive script, it will start to deviate from this architecture and go towards the performance characteristics of others. So depending on how you use ABC, the final result for large operations and timing critical (deep) paths can deviate drastically from what you may observe with a more relaxed ABC script.

I don't think this is a huge problem since the intended use-case is likely to get rid of really small modules, so they are less likely to have these big arithmetic operations anyway.

I would recommend there is a note in the help-message of keep_hierarchy clarifying this use-case and that outside of it the estimate may no longer be accurate.