JulianKemmerer / PipelineC

A C-like hardware description language (HDL) adding high-level synthesis (HLS)-like automatic pipelining as a language construct/compiler feature.
https://github.com/JulianKemmerer/PipelineC/wiki
GNU General Public License v3.0
593 stars 48 forks

Better search for fmax goals #48

Open JulianKemmerer opened 2 years ago

JulianKemmerer commented 2 years ago

Relating to, but not the same as Faster timing estimates https://github.com/JulianKemmerer/PipelineC/issues/46

Once we have some mechanism to get timing estimates^ - i.e. able to do circuit input -> timing info out - how does PipelineC then use that info to search for a particular fmax goal in the fastest way possible?

JulianKemmerer commented 2 years ago

Both @bartokon and @suarezvictor have said things along the lines of 'do a binary search'.

bartokon has some experimental plots of the #stages vs fmax curves we would be "searching along", and it's not clear a binary search is going to work?

JulianKemmerer commented 2 years ago

It is important to note the search is not always about getting to the absolute maximum fmax possible - we should have a mode for that. But the more typical case, searching towards a specific fmax goal (at the lowest latency possible, implied), is also needed. You can even just set the specific fmax goal to your device's manufacturer-spec'd fmax to approximate 'as fast as possible'. I'll comment on what the basic search tool currently does soon.

bartokon commented 2 years ago

Decision trees can get stuck in local minima. We could add some border cases to the search and try to offset the problem.

bartokon commented 2 years ago

For now, I think, we should analyze pyrtl feedback with critical path times, group wires into distinct speed categories (time delay categories), and put registers there. For example: https://machinelearningmastery.com/clustering-algorithms-with-python/ K-Means? Two categories, fast and slow? Some other methods to deduce which wires need to be sliced to achieve better fmax?
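For illustration only, a tiny pure-Python 1-D 2-means along the lines of the clustering link above, splitting wire delays into 'fast' and 'slow' groups (the delay numbers are made up):

```python
def two_means_1d(delays, iters=20):
    """Tiny 1-D 2-means: split wire delays (ns) into 'fast' and 'slow' groups."""
    fast_c, slow_c = min(delays), max(delays)  # init centers at the extremes
    fast, slow = [], []
    for _ in range(iters):
        # Assign each delay to the nearer center
        fast = [d for d in delays if abs(d - fast_c) <= abs(d - slow_c)]
        slow = [d for d in delays if abs(d - fast_c) > abs(d - slow_c)]
        # Recompute centers as the mean of each group
        if fast:
            fast_c = sum(fast) / len(fast)
        if slow:
            slow_c = sum(slow) / len(slow)
    return fast, slow

fast, slow = two_means_1d([0.2, 0.3, 0.25, 4.1, 3.8, 0.4])
# The 'slow' group (here the 3.8ns and 4.1ns wires) would be the
# register-insertion candidates
```
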

JulianKemmerer commented 2 years ago

I don't think I understand "analyze pyrtl feedback with critical times and group wires into distinct speed categories and put registers there".

I think that assumes way too much about how well names are preserved through GHDL->Yosys->BLIF->pyrtl at the moment. But maybe not / I could imagine that not being an issue...

I'll explain how the tool currently places registers...

JulianKemmerer commented 2 years ago

Say you have a my_pipeline that is reported to have a total delay of 100ns - that's comb. logic without pipeline registers, etc.

Say your device fmax is 500MHz (2ns period), and you want to reach that absolute fmax. Great, says the tool. To meet an fmax with a 2ns period, we need to divide our 100ns of delay into 100ns/2ns = 50 equal stages.

Then it goes at the task of dividing my_pipeline into 50 stages. How does it do that? It looks at the delays of my_pipeline's submodules/function calls. Using those delays, and the dataflow of the C code, it makes a directed acyclic graph (DAG). Given the 'layout' of the DAG, it becomes apparent that something like 'cut my_pipeline here' equates to 'cut specifically that submodule there'. That process repeats for all submodules, all the way down the call stack, until 'cut some_pipeline here' is just rendering a chunk of VHDL for a small some_pipeline operation (think add/sub, etc.)
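The even-slicing arithmetic above can be sketched as follows (hypothetical helper names, not the tool's actual API):

```python
import math

def initial_stage_count(total_delay_ns, target_fmax_mhz):
    """Equal stages needed so each stage's comb. delay fits the target period."""
    target_period_ns = 1000.0 / target_fmax_mhz
    return math.ceil(total_delay_ns / target_period_ns)

def cut_positions_ns(total_delay_ns, n_stages):
    """Evenly spaced cut points (ns from the pipeline input) for n_stages."""
    step = total_delay_ns / n_stages
    return [step * i for i in range(1, n_stages)]

# 100ns of comb. delay at a 500MHz (2ns period) goal -> 50 equal stages
print(initial_stage_count(100.0, 500.0))  # -> 50
```

Each cut position then gets mapped down the call-stack DAG to 'cut specifically that submodule there', as described above.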

JulianKemmerer commented 2 years ago

Then it gets feedback from the syn+pnr tool saying: your slicing of 50 stages actually got to ~350MHz. What do we do next? It roughly (some hand-crafted stuff here) says: if 50 stages actually works out to 350MHz (2.85ns period), then it would seem the actual total comb. delay in this circuit is ~50 stages * 2.85ns = 142ns. Repeat the above steps as if the circuit had 142ns of total delay from the start, as a better estimate.
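That feedback loop can be sketched like this - `run_syn_pnr` stands in for a real syn+pnr invocation reporting achieved fmax in MHz (hypothetical names, not the tool's actual code):

```python
import math

def search_stages(delay_guess_ns, target_fmax_mhz, run_syn_pnr, max_iters=10):
    """Re-slice until the achieved fmax meets the target.
    run_syn_pnr(n_stages) -> achieved fmax in MHz from the syn+pnr tool."""
    target_period_ns = 1000.0 / target_fmax_mhz
    n = max(1, math.ceil(delay_guess_ns / target_period_ns))
    for _ in range(max_iters):
        achieved_mhz = run_syn_pnr(n)
        if achieved_mhz >= target_fmax_mhz:
            return n
        # Missed timing: re-estimate total comb. delay from the achieved
        # period (e.g. 50 stages at 350MHz -> ~50 * 2.85ns = ~142ns), re-slice
        delay_guess_ns = n * (1000.0 / achieved_mhz)
        n = math.ceil(delay_guess_ns / target_period_ns)
    return n
```
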

JulianKemmerer commented 2 years ago

And it's that kind of iterative looping - finding out you were off and re-adjusting - that exists today.

Important: this method, and anything that immediately replaces it, works off just a single number from the syn+pnr tool - the fmax of the circuit (i.e. max path delay in ns). That makes it super easy to add support for other tools / timing models.

More detailed feedback like 'the critical path was in this specific submodule, stage 3' is possible, but not easily for all tools/flows.

bartokon commented 2 years ago

Aaaahhh, okay, now I know how it works. I thought that you analyze at the netlist level, and if you pipeline a netlist it could be directly inserted into the FPGA flow and resynthesized (pipeline not the C model but the netlist). So if some net connection - in other words, a path - is looong, you could slice it in half and put a register in the middle. That way we could aim at critical paths, and if netlists are supported then all other tools, and even encrypted IP, are supported IF you have a netlist.

So my proposal is too low level for now, I think. So the best we can do is detect the worst module and slice it N times. We could slice each module separately with multiprocessing and detect the optimal slice level with, for example, binary search (or, as I would like, have some fun with some ML models and predict as many slicing parameters as I possibly can :D).
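A minimal sketch of the binary-search idea for one module, assuming the reported fmax is (roughly) non-decreasing in the number of stages - which, per the earlier comment about the #stages vs fmax plots, may not always hold (all names hypothetical):

```python
def min_stages_meeting_fmax(fmax_of, target_mhz, lo=1, hi=64):
    """Smallest stage count whose reported fmax meets the target, by binary
    search. Assumes fmax_of(n) is (roughly) non-decreasing in n."""
    while lo < hi:
        mid = (lo + hi) // 2
        if fmax_of(mid) >= target_mhz:
            hi = mid  # mid meets timing; try fewer stages
        else:
            lo = mid + 1  # mid misses timing; need more stages
    return lo
```

Each probe of `fmax_of` is a full syn+pnr run, so this is where the multiprocessing idea (slicing modules in parallel) would pay off.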

JulianKemmerer commented 2 years ago

Yeah, that's good thinking - perhaps that's what actually occurs for larger, non-coarse designs. In that mode, e.g. for our 2ns period target, the tool does a slightly different sweep: it asks, which are the smallest submodules larger than 2ns in delay? It might say "your 32b multiplier is the smallest submodule larger than 2ns in comb. delay". Then, instead of saying 'slice my_pipeline', it says: slice all 32b multipliers like so to meet fmax. And then, given that slicing of smaller submodules, rebuild the entire top-level my_pipeline out of those pieces that supposedly meet timing.

Regardless, the 'slice a module N times' process is roughly the same whether it's at the top level or not (it's that 'down the call stack' slicing until it's raw VHDL you are manipulating).
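The 'smallest submodule larger than the target period' selection could be sketched as (hypothetical names and delay numbers):

```python
def smallest_oversized(submodule_delays_ns, target_period_ns):
    """Smallest submodule whose comb. delay still exceeds the target period -
    the candidate to slice internally before rebuilding the top level."""
    over = [(d, name) for name, d in submodule_delays_ns.items()
            if d > target_period_ns]
    return min(over)[1] if over else None

# e.g. a 32b multiplier at 3.1ns is the smallest thing over a 2ns period
print(smallest_oversized({"mul32": 3.1, "fadd": 5.0, "add32": 0.9}, 2.0))
```
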

JulianKemmerer commented 2 years ago

Definitely down to have some fun ML experiments to see what better predictions can be done :D

suarezvictor commented 2 years ago

Better not to reinvent the wheel https://github.com/MattRighetti/leiserson-retiming/wiki

bartokon commented 2 years ago

> Better not to reinvent the wheel https://github.com/MattRighetti/leiserson-retiming/wiki

Haha can't wait for new pipelinec v2 :D

JulianKemmerer commented 2 years ago

Oh gosh v2 :hourglass_flowing_sand:

> We model a circuit as a graph in which the vertex set V is a collection of combinational logic elements and the edge set E is the set of interconnections

this connects to https://github.com/JulianKemmerer/PipelineC/issues/45 as well.

If I were modeling at the LUT level - then each LUT becomes a comb. element in the network that the re-timing tool is analyzing. It doesn't need to be LUT level - but some fixed combinatorial primitives seem to be needed.

I don't really have those? My comb. primitives can be split / sliced. So I don't know what the smallest comb. chunks are for the tool to move around.

Maybe I need to ask - does that retiming tool have a way to say 'split this comb. element's delay in half'?

At the moment it sounds like version 2 for sure - I need to think about how the current way things work would map to some fixed comb. primitive network you could throw into some other tool.

bartokon commented 2 years ago

Vivado: artix_times.txt / nextpnr: nextpnr_times.txt

JulianKemmerer commented 1 year ago

Can try to follow up on this work here https://github.com/JulianKemmerer/PipelineC/blob/master/src/DEVICE_MODELS.py

suarezvictor commented 1 year ago

Let me know if you agree with what I'm musing, in regards to estimating the largest delay of a path in a netlist: We'll have an estimate; then, when you send the netlist to synthesis, you'll have the "real" value. Let's assume that both our model and synthesis produce a strictly monotonic (meaning "always increasing") delay result as a function of path length and number of stages. Our model may differ from reality, but since there would be a search for the optimal point (using strictly monotonic functions in both cases), we can stick to the "real" results for the search and correct the model with the difference from reality. In that way, we'd be using a differential function, that is, an estimate of "how much more or less delay we get if we increase or decrease, respectively, the netlist". It seems that would work, with errors in the model just affecting how fast we can find the sweet spot. A good property of this, if it holds true, is that the naive "sum all individual delays" along the path keeps the function monotonic, as required.

UPDATE: note that to get the "differential" version of the estimator (i.e. how a change produces a change), the absolute version can be called with the "old" and "new" values and the results simply subtracted.
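The UPDATE above - deriving the differential estimator by calling the absolute estimator twice and subtracting - can be sketched as (hypothetical names; the naive absolute model just sums delays along the path):

```python
def naive_path_delay(element_delays_ns):
    """Naive absolute model: sum individual element delays along the path.
    Monotonic in path length, as the argument above requires."""
    return sum(element_delays_ns)

def differential_estimate(abs_model, old_path, new_path):
    """Differential estimator: the absolute model evaluated on the 'new' and
    'old' netlist paths, then subtracted."""
    return abs_model(new_path) - abs_model(old_path)

# Adding a 0.5ns element to the path -> +0.5ns predicted delay change
print(differential_estimate(naive_path_delay, [1.0, 2.0], [1.0, 2.0, 0.5]))
```
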

JulianKemmerer commented 1 year ago

Please see #46 for recent new caching stuff of period values for pipelined built in functions ex. pipeline_min_period_cache/vivado/xc7a100tcsg324-1/syn/BIN_OP_PLUS_float_8_14_t_float_8_14_t_0.25.delay

suarezvictor commented 1 year ago

the RWRoute paper has timing model equations used in RapidWright (see Page 13) http://www.rapidwright.io/docs/Papers.html

JulianKemmerer commented 1 year ago

Ooo nice - I think trying to use RapidWright ~instead of Vivado is an option that could be ~easily explored for getting faster timing modeling with Xilinx FPGAs

Using their equations, or something like them, to derive our own timing models for all/any FPGA architectures also sounds possible - but is indeed a more difficult/experimental long-term goal