JulianKemmerer / PipelineC

A C-like hardware description language (HDL) adding high level synthesis(HLS)-like automatic pipelining as a language construct/compiler feature.
https://github.com/JulianKemmerer/PipelineC/wiki
GNU General Public License v3.0

Make PipelineC LUT aware #45

Open JulianKemmerer opened 2 years ago

JulianKemmerer commented 2 years ago

A whole world of optimizations exists at the LUT level, especially post-PNR.

@suarezvictor was quick to point out that FPGAs have essentially 'free' registers that make pipelining easy. You could even turn on the registers between every single LUT for maximum FMAX.

suarezvictor commented 2 years ago

Related paper: https://www.icsi.berkeley.edu/~nweaver/papers/2003-cslow.pdf

bartokon commented 2 years ago

Good idea for next stages of the project aka ultra fine tuning.

BTW, synth tools should infer regs automatically. We could pick a LUT type for some FPGA architecture, search for logic that can be represented by a 3/2 (3-input, 2-output) or 6/1 LUT, and after that extraction place one reg there. But this is too low-level IMO. Maybe we could suggest it to pyrtl?

JulianKemmerer commented 2 years ago

Yeah, I think suggesting some basic FPGA arch modeling as part of pyrtl, to accompany their ASIC modeling, makes a lot of sense (like you said, pick a LUT-N arch of some sort).

JulianKemmerer commented 2 years ago

Ultra fine tuning yes I like that phrase

For context, the lowest level of feedback (which pyrtl can now provide) is simply 'what is the critical path delay?' and that's it. The tool can blindly use that single number to try to adjust its pipelining guesses.

Slightly better is 'which MAIN function specifically did the critical path occur in?', which only applies to designs with multiple MAIN funcs and/or multiple clock domains.

Next there is a currently-broken fine-grain mode: 'which submodule instance exactly was the critical path inside?', for even more targeted pipelining guess iterations. But that requires interpreting the syn+pnr output and tracing back to the original module in VHDL, which, given all the optimizations and name mangling, is quite hard to do reliably.

And then how fun to think about this ultra-fine-grain mode of figuring out which post-PNR LUTs correspond to which original HDL modules, for deciding where to turn on those free pipelining regs (unless trying that 'turn on all regs' mode to experiment with).
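To make the coarsest feedback level concrete, here is a minimal Python sketch of a loop that only ever sees one number per run, the critical path delay. Everything in it is hypothetical: `tune_stage_count`, `toy_tool` and the delay formula are stand-ins for a real syn+pnr run, not PipelineC's actual implementation.

```python
import math


def tune_stage_count(measure_delay_ns, target_ns, max_stages=64):
    """Blindly raise the pipeline stage count until the reported
    critical path delay meets the target clock period."""
    stages = 1
    while stages <= max_stages:
        delay_ns = measure_delay_ns(stages)
        if delay_ns <= target_ns:
            return stages, delay_ns
        # Guess: assume delay shrinks roughly in proportion to stage count,
        # but always advance by at least one stage.
        scaled = math.ceil(stages * delay_ns / target_ns)
        stages = max(stages + 1, scaled)
    raise RuntimeError("no stage count within max_stages met the target")


def toy_tool(stages):
    """Hypothetical stand-in for a tool run: 20 ns of logic split evenly
    across the stages, plus 0.5 ns of fixed register overhead."""
    return 20.0 / stages + 0.5


stages, delay_ns = tune_stage_count(toy_tool, target_ns=4.0)
```

The finer-grained modes described above would replace the single `measure_delay_ns` number with per-MAIN or per-instance information, so the guess could be targeted instead of global.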

JulianKemmerer commented 2 years ago

Or, well, I suppose at the LUT level you don't need to know what part of the HDL a LUT maps from. If you know the delays across LUTs etc. and are modeling those paths, you can probably just start turning on regs in the comb. path based on delay alone.
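A rough sketch of "turn on regs based on delay alone", assuming the path is a simple chain of elements with known delays (real post-PNR paths are graphs with routing delay, so this is only illustrative). The function names and the integer-picosecond delays are made up for the example.

```python
def place_registers(delays_ps, target_ps):
    """Greedy cut placement along one combinational chain.

    Returns the indices of the elements whose output register gets
    enabled so that no stage exceeds target_ps.
    """
    cuts = []
    acc = 0
    for i, delay in enumerate(delays_ps):
        if delay > target_ps:
            raise ValueError(f"element {i} alone exceeds the target")
        if acc + delay > target_ps:
            cuts.append(i - 1)  # enable the register after the previous element
            acc = 0
        acc += delay
    return cuts


def stage_delays(delays_ps, cuts):
    """Per-stage delay totals that result from a set of cuts."""
    cut_set = set(cuts)
    stages, acc = [], 0
    for i, delay in enumerate(delays_ps):
        acc += delay
        if i in cut_set:
            stages.append(acc)
            acc = 0
    stages.append(acc)
    return stages


path_ps = [500, 600, 400, 700, 300]  # made-up LUT + routing delays
cuts = place_registers(path_ps, target_ps=1200)
```

With these made-up numbers the cuts land after elements 1 and 3, giving stages of 1100, 1100 and 300 ps. A greedy pass like this meets the target but does not balance the stages; that would need a second step.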

suarezvictor commented 2 years ago

To me initially it shouldn't support all kinds of chips, only the ones supported by open source tools (yosys and nextpnr). Having all the data at hand, a better tool can be designed. Then, when it works, it could be ported to commercially supported chips.

bartokon commented 2 years ago

I think most FPGAs (right now) support 3/2 LUTs or 6/1. Anyway, we could perform the opt by putting registers after, e.g., each 3-input function that has 2 outputs. A Xilinx CLB has 2 slices and each slice has a mux on its output.

JulianKemmerer commented 2 years ago

I think we could take every PipelineC 'raw VHDL' pipelineable primitive, for ex. the simple add operator (you can divide up an N-bit add into however many stages you want*).

What I would want to know is something like: if you want to, e.g., 'cut a 64b add into 3 cycles', how many bits per stage is that such that we make best use of LUTs?

Maybe could start at the maximum and work backwards: what is the upper limit of pipelining, e.g., a 64b adder, i.e. how many bits per stage maps to the highest-fmax pipeline?

I did some experimenting once and there are definitely things like, idk, a 7b adder having less delay than a 3b adder or something like that. It's not a simple mapping of bits per stage to delay.
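As a starting point for the 'cut a 64b add into 3 cycles' question, here is a small Python sketch that splits the bits as evenly as possible and checks that adding chunk by chunk, with a carry passed between stages, still gives the right sum. The even split is an assumption for illustration only; as noted above, delay is not a simple function of bits per stage, so a LUT-aware version would pick the widths from measured delays. `chunk_widths` and `staged_add` are hypothetical names.

```python
def chunk_widths(total_bits, stages):
    """Split total_bits into `stages` chunks, as evenly as possible."""
    base, extra = divmod(total_bits, stages)
    return [base + 1 if i < extra else base for i in range(stages)]


def staged_add(a, b, widths):
    """Add a and b one chunk per stage, least significant chunk first.

    Each loop iteration models the work of one pipeline stage: a narrow
    add plus the carry handed over from the previous stage.
    """
    result, carry, shift = 0, 0, 0
    for width in widths:
        mask = (1 << width) - 1
        partial = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (partial & mask) << shift
        carry = partial >> width
        shift += width
    return result  # final carry out is dropped, like a fixed-width add


widths = chunk_widths(64, 3)  # [22, 21, 21]
total = staged_add(123456789012345, 987654321098765, widths)
```

Swapping `chunk_widths` for a table of measured per-width delays is where the LUT awareness would come in.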

Blah blah let me know if this rant makes sense

JulianKemmerer commented 2 years ago

Remembered this issue

why not call it the -O3 flag and say 'now your rendered HDL is unreadable/a netlist of LUTs'?

So the PipelineC HDL gets synthesized to LUTs first -> re-imported as a PipelineC dataflow to be pipelined, where each LUT/~prim is a C func, kinda thinking

JulianKemmerer commented 1 year ago

To be clear, it is possible today to write C functions wrapping raw VHDL that instantiates LUTs. From there you could, with some extra work, get the compiler to use various LUTs with or without IO registers to construct pipeline primitives, up to the point of replacing PipelineC raw VHDL operators, ex. adding two u32's.
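To illustrate the 'each LUT is a function' idea, here is a behavioral Python model (not PipelineC or VHDL) of a u32 add built only from 3-input LUTs. The `lut` helper and the truth-table constants are assumptions made for this sketch: each constant is a bit vector indexed by the packed inputs, similar in spirit to a vendor LUT init value.

```python
def lut(init, *inputs):
    """Generic N-input LUT: `init` is the truth table as an integer,
    and the inputs (first input = LSB) select one bit of it."""
    index = 0
    for position, bit in enumerate(inputs):
        index |= (bit & 1) << position
    return (init >> index) & 1


SUM_INIT = 0x96    # a XOR b XOR carry_in
CARRY_INIT = 0xE8  # majority(a, b, carry_in)


def add_u32(a, b):
    """Ripple-carry add of two u32 values using two LUT3s per bit."""
    result, carry = 0, 0
    for i in range(32):
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        result |= lut(SUM_INIT, a_bit, b_bit, carry) << i
        carry = lut(CARRY_INIT, a_bit, b_bit, carry)
    return result  # wraps at 32 bits, carry out dropped


total = add_u32(1234567, 7654321)
```

In hardware, each `lut` call would be one wrapped LUT instance, and the optional IO registers mentioned above are what would let the compiler cut this chain into stages. Real FPGAs also have dedicated carry chains, which this model ignores.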

JulianKemmerer commented 1 year ago

Thanks Bartus: https://essay.utwente.nl/79103/1/Kruiper_BA_EEMCS.pdf

suarezvictor commented 1 year ago

This paper from Bartus is so good. I have an application that needs 8-bit multipliers, and the paper shows how to reach 410MHz using LUTs and pipelining, instead of 257MHz with DSPs.

JulianKemmerer commented 1 year ago

Likely part of #46 and #48 too

Once dealing with device-specific netlists, might as well also see if the tools provide .sdf output, which should detail the timing of each LUT IIUC.

https://en.wikipedia.org/wiki/Standard_Delay_Format#:~:text=Standard%20Delay%20Format%20(SDF)%20is,verification%20and%20static%20timing%20analysis.

Thanks @suarezvictor for bringing it up
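For a feel of what per-LUT timing from an .sdf file could look like, here is a hedged Python sketch that pulls the worst IOPATH delay per cell instance out of a tiny hand-written SDF fragment. The sample text, cell types and instance names are invented for illustration and do not come from any real tool run, and the regex approach only handles this simple shape, not full SDF.

```python
import re

# Hand-written illustrative fragment, not real tool output.
SAMPLE_SDF = """
(DELAYFILE
  (SDFVERSION "3.0")
  (TIMESCALE 1ps)
  (CELL (CELLTYPE "LUT4") (INSTANCE lut_a)
    (DELAY (ABSOLUTE
      (IOPATH A Z (100:120:150) (110:130:160))
      (IOPATH B Z (90:100:140) (95:105:145)))))
  (CELL (CELLTYPE "LUT4") (INSTANCE lut_b)
    (DELAY (ABSOLUTE
      (IOPATH A Z (200:220:250) (210:230:260)))))
)
"""

INSTANCE_RE = re.compile(r"\(INSTANCE\s+([^\s()]+)\)")
# One IOPATH entry: two port names followed by one or more delay triples.
IOPATH_RE = re.compile(r"\(IOPATH\s+\S+\s+\S+((?:\s*\([^()]*\))+)\s*\)")
TRIPLE_RE = re.compile(r"\(([^()]*)\)")


def worst_case_delays(sdf_text):
    """Map each cell instance to its largest IOPATH delay value."""
    delays = {}
    for cell in re.split(r"\(CELL\s", sdf_text)[1:]:
        instance = INSTANCE_RE.search(cell)
        if instance is None:
            continue  # skip cells without a named instance
        worst = 0.0
        for triples in IOPATH_RE.findall(cell):
            for triple in TRIPLE_RE.findall(triples):
                values = [float(v) for v in triple.split(":") if v.strip()]
                if values:
                    worst = max(worst, max(values))
        delays[instance.group(1)] = worst
    return delays


lut_delays = worst_case_delays(SAMPLE_SDF)
```

Per-instance numbers like these are the kind of input a delay-driven register placement pass would need, assuming the tools in use can emit them.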