Open JulianKemmerer opened 2 years ago
Related paper: https://www.icsi.berkeley.edu/~nweaver/papers/2003-cslow.pdf
Good idea for next stages of the project aka ultra fine tuning.
BTW, synth tools should infer regs automatically. We could pick a LUT type for some FPGA architecture, search for groups of logic gates that can be represented by a 3/2 or 6/1 LUT, and after that extraction place one reg there. But this is too low level imo. Maybe we could suggest it to pyrtl?
Yeah I think suggesting something like some basic FPGA arch modeling as part of pyrtl - to accompany their ASIC modeling - makes a lot of sense (like you said, pick a LUT-N arch of some sort)
Ultra fine tuning
yes I like that phrase
For context, the lowest level of feedback (which pyrtl can now provide) is: what is the critical path delay?
And that's it. The tool can blindly use that single number to try and adjust its pipelining guesses.
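To make the 'blindly use that single number' loop concrete, here is a minimal sketch. It is not how PipelineC actually iterates. `toy_critical_path_ps` is a hypothetical stand-in for a real synthesis run (it assumes the logic splits evenly across stages plus a fixed register overhead), and `find_stage_count` is an illustrative name, not an existing function.

```python
import math


def toy_critical_path_ps(total_delay_ps, stages, reg_overhead_ps=500):
    """Hypothetical stand-in for a syn+pnr run: returns one number,
    the critical path delay, assuming logic divides evenly across stages."""
    return total_delay_ps / stages + reg_overhead_ps


def find_stage_count(target_ps, measure, max_stages=64):
    """Adjust the pipelining guess using only the reported critical path delay."""
    stages = 1
    while stages <= max_stages:
        delay = measure(stages)
        if delay <= target_ps:
            return stages, delay
        # Blind guess: scale the stage count by how far over target we are,
        # always adding at least one stage so the loop makes progress.
        stages = max(stages + 1, math.ceil(stages * delay / target_ps))
    return None  # target not reachable within max_stages


# 20 ns of logic, 4 ns clock period target
result = find_stage_count(4000, lambda s: toy_critical_path_ps(20000, s))
print(result)
```

With a real tool in place of the toy model, each call to `measure` is a full synthesis run, which is why better-targeted feedback than a single number is attractive.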
Then there is slightly better: which MAIN function specifically did the critical path occur in?
That only applies to designs with multiple MAIN funcs and/or multiple clock domains.
Next there is a currently-broken fine grain mode: which submodule instance exactly was the critical path inside?
That allows even more targeted pipelining guess iterations, but it requires interpreting the syn+pnr output and tracing back to the original module in VHDL, which - given all the optimizations and name mangling - is quite hard to do reliably.
And then how fun to think about this ultra fine grain mode of figuring out which post-PNR LUTs correspond to which original HDL modules - for deciding where to turn on those free pipelining regs (unless trying that 'turn on all regs' mode as an experiment)
Or well, I suppose at the LUT level you don't need to know what part of the HDL it maps from. If you know the delays across LUTs, etc. and are modeling those paths, you can probably just start turning on regs in the comb. path based on delay alone.
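A rough sketch of 'turning on regs based on delay alone', assuming a single linear chain of LUTs with known per-LUT delays in ps (e.g. numbers pulled from a timing report). The function names and the delay values are made up for illustration. Routing delay and register setup/clk-to-q are ignored.

```python
def place_registers(lut_delays_ps, target_ps):
    """Greedy pass along one path: enable the register after LUT i-1
    whenever adding LUT i would push the current stage over target_ps.
    Returns the indices of LUTs whose output register is turned on."""
    regs = []
    acc = 0
    for i, delay in enumerate(lut_delays_ps):
        if acc > 0 and acc + delay > target_ps:
            regs.append(i - 1)
            acc = 0
        acc += delay
    return regs


def stage_delays(lut_delays_ps, regs):
    """Combinational delay of each pipeline stage for a given register set."""
    reg_set = set(regs)
    stages, acc = [], 0
    for i, delay in enumerate(lut_delays_ps):
        acc += delay
        if i in reg_set:
            stages.append(acc)
            acc = 0
    stages.append(acc)  # final stage, ends at the path's own output
    return stages


path = [1000, 1500, 500, 2000, 1000]  # illustrative LUT delays in ps
regs = place_registers(path, target_ps=3000)
print(regs, stage_delays(path, regs))

# 'Turn on all regs' mode: a register after every LUT except the last,
# so the worst stage is just the slowest single LUT.
all_regs = list(range(len(path) - 1))
print(max(stage_delays(path, all_regs)))
```

A real netlist is a graph rather than one chain, so registers would also have to be balanced across reconverging paths. This only shows the delay-driven decision itself.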
To me initially it shouldn't support all kinds of chips, only the ones supported by open source tools (yosys and nextpnr). Having all the data at hand, a better tool can be designed. Then, when it works, it could be ported to commercially supported chips.
I think most FPGAs (right now) support 3/2 LUTs or 6/1. Anyway, we could perform the opt by putting registers after e.g. a 3-input function that has 2 outputs, etc. A Xilinx CLB has 2 slices and each slice has a mux on its output.
I think we could take every PipelineC 'raw VHDL' pipelineable primitive, for ex. the simple add operator (can divide up an N bit add into however many stages you want*).
What I would want to know is something like: if you want to ex. 'cut a 64b add into 3 cycles', how many bits per stage is that such that we make best use of LUTs?
Maybe could start at the maximum and reverse - what is the upper limit of pipelining an ex. 64b adder - ex. how many bits per stage maps to the highest fmax pipeline?
I did some experimenting once and there are definitely things like, idk, a 7b adder having less delay than a 3b adder,
or something like that - it's not a simple mapping of bits per stage to delay
Blah blah let me know if this rant makes sense
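For the 'cut a 64b add into 3 cycles' question, the naive starting point is an even split of bits across stages. A small sketch, where `split_bits` and `candidate_splits` are illustrative names rather than anything in PipelineC. Since the mapping from bits per stage to delay is not simple, each candidate would still need to be synthesized and measured.

```python
def split_bits(width, stages):
    """Split an adder of `width` bits into `stages` chunks as evenly as
    possible; the leftover bits go to the first chunks."""
    if not 1 <= stages <= width:
        raise ValueError("stages must be between 1 and width")
    base, extra = divmod(width, stages)
    return [base + 1 if i < extra else base for i in range(stages)]


def candidate_splits(width):
    """Every stage count from 1 (fully combinational) up to `width`
    (one bit per stage, the upper limit of pipelining)."""
    return {stages: split_bits(width, stages) for stages in range(1, width + 1)}


print(split_bits(64, 3))  # [22, 21, 21]

# Widest chunk per stage count, i.e. the bit widths worth measuring
sweep = candidate_splits(64)
widest = sorted({max(chunks) for chunks in sweep.values()})
print(widest)
```

Sweeping `widest` through the synth tool would give a bits-per-stage vs delay table, which is where oddities like a wider adder beating a narrower one would show up.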
Remembered this issue
why not call it the -O3 flag and say 'now your rendered HDL is unreadable/a netlist of LUTs'?
So the PipelineC HDL gets synthesized to LUTs first -> re-imported as a PipelineC dataflow to be pipelined = each LUT/~prim is a C func, kinda thinking
To be clear, it is possible today to write C functions wrapping raw VHDL that instantiates LUTs, and from there you could, with some extra work, get the compiler to use various LUTs with or without IO registers to construct pipeline primitives, up to enough to replace PipelineC raw VHDL operators, ex. add two u32's
Thanks Bartus: https://essay.utwente.nl/79103/1/Kruiper_BA_EEMCS.pdf
This paper from Bartus is so good. I have an application that needs 8-bit multipliers, and the paper shows how to reach 410MHz using LUTs and pipelining, instead of 257MHz with DSPs
Likely part of #46 and #48 too
Once dealing with device specific netlists, might as well also see if the tools provide .sdf output, which should detail the timing of each LUT IIUC
Thanks @suarezvictor for bringing up
A whole world of optimizations exists at the LUT level. Especially post PNR.
@suarezvictor was quick to point out that FPGAs have essentially 'free' registers that make pipelining easy. You could even turn on the registers between every single LUT for maximum FMAX.