JulianKemmerer / PipelineC

A C-like hardware description language (HDL) adding high level synthesis(HLS)-like automatic pipelining as a language construct/compiler feature.
https://github.com/JulianKemmerer/PipelineC/wiki
GNU General Public License v3.0

add script to parse and estimate delays of unary and binary operations #143

Closed suarezvictor closed 1 year ago

suarezvictor commented 1 year ago

I wrote a model of FPGA delays. It works by parsing the logs at ./path_delay_cache/vivado/xc7a35ticsg324-1l/syn. For each entry it calculates the timing depending on the type of operation and the bit widths involved, plus the overall RMS error, and dumps the results.

Example on the current database: Count: 339, minimum delay (ns): 1.117, RMS error (ns): 0.41

The function is called like this: estimate_int_timing("XOR", [16, 16])

It returns None if an estimation is not available. NOTE: integer operations only.

TODO: calculate the coefficients automatically from the database, and generate the samples by calling the synthesis project for various kinds of operations and bit sizes.
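To make the idea concrete, here is a minimal sketch of what a per-operation linear delay model like the one described could look like. The coefficient table, its values, and the exact model form (base delay plus a per-bit term, clamped at the observed minimum) are illustrative assumptions, not the actual fitted model from the script.

```python
# Hypothetical per-operation linear delay model, in the spirit of
# estimate_int_timing("XOR", [16, 16]) described above.
MIN_DELAY_NS = 1.117  # minimum delay observed in the database

# (base_ns, per_bit_ns) per operation -- placeholder coefficients,
# assumed for illustration only
COEFFS = {
    "XOR": (1.0, 0.01),
    "PLUS": (1.8, 0.04),
}

def estimate_int_timing(op, widths):
    """Return an estimated path delay in ns, or None if the op is not modeled."""
    if op not in COEFFS:
        return None  # estimation not available
    base, per_bit = COEFFS[op]
    # Scale with the widest operand, never below the observed floor
    return max(MIN_DELAY_NS, base + per_bit * max(widths))
```

Fitting the (base, per_bit) pairs per operation against the cached synthesis results, and reporting the RMS error of the fit, would correspond to the TODO above.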

JulianKemmerer commented 1 year ago

I'm excited to get to this and try out your models

And then I have a plan to get you more information: not only cached combinatorial delays, but also some kind of caching for the fmaxes of pipelined versions of those ops too.

Once you are at fmaxes at or above where those small ops start to need 1 or more pipelining stages, it becomes a reasonable strategy to 'build from the bottom up': asking 'how can I meet fmax=F using my cache of pipelined components?'. That is probably almost required as a strategy to meet the absolute highest fmaxes.

If the above is 'bottom up', then one of the current modes when not --coarse is called 'middle out', where the tool doesn't always start with the lowest level blocks but somewhere 'in the middle' as required. Sometimes this degrades into eventually moving from the middle down to the lowest level, but that costs lots of synthesis runs - bypassing that with caching and modeling sounds great.

Might make for some interesting options like 'give me the fastest possible pipeline you think you can make based on cached results'...
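The 'bottom up' strategy above can be sketched as a simple lookup: given a hypothetical cache mapping (op, width, stages) to an achieved fmax, pick the fewest pipeline stages that meet a target fmax F. The cache layout, names, and numbers here are assumptions for illustration, not PipelineC's actual cache format.

```python
# Hypothetical cache of pipelined-op results: (op, width, stages) -> fmax in MHz
fmax_cache = {
    ("BIN_OP_PLUS", 32, 0): 120.0,
    ("BIN_OP_PLUS", 32, 1): 210.0,
    ("BIN_OP_PLUS", 32, 2): 320.0,
}

def stages_to_meet(op, width, target_mhz):
    """Return the minimum cached stage count meeting target_mhz, or None."""
    candidates = [stages for (o, w, stages), mhz in fmax_cache.items()
                  if o == op and w == width and mhz >= target_mhz]
    return min(candidates) if candidates else None
```

For example, asking for BIN_OP_PLUS at 32 bits with a 150 MHz goal would pick 1 stage from the cache above, while an unreachable goal would return None and fall back to synthesis.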

suarezvictor commented 1 year ago

If you give me samples of the same data that include the number of stages in the pipeline, I can add that variable to the model.

JulianKemmerer commented 1 year ago

What do you think about this branch here as a way to start incorporating your code? https://github.com/JulianKemmerer/PipelineC/compare/devicemodels

I did a pipelining run on examples/tool_tests/vivado.c (just a single FP adder) and the models were used (see "modeled path delay:" in the log) - and it does seem to pipeline :+1: :sunglasses:

██████╗ ██╗██████╗ ███████╗██╗     ██╗███╗   ██╗███████╗ ██████╗
██╔══██╗██║██╔══██╗██╔════╝██║     ██║████╗  ██║██╔════╝██╔════╝
██████╔╝██║██████╔╝█████╗  ██║     ██║██╔██╗ ██║█████╗  ██║     
██╔═══╝ ██║██╔═══╝ ██╔══╝  ██║     ██║██║╚██╗██║██╔══╝  ██║     
██║     ██║██║     ███████╗███████╗██║██║ ╚████║███████╗╚██████╗
╚═╝     ╚═╝╚═╝     ╚══════╝╚══════╝╚═╝╚═╝  ╚═══╝╚══════╝ ╚═════╝

Output directory: /home/julian/pipelinec_output
================== Parsing C Code to Logical Hierarchy ================================
Parsing: /media/1TB/Dropbox/PipelineC/git/PipelineC/main.c
Preprocessing file...
Parsing C syntax...
Parsing non-function definitions...
Parsing derived fsm logic functions...
Doing old-style code generation based on PipelineC supported text patterns...
Parsing function logic...
Vivado: /media/1TB/Programs/Linux/Xilinx/Vivado/2019.2/bin/vivado
Using VIVADO synthesizing for part: xc7a35ticsg324-1l
Parsing function: my_pipeline
Elaborating pipeline hierarchies down to raw HDL logic...
... uint24_negate
... int32_abs
... count0s_uint30
... BIN_OP_PLUS_float_float
... BIN_OP_SR_int31_t_uint8_t
... BIN_OP_SL_uint30_t_uint5_t
Doing obvious logic trimming/collapsing...
Writing generated PipelineC code from elaboration to output directories...
Writing cache of parsed information to file...
================== Writing Resulting Logic to File ================================
Building map of combinatorial logic...
Writing log of floating point module instances: /home/julian/pipelinec_output/float_module_instances.log
Writing log of integer module instances: /home/julian/pipelinec_output/integer_module_instances.log
Writing VHDL files for all functions (as combinatorial logic)...
...
Writing multi main top level files...
Writing the constant struct+enum definitions as defined from C code...
Writing global wire definitions as parsed from C code...
Writing finalized comb. logic synthesis tool files...
Output VHDL files: /home/julian/pipelinec_output/read_vhdl.tcl
================== Adding Timing Information from Synthesis Tool ================================
Synthesizing as combinatorial logic to get total logic delay...

Function: BIN_OP_GT_uint8_t_uint8_t modeled path delay: 2.490 ns
Function: MUX_uint1_t_uint1_t_uint1_t Cached path delay: 1.547 ns
Function: MUX_uint1_t_float_float Cached path delay: 1.547 ns
Function: BIN_OP_EQ_uint8_t_uint1_t modeled path delay: 2.040 ns
Function: MUX_uint1_t_int25_t_int25_t Cached path delay: 1.547 ns
Function: UNARY_OP_NOT_uint25_t modeled path delay: 1.280 ns
Function: BIN_OP_PLUS_uint25_t_uint1_t modeled path delay: 2.920 ns
Function: BIN_OP_MINUS_uint8_t_uint8_t modeled path delay: 2.240 ns
Function: BIN_OP_GT_uint8_t_uint5_t modeled path delay: 2.490 ns
Function: MUX_uint1_t_int31_t_int31_t Cached path delay: 1.547 ns
Function: BIN_OP_PLUS_int31_t_int31_t modeled path delay: 3.160 ns
Function: BIN_OP_MINUS_uint32_t_uint1_t modeled path delay: 3.200 ns
Function: UNARY_OP_NOT_uint32_t modeled path delay: 1.280 ns
Function: BIN_OP_EQ_uint1_t_uint1_t modeled path delay: 1.760 ns
Function: MUX_uint1_t_uint32_t_uint32_t Cached path delay: 1.547 ns
Function: MUX_uint1_t_uint8_t_uint8_t Cached path delay: 1.547 ns
Function: MUX_uint1_t_uint23_t_uint23_t Cached path delay: 1.547 ns
Function: BIN_OP_PLUS_uint8_t_uint1_t modeled path delay: 2.240 ns
Function: BIN_OP_EQ_uint31_t_uint1_t modeled path delay: 2.960 ns
Function: BIN_OP_EQ_uint30_t_uint1_t modeled path delay: 2.920 ns
Function: BIN_OP_EQ_uint2_t_uint1_t modeled path delay: 1.800 ns
Function: BIN_OP_EQ_uint3_t_uint1_t modeled path delay: 1.840 ns
Function: BIN_OP_EQ_uint4_t_uint1_t modeled path delay: 1.880 ns
Function: BIN_OP_EQ_uint5_t_uint1_t modeled path delay: 1.920 ns
Function: BIN_OP_EQ_uint6_t_uint1_t modeled path delay: 1.960 ns
Function: BIN_OP_EQ_uint7_t_uint1_t modeled path delay: 2.000 ns
Function: BIN_OP_EQ_uint9_t_uint1_t modeled path delay: 2.080 ns
Function: BIN_OP_EQ_uint10_t_uint1_t modeled path delay: 2.120 ns
Function: BIN_OP_EQ_uint11_t_uint1_t modeled path delay: 2.160 ns
Function: BIN_OP_EQ_uint12_t_uint1_t modeled path delay: 2.200 ns
Function: BIN_OP_EQ_uint13_t_uint1_t modeled path delay: 2.240 ns
Function: BIN_OP_EQ_uint14_t_uint1_t modeled path delay: 2.280 ns
Function: BIN_OP_EQ_uint15_t_uint1_t modeled path delay: 2.320 ns
Function: BIN_OP_EQ_uint16_t_uint1_t modeled path delay: 2.360 ns
Function: BIN_OP_EQ_uint17_t_uint1_t modeled path delay: 2.400 ns
Function: BIN_OP_EQ_uint18_t_uint1_t modeled path delay: 2.440 ns
Function: BIN_OP_EQ_uint19_t_uint1_t modeled path delay: 2.480 ns
Function: BIN_OP_EQ_uint20_t_uint1_t modeled path delay: 2.520 ns
Function: BIN_OP_EQ_uint21_t_uint1_t modeled path delay: 2.560 ns
Function: BIN_OP_EQ_uint22_t_uint1_t modeled path delay: 2.600 ns
Function: BIN_OP_EQ_uint23_t_uint1_t modeled path delay: 2.640 ns
Function: BIN_OP_EQ_uint24_t_uint1_t modeled path delay: 2.680 ns
Function: BIN_OP_EQ_uint25_t_uint1_t modeled path delay: 2.720 ns
Function: BIN_OP_EQ_uint26_t_uint1_t modeled path delay: 2.760 ns
Function: BIN_OP_EQ_uint27_t_uint1_t modeled path delay: 2.800 ns
Function: BIN_OP_EQ_uint28_t_uint1_t modeled path delay: 2.840 ns
Function: BIN_OP_EQ_uint29_t_uint1_t modeled path delay: 2.880 ns
Function: MUX_uint1_t_uint2_t_uint2_t Cached path delay: 1.547 ns
Function: MUX_uint1_t_uint3_t_uint3_t Cached path delay: 1.547 ns
Function: MUX_uint1_t_uint4_t_uint4_t Cached path delay: 1.547 ns
Function: MUX_uint1_t_uint5_t_uint5_t Cached path delay: 1.547 ns
Function: BIN_OP_OR_uint1_t_uint2_t modeled path delay: 1.280 ns
Function: BIN_OP_OR_uint2_t_uint3_t modeled path delay: 1.280 ns
Function: BIN_OP_OR_uint3_t_uint3_t modeled path delay: 1.280 ns
Function: BIN_OP_OR_uint3_t_uint4_t modeled path delay: 1.280 ns
Function: BIN_OP_OR_uint4_t_uint4_t modeled path delay: 1.280 ns
Function: BIN_OP_OR_uint4_t_uint5_t modeled path delay: 1.280 ns
Function: BIN_OP_OR_uint5_t_uint5_t modeled path delay: 1.280 ns
Function: BIN_OP_MINUS_uint8_t_uint5_t modeled path delay: 2.240 ns
Function: MUX_uint1_t_uint30_t_uint30_t Cached path delay: 1.547 ns
Function: uint24_negate Cached path delay: 2.821 ns
Function: int32_abs Cached path delay: 3.564 ns
Function: count0s_uint30 Cached path delay: 4.660 ns
Function: BIN_OP_SR_int31_t_uint8_t modeled path delay: 3.420 ns
Function: BIN_OP_SL_uint30_t_uint5_t modeled path delay: 3.420 ns
Function: BIN_OP_PLUS_float_float Cached path delay: 19.702 ns
Synthesizing function: my_pipeline
...Waiting on synthesis for: my_pipeline
Running: /home/julian/pipelinec_output/examples/pipeline.c/my_pipeline/vivado_0CLK_8e7e7557.log
my_pipeline Path delay: 19.702 ns (50.756 MHz)

Function 165/165, elapsed time 0:01:56.199407...
Updating modules instances log to list longest delay, most used modules only...
================== Beginning Throughput Sweep ================================
Function: my_pipeline Target MHz: 150.0
Setting all instances to comb. logic to start...
Starting with blank sweep state...
Starting middle out sweep...
Starting from zero clk timing params...
Collecting modules to pipeline...
Pipelining modules...
...
my_pipeline Clock Goal: 150.00 (MHz) Current: 194.93 (MHz)(5.13 ns) 6 clks
Met timing...
================== Writing Results of Throughput Sweep ================================
Output VHDL files: /home/julian/pipelinec_output/read_vhdl.tcl
Done.

What do you think about me getting this code onto the main branch soon? For now I'd want to switch the if-else logic to prefer the cached delays and only use the model when the op is not in the cache - we can follow up with more testing of the models to see if they make a better default.
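The cache-first lookup order proposed above amounts to a small fallback chain: use the cached synthesis delay if present, fall back to the model on a cache miss, and only synthesize when neither applies. This is a minimal sketch; the function names and cache key format are placeholders, not the tool's actual API.

```python
def get_path_delay_ns(op, widths, cache, model):
    """Resolve a path delay: cache first, then model, else flag for synthesis."""
    cached = cache.get((op, tuple(widths)))
    if cached is not None:
        return cached, "cached"
    modeled = model(op, widths)
    if modeled is not None:
        return modeled, "modeled"
    return None, "needs synthesis"
```

Flipping the first two branches gives the model-priority behavior tested in the log above.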

suarezvictor commented 1 year ago

This is lovely! I should model the MUX too. Can you try what happens if the model has higher priority than the cache? Did it work? Was there any processing time gain? Please comment your view ;)

JulianKemmerer commented 1 year ago

That is what the above shows - the model with priority over the cache :) (notice only a few things are cached in the log) And it works :+1:

suarezvictor commented 1 year ago

Excellent! Can you delete the cache and compare the processing time with and without the model?

JulianKemmerer commented 1 year ago

Processing time gains will happen when synthesizing something that isn't in the cache (i.e. when you save time by not running synthesis). So we would need to test something not as familiar as FP adds and the like.

Yeah, or deleting the cache would be one way to test that as well... and I'll get to it as I have time :+1:

JulianKemmerer commented 1 year ago

No cache or models, all ops synthesized: ~35 minutes
With fast cache/models: ~6 minutes

Certainly a lot faster than needing to synthesize every op :+1: , looks great

I am going to merge this soon - and can proceed from there

JulianKemmerer commented 1 year ago

Merged as part of https://github.com/JulianKemmerer/PipelineC/pull/144

suarezvictor commented 1 year ago

How cool!! It took me half a day to code it :-)

suarezvictor commented 1 year ago

How should we continue with these outstanding results? Testing it with a larger project like the raytracer? Implementing a model of simple arithmetic/logical operations but with a multistage pipeline? Making the database larger by trying more bit width combinations? Analyzing other FPGA devices?