Open rachitnigam opened 11 months ago
CC @andrewb1999 @sampsyo @EclecticGriffin
This is a super exciting direction! It's obviously part of a much longer discussion, but just to think through a few high-level discussion points:
Note from looking at timing reports:
Max Delay Paths
--------------------------------------------------------------------------------------
Slack (MET) : 2.507ns (required time - arrival time)
Source: W_32/G_3970/c0/state0/out_reg[0]/C
(rising edge-triggered cell FDRE clocked by clk {rise@0.000ns fall@2.500ns period=5.000ns})
Destination: W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/DSP_A_B_DATA_INST/A[5]
(rising edge-triggered cell DSP_A_B_DATA clocked by clk {rise@0.000ns fall@2.500ns period=5.000ns})
Path Group: clk
Path Type: Setup (Max at Slow Process Corner)
Requirement: 5.000ns (clk rise@5.000ns - clk rise@0.000ns)
Data Path Delay: 2.160ns (logic 0.611ns (28.287%) route 1.549ns (71.713%))
Logic Levels: 4 (LUT3=1 LUT6=3)
Clock Path Skew: 0.009ns (DCD - SCD + CPR)
Destination Clock Delay (DCD): 0.044ns = ( 5.044 - 5.000 )
Source Clock Delay (SCD): 0.035ns
Clock Pessimism Removal (CPR): 0.000ns
Clock Uncertainty: 0.035ns ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
Total System Jitter (TSJ): 0.071ns
Total Input Jitter (TIJ): 0.000ns
Discrete Jitter (DJ): 0.000ns
Phase Error (PE): 0.000ns
Location Delay type Incr(ns) Path(ns) Netlist Resource(s)
------------------------------------------------------------------- -------------------
(clock clk rise edge) 0.000 0.000 r
0.000 0.000 r clk (IN)
net (fo=412, unset) 0.035 0.035 W_32/G_3970/c0/state0/clk
SLICE_X38Y125 FDRE r W_32/G_3970/c0/state0/out_reg[0]/C
------------------------------------------------------------------- -------------------
SLICE_X38Y125 FDRE (Prop_DFF_SLICEL_C_Q)
0.096 0.131 r W_32/G_3970/c0/state0/out_reg[0]/Q
net (fo=21, routed) 0.467 0.598 W_32/SER_3/G_3650/c0/state0/mul_uint8_i_67_0[0]
SLICE_X39Y124 LUT6 (Prop_E6LUT_SLICEM_I1_O)
0.101 0.699 r W_32/SER_3/G_3650/c0/state0/mul_uint8_i_77/O
net (fo=2, routed) 0.145 0.844 W_32/SER_3/G_3650/c0/state0/mul_uint8_i_77_n_0
SLICE_X39Y124 LUT3 (Prop_H5LUT_SLICEM_I0_O)
0.198 1.042 r W_32/SER_3/G_3650/c0/state0/mul_uint8_i_69/O
net (fo=8, routed) 0.465 1.507 W_32/SER_3/G_3650/c0/state0/mul_uint8_i_69_n_0
SLICE_X41Y123 LUT6 (Prop_F6LUT_SLICEL_I0_O)
0.177 1.684 r W_32/SER_3/G_3650/c0/state0/mul_uint8_i_26/O
net (fo=1, routed) 0.223 1.907 W_32/G_3970/c0/state0/prev_reg[5]_1
SLICE_X40Y122 LUT6 (Prop_B6LUT_SLICEM_I5_O)
0.039 1.946 r W_32/G_3970/c0/state0/mul_uint8_i_3/O
net (fo=2, routed) 0.249 2.195 W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/A[5]
DSP48E2_X3Y50 DSP_A_B_DATA r W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/DSP_A_B_DATA_INST/A[5]
------------------------------------------------------------------- -------------------
(clock clk rise edge) 5.000 5.000 r
0.000 5.000 r clk (IN)
net (fo=412, unset) 0.044 5.044 W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/CLK
DSP48E2_X3Y50 DSP_A_B_DATA r W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/DSP_A_B_DATA_INST/CLK
clock pessimism 0.000 5.044
clock uncertainty -0.035 5.009
DSP48E2_X3Y50 DSP_A_B_DATA (Setup_DSP_A_B_DATA_DSP48E2_CLK_A[5])
-0.307 4.702 W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/DSP_A_B_DATA_INST
-------------------------------------------------------------------
required time 4.702
arrival time -2.195
-------------------------------------------------------------------
slack 2.507
The `fo=21` number here specifies that the fanout of `W_32/G_3970/c0/state0/out_reg[0]/Q` is 21. Again, this information should be easy to extract with a parser. More information, like the split between logic delay and routing delay, can be understood by reading this ChatGPT transcript.
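As a rough illustration of how little parsing this takes, here is a minimal sketch in Python. It assumes the report rows look like the ones pasted above (whitespace-separated columns); `parse_nets`, `parse_delay_split`, and the `Net` record are hypothetical names made up for this example, not part of Vivado or Calyx.

```python
import re
from dataclasses import dataclass

# Matches net rows such as: "net (fo=21, routed) 0.467 0.598 W_32/..."
NET_RE = re.compile(
    r"net \(fo=(?P<fanout>\d+),\s*(?P<kind>\w+)\)\s+"
    r"(?P<incr>-?\d+\.\d+)\s+(?P<path>-?\d+\.\d+)\s+(?P<name>\S+)"
)
# Matches: "Data Path Delay: 2.160ns (logic 0.611ns (28.287%) route 1.549ns (71.713%))"
DELAY_RE = re.compile(
    r"Data Path Delay:\s+(?P<total>[\d.]+)ns\s+"
    r"\(logic (?P<logic>[\d.]+)ns.*?route (?P<route>[\d.]+)ns"
)


@dataclass
class Net:
    fanout: int     # the fo=N value
    kind: str       # "routed", "unset", ...
    incr_ns: float  # delay contributed by this net
    name: str       # netlist resource name


def parse_nets(report: str) -> list[Net]:
    """Collect every net row in the report, in order of appearance."""
    return [
        Net(int(m["fanout"]), m["kind"], float(m["incr"]), m["name"])
        for m in NET_RE.finditer(report)
    ]


def parse_delay_split(report: str):
    """Return total/logic/route data path delay in ns, or None if absent."""
    m = DELAY_RE.search(report)
    if m is None:
        return None
    return {key: float(m[key]) for key in ("total", "logic", "route")}


# Rows copied from the report above (data path only, cell rows omitted).
SAMPLE = """\
Data Path Delay: 2.160ns (logic 0.611ns (28.287%) route 1.549ns (71.713%))
net (fo=412, unset) 0.035 0.035 W_32/G_3970/c0/state0/clk
net (fo=21, routed) 0.467 0.598 W_32/SER_3/G_3650/c0/state0/mul_uint8_i_67_0[0]
net (fo=2, routed) 0.145 0.844 W_32/SER_3/G_3650/c0/state0/mul_uint8_i_77_n_0
net (fo=8, routed) 0.465 1.507 W_32/SER_3/G_3650/c0/state0/mul_uint8_i_69_n_0
net (fo=1, routed) 0.223 1.907 W_32/G_3970/c0/state0/prev_reg[5]_1
net (fo=2, routed) 0.249 2.195 W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/A[5]
"""

nets = parse_nets(SAMPLE)
routed = [n for n in nets if n.kind == "routed"]
worst = max(routed, key=lambda n: n.fanout)
print(worst.fanout, worst.name)
print(parse_delay_split(SAMPLE))
```

On this path the routed net increments add up to the reported route delay, and the highest-fanout routed net is also the slowest one, which is the kind of signal a feedback pass could act on.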
Feedback- or profile-directed optimizations (FDO) generally work by compiling a program once, profiling it at runtime, and then using the profiled information to optimize the program for the particular workload.
In hardware land, XLS uses an FDO approach: it takes a high-level program and conservatively pipelines it, runs it through the synthesis flow, collects information about slack on various paths, and re-pipelines the design to take advantage of the real results from synthesis.
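To make the loop concrete, here is a toy sketch of the estimate, synthesize, re-pipeline cycle. It is not XLS's actual algorithm or API: the greedy `pipeline` packer, the `fake_synthesis` stub, and all delay numbers are assumptions invented for illustration.

```python
def pipeline(delays_ns, period_ns):
    """Greedily pack a chain of ops into stages that each fit in one clock period."""
    stages, current, used = [], [], 0.0
    for op, delay in delays_ns.items():
        if delay > period_ns:
            raise ValueError(f"{op} alone exceeds the clock period")
        if current and used + delay > period_ns:
            # Adding this op would violate timing: close the stage (insert a register).
            stages.append(current)
            current, used = [], 0.0
        current.append(op)
        used += delay
    if current:
        stages.append(current)
    return stages


def feedback_directed_pipeline(estimates_ns, synthesize, period_ns):
    """Pipeline with pessimistic estimates, then redo it with measured delays."""
    conservative = pipeline(estimates_ns, period_ns)
    measured_ns = synthesize(conservative)  # stands in for a real synthesis run
    return conservative, pipeline(measured_ns, period_ns)


# Hypothetical numbers: pessimistic pre-synthesis estimates per operation...
ESTIMATES = {"mul": 3.0, "add": 2.5, "shift": 2.0, "mux": 1.5}
# ...and the delays a (pretend) synthesis run reports back.
MEASURED = {"mul": 2.4, "add": 1.4, "shift": 0.9, "mux": 0.7}


def fake_synthesis(stages):
    """Stub: a real flow would run the tools and parse the timing report."""
    return MEASURED


before, after = feedback_directed_pipeline(ESTIMATES, fake_synthesis, period_ns=5.0)
print(len(before), before)
print(len(after), after)
```

With these made-up numbers the conservative schedule needs three stages and the feedback-directed one needs two, because the measured delays leave enough slack to merge stages.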
I think we have a big opportunity to do this well with Calyx. Specifically, we can collect information about which sub-circuits cause congestion or sit on the critical path of the design, and use it to tune the various knobs exposed by the compiler. The big novelty win would be optimizing the control circuitry; most HLS compilers are incapable of optimizing the FSMs they generate to schedule the program. Calyx, on the other hand, makes it super easy: our FSM generation is compositional for dynamic programs, and we can easily generate more or fewer FSMs without affecting correctness.
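One way to find out which sub-circuits sit on the critical path is to attribute each delay increment on a reported path to a prefix of its hierarchical netlist name. The sketch below does that for the data path in the timing report quoted in this issue. `blame_by_prefix` and the choice of `depth=3` are hypothetical, and mapping these prefixes back to Calyx groups or components is an assumption, not something the compiler provides today.

```python
from collections import defaultdict


def blame_by_prefix(segments, depth):
    """Sum delay increments per hierarchy prefix and rank them, worst first."""
    totals = defaultdict(float)
    for name, incr_ns in segments:
        prefix = "/".join(name.split("/")[:depth])
        totals[prefix] += incr_ns
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# (netlist resource, incremental delay in ns) for each cell and net on the
# data path of the report above; the long DSP name is shortened to its prefix.
SEGMENTS = [
    ("W_32/G_3970/c0/state0/out_reg[0]/Q", 0.096),
    ("W_32/SER_3/G_3650/c0/state0/mul_uint8_i_67_0[0]", 0.467),
    ("W_32/SER_3/G_3650/c0/state0/mul_uint8_i_77/O", 0.101),
    ("W_32/SER_3/G_3650/c0/state0/mul_uint8_i_77_n_0", 0.145),
    ("W_32/SER_3/G_3650/c0/state0/mul_uint8_i_69/O", 0.198),
    ("W_32/SER_3/G_3650/c0/state0/mul_uint8_i_69_n_0", 0.465),
    ("W_32/SER_3/G_3650/c0/state0/mul_uint8_i_26/O", 0.177),
    ("W_32/G_3970/c0/state0/prev_reg[5]_1", 0.223),
    ("W_32/G_3970/c0/state0/mul_uint8_i_3/O", 0.039),
    ("W_32/C_2/C_2/PC_2", 0.249),
]

ranked = blame_by_prefix(SEGMENTS, depth=3)
for prefix, delay_ns in ranked:
    print(f"{prefix}: {delay_ns:.3f} ns")
```

Aggregated over many paths rather than one, a ranking like this could tell the compiler which FSMs or groups are worth restructuring first.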
Here are a couple of other knobs I think are worth thinking about:
`go` or `fsm`, and it is not obvious how to optimize them.
The victory condition for this is being able to take a large set of benchmarks and transparently improve their resource usage and frequency characteristics. We should also take inspiration from projects like AutoBridge, which reason about problems with control signal generation in multi-die FPGA designs.