calyxir / calyx

Intermediate Language (IL) for Hardware Accelerator Generators
https://calyxir.org
MIT License

Feedback Directed Optimization #1834

Open rachitnigam opened 6 months ago

rachitnigam commented 6 months ago

Feedback- or profile-directed optimizations (FDO) generally work by compiling a program once, profiling it at runtime, and then using the profiled information to optimize the program for the particular workload.

In hardware land, XLS uses an FDO approach: it takes a high-level program and conservatively pipelines it, runs it through the synthesis flow, collects information about slack on various paths, and re-pipelines the design to take advantage of the real results from synthesis.
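As a toy illustration of that loop (this is not XLS's actual algorithm; `pack_stages` and all the delay numbers below are hypothetical), imagine a linear chain of operations that is first pipelined using conservative delay estimates and then re-pipelined once synthesis reports the real delays:

```python
def pack_stages(delays_ns, period_ns):
    """Greedily pack a linear chain of op delays into pipeline stages.

    Each stage holds consecutive ops whose summed delay fits in one clock
    period. Returns a list of stages, each a list of op indices.
    """
    stages, current, used = [], [], 0.0
    for i, delay in enumerate(delays_ns):
        if delay > period_ns:
            raise ValueError(f"op {i} alone exceeds the clock period")
        # Start a new stage when this op would overflow the current one.
        if current and used + delay > period_ns:
            stages.append(current)
            current, used = [], 0.0
        current.append(i)
        used += delay
    if current:
        stages.append(current)
    return stages


PERIOD_NS = 5.0
# Pass 1: conservative (made-up) estimates, before any synthesis feedback.
estimated = [2.0, 2.0, 2.0, 2.0]
# Pass 2: (made-up) delays as they might come back from timing reports.
measured = [1.2, 0.9, 1.1, 1.0]

first_pass = pack_stages(estimated, PERIOD_NS)
second_pass = pack_stages(measured, PERIOD_NS)
print(len(first_pass), "stages before feedback;", len(second_pass), "after")
```

The only point of the sketch is the shape of the loop: the same scheduling routine runs twice, and the second run spends the slack the first run left on the table.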

I think we have a big opportunity to do this well with Calyx. Specifically, we can collect information about which sub-circuits cause congestion or sit on the critical path of the design, and use it to tune the various knobs exposed by the compiler. The big novelty win would be optimizing the control circuitry; most HLS compilers are incapable of optimizing the FSMs they generate to schedule the program. Calyx, on the other hand, makes it super easy: our FSM generation is compositional for dynamic programs, and we can easily generate more or fewer FSMs without affecting correctness.

Here are a couple of other knobs I think are worth thinking about:

The victory condition for this is being able to take a large set of benchmarks and transparently improve their resource usage and frequency characteristics. We should also take inspiration from projects like AutoBridge, which reason about problems with control signal generation in multi-die FPGA designs.

rachitnigam commented 6 months ago

CC @andrewb1999 @sampsyo @EclecticGriffin

sampsyo commented 6 months ago

This is a super exciting direction! It's obviously part of a much longer discussion, but just to think through a few high-level discussion points:

rachitnigam commented 3 months ago

Note from looking at timing reports:

Max Delay Paths
--------------------------------------------------------------------------------------
Slack (MET) :             2.507ns  (required time - arrival time)
  Source:                 W_32/G_3970/c0/state0/out_reg[0]/C
                            (rising edge-triggered cell FDRE clocked by clk  {rise@0.000ns fall@2.500ns period=5.000ns})
  Destination:            W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/DSP_A_B_DATA_INST/A[5]
                            (rising edge-triggered cell DSP_A_B_DATA clocked by clk  {rise@0.000ns fall@2.500ns period=5.000ns})
  Path Group:             clk
  Path Type:              Setup (Max at Slow Process Corner)
  Requirement:            5.000ns  (clk rise@5.000ns - clk rise@0.000ns)
  Data Path Delay:        2.160ns  (logic 0.611ns (28.287%)  route 1.549ns (71.713%))
  Logic Levels:           4  (LUT3=1 LUT6=3)
  Clock Path Skew:        0.009ns (DCD - SCD + CPR)
    Destination Clock Delay (DCD):    0.044ns = ( 5.044 - 5.000 ) 
    Source Clock Delay      (SCD):    0.035ns
    Clock Pessimism Removal (CPR):    0.000ns
  Clock Uncertainty:      0.035ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter     (TSJ):    0.071ns
    Total Input Jitter      (TIJ):    0.000ns
    Discrete Jitter          (DJ):    0.000ns
    Phase Error              (PE):    0.000ns

    Location             Delay type                Incr(ns)  Path(ns)    Netlist Resource(s)
  -------------------------------------------------------------------    -------------------
                         (clock clk rise edge)        0.000     0.000 r  
                                                      0.000     0.000 r  clk (IN)
                         net (fo=412, unset)          0.035     0.035    W_32/G_3970/c0/state0/clk
    SLICE_X38Y125        FDRE                                         r  W_32/G_3970/c0/state0/out_reg[0]/C
  -------------------------------------------------------------------    -------------------
    SLICE_X38Y125        FDRE (Prop_DFF_SLICEL_C_Q)
                                                      0.096     0.131 r  W_32/G_3970/c0/state0/out_reg[0]/Q
                         net (fo=21, routed)          0.467     0.598    W_32/SER_3/G_3650/c0/state0/mul_uint8_i_67_0[0]
    SLICE_X39Y124        LUT6 (Prop_E6LUT_SLICEM_I1_O)
                                                      0.101     0.699 r  W_32/SER_3/G_3650/c0/state0/mul_uint8_i_77/O
                         net (fo=2, routed)           0.145     0.844    W_32/SER_3/G_3650/c0/state0/mul_uint8_i_77_n_0
    SLICE_X39Y124        LUT3 (Prop_H5LUT_SLICEM_I0_O)
                                                      0.198     1.042 r  W_32/SER_3/G_3650/c0/state0/mul_uint8_i_69/O
                         net (fo=8, routed)           0.465     1.507    W_32/SER_3/G_3650/c0/state0/mul_uint8_i_69_n_0
    SLICE_X41Y123        LUT6 (Prop_F6LUT_SLICEL_I0_O)
                                                      0.177     1.684 r  W_32/SER_3/G_3650/c0/state0/mul_uint8_i_26/O
                         net (fo=1, routed)           0.223     1.907    W_32/G_3970/c0/state0/prev_reg[5]_1
    SLICE_X40Y122        LUT6 (Prop_B6LUT_SLICEM_I5_O)
                                                      0.039     1.946 r  W_32/G_3970/c0/state0/mul_uint8_i_3/O
                         net (fo=2, routed)           0.249     2.195    W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/A[5]
    DSP48E2_X3Y50        DSP_A_B_DATA                                 r  W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/DSP_A_B_DATA_INST/A[5]
  -------------------------------------------------------------------    -------------------

                         (clock clk rise edge)        5.000     5.000 r  
                                                      0.000     5.000 r  clk (IN)
                         net (fo=412, unset)          0.044     5.044    W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/CLK
    DSP48E2_X3Y50        DSP_A_B_DATA                                 r  W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/DSP_A_B_DATA_INST/CLK
                         clock pessimism              0.000     5.044    
                         clock uncertainty           -0.035     5.009    
    DSP48E2_X3Y50        DSP_A_B_DATA (Setup_DSP_A_B_DATA_DSP48E2_CLK_A[5])
                                                     -0.307     4.702    W_32/C_2/C_2/PC_2/CONV_15/M9_13/MUL_3/mul_uint8/U0/i_mult/gDSP.gDSP_only.iDSP/inferred_dsp.use_p_reg.p_reg_reg/DSP_A_B_DATA_INST
  -------------------------------------------------------------------
                         required time                          4.702    
                         arrival time                          -2.195    
  -------------------------------------------------------------------
                         slack                                  2.507    

The fo=21 annotation here says that the fanout of W_32/G_3970/c0/state0/out_reg[0]/Q is 21. Again, this information should be extractable from a parser pretty easily. More information, like the split between logic delay and routing delay within the data path delay, can be understood by reading this ChatGPT transcript.
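A minimal sketch of such a parser, assuming the report layout matches the excerpt above (the regexes, the `parse_path` name, and the dictionary keys are my own and have not been checked against other Vivado versions or report options):

```python
import re

# "Slack (MET) :             2.507ns  (required time - arrival time)"
SLACK_RE = re.compile(r"Slack \((\w+)\)\s*:\s*(-?[\d.]+)ns")
# "Data Path Delay:  2.160ns  (logic 0.611ns (28.287%)  route 1.549ns (71.713%))"
DELAY_RE = re.compile(
    r"Data Path Delay:\s*([\d.]+)ns\s*\(logic ([\d.]+)ns.*?route ([\d.]+)ns"
)
# "net (fo=21, routed)          0.467     0.598    W_32/..."
NET_RE = re.compile(r"net \(fo=(\d+), (\w+)\)\s+([\d.]+)\s+([\d.]+)\s+(\S+)")


def parse_path(report: str) -> dict:
    """Extract slack, the logic/route split, and per-net fanout for one path."""
    path = {"nets": []}
    for line in report.splitlines():
        if m := SLACK_RE.search(line):
            path["slack_status"] = m.group(1)
            path["slack_ns"] = float(m.group(2))
        elif m := DELAY_RE.search(line):
            total, logic, route = map(float, m.groups())
            path.update(data_path_ns=total, logic_ns=logic, route_ns=route)
        elif m := NET_RE.search(line):
            path["nets"].append({
                "fanout": int(m.group(1)),
                "kind": m.group(2),  # "routed" for data nets, "unset" for clk here
                "incr_ns": float(m.group(3)),
                "name": m.group(5),
            })
    return path


# A few lines copied from the report above.
SAMPLE = """\
Slack (MET) :             2.507ns  (required time - arrival time)
  Data Path Delay:        2.160ns  (logic 0.611ns (28.287%)  route 1.549ns (71.713%))
                         net (fo=412, unset)          0.035     0.035    W_32/G_3970/c0/state0/clk
                         net (fo=21, routed)          0.467     0.598    W_32/SER_3/G_3650/c0/state0/mul_uint8_i_67_0[0]
                         net (fo=2, routed)           0.145     0.844    W_32/SER_3/G_3650/c0/state0/mul_uint8_i_77_n_0
"""

path = parse_path(SAMPLE)
routed = [n for n in path["nets"] if n["kind"] == "routed"]
worst = max(routed, key=lambda n: n["fanout"])
route_fraction = path["route_ns"] / path["data_path_ns"]
print(worst["name"], worst["fanout"], f"{route_fraction:.1%}")
```

Filtering on `kind == "routed"` keeps the clock net (fo=412) out of the fanout ranking. On this path roughly 72% of the data path delay is routing, which is the kind of signal the compiler could feed back into its decisions.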