Static Promotion Dramatically Increases Register Usage

calebmkim commented 1 year ago

The main reason is because we aggressively promote control to static... for example, this is what the control program might look like post static-promotion.

control {
  seq {
    static<3> seq  {...}
    while comb_reg.out {
      seq {
        static<2> seq  {...}
        while comb_reg0.out {
          seq {
            static<4> seq  {...}
            while comb_reg1.out {
              static<10> seq  {...}
            }
            static<2> seq  {...}
            b0_17; // group that performs signed division 
            static<3> seq  {...}
          }
        }
      }
    }
    static<2> seq  {...}
  }
}

As you can see, there are a bunch of "static islands". For each static island, we will instantiate two registers: one fsm and a one-bit signal_reg (this is the register that checks whether we have seen zero before, so that the wrapper can be done when the fsm reaches 0 for a second time).

Possible Solutions

My immediate thought was that we could try to run the cell sharing pass after we have compiled the "static islands" but before we have compiled tdcc, so that we can share all of the fsm and signal_reg of all the different "static islands".

The problem is that there is a continuous assignment that uses both fsm and signal_reg when we compile, making it impossible to share.

As a reminder, here is how we create a static_wrapper_group for a static_group with a static_group_fsm.

static_wrapper_group {
  ... 
  // signal_reg is high if static_group_fsm has reached 0 before
  signal_reg.in = %0 & static_group_fsm == 0 ? 1
  // static_wrapper_group is done if static_group_fsm has reached 0 for the second time
  static_wrapper_group[done] = signal_reg.out == 1 & static_group_fsm == 0 ? 1
}
// This continuous assignment resets signal_reg back to 0 during the cycle that static_wrapper_group's done condition is triggered, so we can execute static_wrapper_group again if necessary
signal_reg.in = signal_reg.out == 1 & static_group_fsm == 0 ? 0

(maybe a bad idea) Could we try to change the way we compile the static_wrapper_group so that it will be possible to share the fsms and signal_reg of all the "static islands"?

rachitnigam commented 1 year ago

Ah, that's tricky! I think we should look at some combined latency, area graphs first before making a decision: standardize the numbers using the baseline, non-static case and put them into one chart.

The worst benchmark is using ~60% more. If it is the case that the latency is dramatically better, then we're in a tricky spot because it represents an area latency trade-off. On the other hand, if the area is worse and the latency isn't that much better, we should start thinking about heuristics to disable static promotion in specific cases.

rachitnigam commented 1 year ago

Also, does this get better with repeat blocks?

sampsyo commented 1 year ago

That's super interesting. While this is only a 1/4-formed thought, it seems relevant that this case exhibits a pretty specific kind dynamic/static intermixing: all the loops are dynamic, and everything else is static. (Except for the division—is that supposed to be dynamic, or have we just not given it a static implementation yet?)

It probably amounts to recapitulating your existing proposals, @calebmkim (i.e., can we share the control registers generated for each static island?), but it seems worth thinking about what the control would look like if you were to design it by hand for such a loop nest.

One option might be that you'd use a single FSM/signal register pair, shared among all the static stages (this is probably what generic sharing would do). Another could be that, for a mixed static/dynamic seq like the innermost one in your code, further simplified thus:

seq {
  static<4> a;
  b;
  static<2> c;
  d;
  static<3> e;
}

…we somehow try to use only a single FSM register to track status through the whole thing. That register would hold a value ranging from 0 to 4+1+2+1+3=10, i.e., it would have one state for every cycle of every static group and one state for each dynamic group. The control logic would roughly say:

if the state indicates we are in a static group:
  state++
else:  # the state indicates we are in a dynamic group
  if that group is done:
    state++

Anyway, that by itself may be a terrible idea, but maybe thinking about what an ideal set of control logic would look like here could help reveal a way forward.

calebmkim commented 1 year ago

Also, does this get better with repeat blocks?

From a quick look at the resource estimation backend, it seems like it probably does. (although as this notebook entry explains, we can't compile for loops using repeats for every polybench benchmark). Although, we should run it through the full Vivado Synthesis stuff to check exactly how much it improves the design.

One thing that I'm noticing is preventing more static promotion is that combinational reads of std_mem are preventing group2seq from working (since std_mem is not a combinational cell, when we write to its address ports and then combinationally read its value, it messes up group2seq, since group2seq can't tell that the std_mem is "acting combinational" in this case). Another reason to deprecate std_mem I suppose.

To summarize, I think there are probably two things to do: 1) Try to think of a way to compile static islands such that only one FSM is used. @sampsyo gave some possible starting points on how to tackle this problem.
2) Once we try 1), we look at the area/latency trade-off at that stage and think about changing our heuristics to save area.

rachitnigam commented 1 year ago

I think one reasonable heuristic is to only promote the innermost computation in a nest of many dynamically scheduled loops. This means that the computation that runs the most often gets staticified

sampsyo commented 1 year ago

Another reason to deprecate std_mem I suppose.

Hard to disagree with this!

calebmkim commented 1 year ago

Closed by https://github.com/cucapra/calyx/pull/1632

cucapra / calyx-resource-eval

Static Promotion Dramatically Increases Register Usage #4

Possible Solutions