Closed calebmkim closed 1 year ago
Ah, that's tricky! I think we should look at some combined latency, area graphs first before making a decision: standardize the numbers using the baseline, non-static case and put them into one chart.
The worst benchmark is using ~60% more. If it is the case that the latency is dramatically better, then we're in a tricky spot because it represents an area-latency trade-off. On the other hand, if the area is worse and the latency isn't that much better, we should start thinking about heuristics to disable static promotion in specific cases.
Also, does this get better with repeat blocks?
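The normalization suggested above could be sketched roughly like this in Python. Everything here is an assumption for illustration: the `Result` type, the `verdict` thresholds, and especially the numbers, which are made-up placeholders and not measurements from any benchmark.

```python
from dataclasses import dataclass


@dataclass
class Result:
    latency: float  # cycles (placeholder unit)
    area: float     # e.g. LUT count (placeholder unit)


def normalize(baseline: Result, promoted: Result) -> tuple[float, float]:
    """Return (latency_ratio, area_ratio) relative to the non-static baseline."""
    return promoted.latency / baseline.latency, promoted.area / baseline.area


def verdict(latency_ratio: float, area_ratio: float, tolerance: float = 0.05) -> str:
    """Hypothetical classification; the 5% tolerance is an arbitrary choice."""
    if area_ratio <= 1 + tolerance:
        return "no area regression"
    if latency_ratio >= 1 - tolerance:
        # Area got worse and latency barely improved.
        return "candidate for disabling promotion"
    return "area-latency trade-off"


# Made-up numbers, for illustration only.
baseline = Result(latency=1000, area=500)
promoted = Result(latency=600, area=800)
lat, area = normalize(baseline, promoted)
print(lat, area, verdict(lat, area))
```

Plotting both ratios per benchmark on one chart would then make the two cases (trade-off vs. pure regression) easy to tell apart.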
That's super interesting. While this is only a 1/4-formed thought, it seems relevant that this case exhibits a pretty specific kind of dynamic/static intermixing: all the loops are dynamic, and everything else is static. (Except for the division: is that supposed to be dynamic, or have we just not given it a static implementation yet?)
It probably amounts to recapitulating your existing proposals, @calebmkim (i.e., can we share the control registers generated for each static island?), but it seems worth thinking about what the control would look like if you were to design it by hand for such a loop nest.
One option might be that you'd use a single FSM/signal register pair, shared among all the static stages (this is probably what generic sharing would do). Another could be that, for a mixed static/dynamic seq like the innermost one in your code, further simplified thus:
seq {
  static<4> a;
  b;
  static<2> c;
  d;
  static<3> e;
}
…we somehow try to use only a single FSM register to track status through the whole thing. That register would need 4+1+2+1+3=11 states (values 0 through 10), i.e., one state for every cycle of every static group and one state for each dynamic group. The control logic would roughly say:
if the state indicates we are in a static group:
    state++
else: # the state indicates we are in a dynamic group
    if that group is done:
        state++
Anyway, that by itself may be a terrible idea, but maybe thinking about what an ideal set of control logic would look like here could help reveal a way forward.
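To make the single-register idea concrete, here is a small Python simulation of the pseudocode above. It is only a sketch: the `SEQ` encoding, the function names, and the dynamic latencies passed to `run` are all hypothetical, and it ignores how a real compiler would signal completion or reset the register.

```python
# Each entry is (name, latency); latency None marks a dynamic group.
SEQ = [("a", 4), ("b", None), ("c", 2), ("d", None), ("e", 3)]


def build_states(seq):
    """One state per cycle of each static group, one state per dynamic group."""
    states = []
    for name, latency in seq:
        if latency is None:
            states.append((name, False))
        else:
            states.extend((name, True) for _ in range(latency))
    return states


def run(seq, dynamic_latency):
    """Step the single FSM; dynamic_latency maps a dynamic group's name to the
    number of cycles (>= 1) before its done signal fires (made-up values)."""
    states = build_states(seq)
    state = 0    # the single FSM register
    cycles = 0
    waited = 0   # stands in for the dynamic group's own progress
    while state < len(states):
        name, is_static = states[state]
        cycles += 1
        if is_static:
            state += 1            # static: advance unconditionally
        else:
            waited += 1
            if waited == dynamic_latency[name]:  # "that group is done"
                state += 1
                waited = 0
    return len(states), cycles


num_states, total_cycles = run(SEQ, {"b": 3, "d": 5})
print(num_states, total_cycles)
```

With the example seq this yields 11 states, and the total cycle count is the 9 static cycles plus however long the dynamic groups take.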
Also, does this get better with repeat blocks?
From a quick look at the resource estimation backend, it seems like it probably does (although, as this notebook entry explains, we can't compile for loops using repeats for every polybench benchmark). We should still run it through full Vivado synthesis to check exactly how much it improves the design.
One thing that I'm noticing is preventing more static promotion is that combinational reads of std_mem are preventing group2seq from working (since std_mem is not a combinational cell, when we write to its address ports and then combinationally read its value, it messes up group2seq, since group2seq can't tell that the std_mem is "acting combinational" in this case). Another reason to deprecate std_mem I suppose.
To summarize, I think there are probably two things to do:
1) Try to think of a way to compile static islands such that only one FSM is used. @sampsyo gave some possible starting points on how to tackle this problem.
2) Once we try 1), we look at the area/latency trade-off at that stage and think about changing our heuristics to save area.
I think one reasonable heuristic is to only promote the innermost computation in a nest of many dynamically scheduled loops. This means that the computation that runs the most often gets staticified.
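As a rough illustration of the innermost-only heuristic, here is a Python sketch over a toy control tree. The `Node` type and both functions are hypothetical and do not correspond to the compiler's actual IR or passes; the sketch just picks out dynamic loops whose bodies contain no further loops.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str  # "while" (dynamic loop), "seq", or "group"
    name: str = ""
    children: list["Node"] = field(default_factory=list)


def contains_loop(node: Node) -> bool:
    return node.kind == "while" or any(contains_loop(c) for c in node.children)


def promotion_candidates(node: Node) -> list[str]:
    """Names of innermost loops: while nodes whose bodies hold no other loop.
    Only the bodies of these loops would be promoted to static."""
    if node.kind == "while" and not any(contains_loop(c) for c in node.children):
        return [node.name]
    found = []
    for child in node.children:
        found.extend(promotion_candidates(child))
    return found


# A three-deep loop nest with some straight-line groups in between.
nest = Node("while", "i", [
    Node("group", "init_j"),
    Node("while", "j", [
        Node("group", "init_k"),
        Node("while", "k", [Node("group", "mac")]),
        Node("group", "store"),
    ]),
])
print(promotion_candidates(nest))
```

Under this sketch, groups like `init_k` and `store` in the outer loops would stay dynamic, so they would not each need their own FSM.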
Another reason to deprecate std_mem I suppose.
Hard to disagree with this!
The main reason is that we aggressively promote control to static... for example, this is what the control program might look like post static-promotion.
As you can see, there are a bunch of "static islands". For each static island, we will instantiate two registers: one fsm and a one-bit signal_reg (this is the register that checks whether we have seen zero before, so that the wrapper can be done when the fsm reaches 0 for a second time).
Possible Solutions
My immediate thought was that we could try to run the cell sharing pass after we have compiled the "static islands" but before we have compiled tdcc, so that we can share all of the fsm and signal_reg registers of all the different "static islands".
The problem is that there is a continuous assignment that uses both fsm and signal_reg when we compile, making it impossible to share.
As a reminder, here is how we create a static_wrapper_group for a static_group with a static_group_fsm.
(maybe a bad idea) Could we try to change the way we compile the static_wrapper_group so that it will be possible to share the fsms and signal_regs of all the "static islands"?
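For readers unfamiliar with the wrapper, the fsm/signal_reg behavior described above can be modeled in a few lines of Python. This is a behavioral sketch based only on the prose description ("done when the fsm reaches 0 for a second time"); the function name is hypothetical and it is not the actual generated hardware or the compiler's code.

```python
def run_wrapper(latency: int) -> int:
    """Return the number of cycles until the wrapper would report done."""
    if latency < 1:
        raise ValueError("latency must be at least 1")
    fsm = 0         # counts through the static group's cycles, wrapping to 0
    signal_reg = 0  # one bit: have we already seen fsm == 0 once?
    cycles = 0
    # Done is modeled as: fsm is back at 0 and we have seen 0 before.
    while not (fsm == 0 and signal_reg == 1):
        if fsm == 0:
            signal_reg = 1  # first visit to state 0
        fsm = (fsm + 1) % latency
        cycles += 1
    return cycles


print(run_wrapper(4))
```

In this model, each island needs both pieces of state for the whole time it is active, which is why every island gets its own register pair unless they can be shared.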