Open parthsarkar17 opened 2 weeks ago
Static repeats are the interesting bit. We surely wouldn't want to make users create distinct states for the same loop body, so maybe there can be a
repeat
construct within thestaticfsm
block somehow?
For static repeat we can use a common trick to avoid state explosion, create a separate binary counter register that is incremented every cycle when in a given FSM state. We then transition to the next fsm state when the counter reaches a certain value. This could look some like this (in super fake syntax, just showing the general idea):
reg counter<32>;
cmp = counter == repeat_len;
staticfsm my_fsm {
...
s0 : {incr(counter)} => s1 // start of static repeat
s1 : { loop body state 1 } => s2 // repeat body with static latency of 2
s2 : { loop body state 2 } => { cmp : s3, !cmp : s0}
s3 : { ... } => s4 // after loop body
...
}
So, if we want to emit these case statements via the Verilog backend, it would be nice to have an IR construct within the compiler to store these transitions
I think we have a broader goal here than just emitting FSMs for Vivado to recognize. Broadly, we want to be able to optimize decisions about how a control program is implemented using the FSM representation.
implicit loada.done guard here, plus implicit self-loop if any transition condition not met
I don't like the implicit nature of these guards and transitions. I think we should be very explicit with what the guards and transitions are:
0: { loada; } => loada[done] ? 1; default: 0
The default
case must be specified for each transition and should explicitly point out what the state of the FSM is going to be if no transition is valid.
Based on this idea, with nested control and offloaded FSMs, we need only check the "leaf" FSM state
I like this idea! I can't find the issue but @paili0628 was working on changing the go
-done
interface so we don't have to assert the go
signal for the duration of the execution.
Another a possible strawperson proposal for combining static and dynamic FSMs:
staticfsm my_fsm {
s0 : { mult } => %3 ? s1,
The %3
guard is the usual timing guard that asserts it will be true on the third cycle.
I don't like the implicit nature of these guards and transitions. I think we should be very explicit with what the guards and transitions are:
My thought process here was, if we have a construct for specifying dynamic FSMs in which groups are invoke-able (and, say, arbitrary assignments are not allowed), then some redundancy may exist if we expect users to list all conditions of transitioning to a new state, including the done
condition of the group(s) that they just invoked.
For example, if we wanted to invoke A
, B
, and C
in state s_i
and transitition to s_j
exactly after these groups are completed, then the following syntactic sugar:
fsm my_fsm {
...
s_i: { A, B, C } => s_j,
...
may be nicer than:
fsm my_fsm {
...
s_i: { A, B, C } => {A.done && B.done && C.done ? s_j : s_i} ,
...
}
I do think that allowing arbitrary assignments, and not just groups, to be executed in a given state is a good reason to make sure all transition conditions are defined by the user. But, if only groups are allowed to be invoked, then would this be redundant? I guess I can't think of a case where an FSM might want to transition without the guarantee of having completed a group call (i.e. a case where implicit group.done
guards would hinder the user)?
Welp, you've managed to demonstrate why I think provide arbitrary assignments is good idea. For:
fsm my_fsm {
...
s_i: { A, B, C } => s_j,
There is no guarantee that A
, B
, and C
all end at the same time so the guard A.done & B.done & C.done
is not correct. For example, if A
takes two cycles and the others take one cycle, that expression above will never be true (remember: done
signals are asserted for exactly one cycle).
What you wanted to express there is that A
, B
, and C
all execute in parallel which can only be implemented using distinct FSMs for each of them when we're dealing with dynamic control programs. Now, the two choices are either those invoke FSMs for each thread or you carefully generate all the right assignments.
I guess I can't think of a case where an FSM might want to transition without the guarantee of having completed a group call
I think the intuition is reasonable but here is one case: if you have multiple static groups executing in parallel and some of them complete before others, you want the relevant states to only have particular assignments in those states.
Thanks for the very useful discussion today, @parthsarkar17! I'm sorry I had to leave abruptly at the half-hour mark, but I would be super interested to hear if y'all came up with any conclusions about the parallelism thing. I guess I see a couple of broad options:
Anyway, I'm guessing that a sensible near-term plan looks mostly like option 1, but I admit I don't have a strong intuition about which way to go.
One other possibility is allowing for some kind of fork
and join
primitive. It might be particularly interesting if you were allowed to fork
an FSM's execution and then join
in a different state. Not sure if this is a good idea but this will allow hack that I was talking about where a par
thread gets to execute on the parent thread:
a;
par {
b;
c;
}
d
Turns into:
fsm parent {
0: { a } => { a.done ? 0 : 1 }
1: {
b;
t <= fork thread; // `fork` is immediate. When `thread` is done, it writes its done signal to `t`
} => { b.done ? 2 : 1 }
2: { d } => {
d.done && t.out ? end :
d.done ? 3 : 2
} action {
t.out ? reset t : {}; // set t to 0 if thread finished computing
}
3: {} => { thread.done ? end : 3 } action {
t.out ? reset t : {};
}
}
fsm thread {
0: { c } => { c.done ? 1 : end }
}
This is a pretty maximal design with some trade-offs:
end
state which finishes execution and asserts the done
signal. The done
signal is asserted for one cycle and de-asserted afterwards.
t <= fork thread
syntax states that some extra logic is created that stores the done
value into register t
.fork
ed FSM get blocked on at some point to ensure that we correctly implement par
semantics.
This is a strawperson design but it highlights some trade-offs in the space. This whole business with resetting shows the the challenges of hand writing FSMs. We need to strike a balance in explicitness and how easy these will be to generate correctly.
I just realized that the above won't work because:
1: {
b;
t <= fork thread;
} => { b.done ? 2 : 1 }
Would repeatedly attempt to fork thread
. We need some other way to continue executing b
without forking thread. The problem is that something like this also doesn't work:
1: {
b;
t <= fork thread;
} => { b.done ? 2 : 3 }
2: {
b
} => { b.done ? 2 : 3 }
Because b
might take exactly one cycle to occur in which case we'll execute b
two times...
Hey Rachit, thanks for this cool fork
/ join
primitive idea! In your final code snippet in the first case, I was wondering what the purpose was of transitioning to state 2
if b.done
. Wouldn't you want to transition to 3
in this case? If so, then b
can take one cycle and not run twice.
Does this achieve what you were thinking?
1: {
b;
t <= fork thread;
} => { b.done ? 3 : 2 }
2: {
b
} => { b.done ? 3 : 2 }
Huh, I think that would love my problem! Wonderful catch!
Cool idea to write out the “maximal” version with a built-in pair of fork & join operators (category 3 above). It might also be interesting to write out a more minimal alternative, where instead of fork & join, FSMs just have a basic way to trigger other FSMs and a way to access their done signals (category 1 above)? It would look different, surely, but I’m not yet sure how different…
Hm, I'm pretty sure this is a minimal version because any implementation will require a mechanism to trigger several FSMs in parallel (fork
) and check their done
signal (join
).
Sure, maybe it's the same thing—as long as those two constructs are literally nothing more than a control-port access.
Idea
Here's the initial discussion on Zulip, which touches on the motivation for this.
The idea is to be able to instantiate FSMs as a
group
-like construct in thewires
section of a Calyx program. In particular, these FSM constructs might let users specify:case
-like block)Motivation
As the original post mentions, this was largely motivated by the idea that the Vivado synthesis tool is able to specifically infer a finite state machine construct in Verilog when FSMs are compiled to a Verilog
case
block. Here's what our FSM transition assignment looked like before:and here's roughly what Vivado expects in order to infer an FSM;
We want Vivado to recognize our register as an FSM because it can do some cool optimizations on it to reduce resources and potentially increase frequency. We're guessing the reason Vivado prefers the second construction is because it can list all transitions, guaranteeing a "finite" number of states, which wasn't necessarily clear from the original transitions specified by the
assign
.So, if we want to emit these
case
statements via the Verilog backend, it would be nice to have an IR construct within the compiler to store these transitions. This prompted the idea to let users define FSMs themselves.Example
Here's a small example I was working with to map out what a hand-designed FSM in Calyx might look like. It's a silly multiplier for positive integers that adds an input
a_in
to a result accumulatorb_in
times, but it importantly contains a loop and conditional transitions. The control looks like this:which currently compiles to 7 states. One for each of
loada
(s0
) andloadb
(s1
), one for the generated group that updates thecomb_reg
containing whetherb.out > 0
(s2
), one for each of updating the accumulator (s3
) andb
registers (s4
), another to once again updatecomb_reg
(s5
), and finally thedone
state (s6
).Here's how I was imagining the equivalent with a new
fsm
construct, stealing inspiration from what @sampsyo posted on the Zulip thread. For this example, I decided to have only one state in whichcomb_reg
is updated. Also, while I've listed states as integers, I'm thinking of them as symbolic states.Within the
wires
section, we might have something like:Note that user-defined transition conditions should be exclusive, so the FSM remains deterministic. At least in this example, it seems that we can more specifically expect
cond1 xor cond2 xor ... xor cond3
to be true, meaning the only condition on a self-loop will be if the group invoked in that state has a lowdone
signal.When we compile these
fsm
constructs to Verilog, we can indeed have acase
statement that matches on the current state of thefsm
register. Then, within each branch of this RTL case block, we would have anif/else if
statement for each condition for a transition (e.g.comb_reg.out
from above), which we would guard with thedone
signal of the group invoked in that state. Then, the else statement (whose purpose is to encode the implicit self-loop mentioned above) would only occur if thedone
signal of the invoked group is not high.We would let the assignments within a group be compiled like normal, with one key difference: the only compiler-generated condition on whether an assignment within a group can activate is the state of the FSM from which the group is invoked. Here's why that's different from how we currently do it, and why we think this modification maintains correctness:
Currently, a group with a control block has its assignments is guarded by 1) the state of the FSM
fsm.out
and 2)tdcc.go
. Further, the number of guards increases with nested control that offloads FSMs because you'd instantiate atdcc
group for each new FSM. Based on some discussions during thecalyx-opt
meeting, the checks oftdcc{x}.go
seem unnecessary: if an FSM is currently in the middle of execution, there is never a point at whichtdcc.go
becomes low. It might offload to another FSM, but thetdcc.go
signal remains high. Based on this idea, with nested control and offloaded FSMs, we need only check the "leaf" FSM state, and none of the "node" (i.e. parent) FSM states and none of thetdcc{x}.go
signals.Now, our new
fsm
construct idea doesn't usetdcc
-like generated groups. But, we are thinking of allowing one FSM to call another FSM, similar to a group invoke. We think we can take advantage of this to dramatically reduce assignment guards, and thereby reduce LUT usage!Brief Static FSM Interlude
I agree with the idea that static
fsm
constructs should be distinct from dynamic ones. Given that assumption, we could create a newfsm
construct for static FSMs, where implicitgroup.done
guards are not added to transition conditions. Since each group will have a latency attached to it, we could compile our static FSMs the same way, where each state is a cycle.For example,
would compile to:
Static repeats are the interesting bit. We surely wouldn't want to make users create distinct states for the same loop body, so maybe there can be a
repeat
construct within thestaticfsm
block somehow?Open Questions
group.done
guards in the transitions, since, now, it may not be necessary to invoke a group at a state. Static FSMs might get around this by having a differentfsm
construct specific to static FSMs. Should we do the same and have an "assignment FSM" construct?