calyxir / calyx

Intermediate Language (IL) for Hardware Accelerator Generators
https://calyxir.org
MIT License
492 stars 50 forks source link

First Class FSMs #2297

Open parthsarkar17 opened 2 weeks ago

parthsarkar17 commented 2 weeks ago

Idea

Here's the initial discussion on Zulip, which touches on the motivation for this.

The idea is to be able to instantiate FSMs as a group-like construct in the wires section of a Calyx program. In particular, these FSM constructs might let users specify:

Motivation

As the original post mentions, this was largely motivated by the idea that the Vivado synthesis tool is able to specifically infer a finite state machine construct in Verilog when FSMs are compiled to a Verilog case block. Here's what our FSM transition assignment looked like before:

assign fsm2_in =_guard485 ? 3'd6 : _guard490 ? 3'd5 : _guard497 ? 3'd2 : ... 3'd0

and here's roughly what Vivado expects in order to infer an FSM;

always @ (*) begin
    case (fsm2_out)
       ...
       3'd2 : begin
            if (raise_err_done_out) begin
                fsm2_in = 3'd6;
            end
            else begin
                fsm2_in = 3'd2;
            end
        end
        3'd3 : begin
            if (read_payload_from_mem_pop_done_out) begin
                fsm2_in = 3'd4;
            end
            else begin
                fsm2_in = 3'd3;
            end
        end
      ...

We want Vivado to recognize our register as an FSM because it can do some cool optimizations on it to reduce resources and potentially increase frequency. We're guessing the reason Vivado prefers the second construction is because it can list all transitions, guaranteeing a "finite" number of states, which wasn't necessarily clear from the original transitions specified by the assign.

So, if we want to emit these case statements via the Verilog backend, it would be nice to have an IR construct within the compiler to store these transitions. This prompted the idea to let users define FSMs themselves.

Example

Here's a small example I was working with to map out what a hand-designed FSM in Calyx might look like. It's a silly multiplier for positive integers that adds an input a_in to a result accumulator b_in times, but it importantly contains a loop and conditional transitions. The control looks like this:

control {
    loada;
    loadb;
    while gt.out with cond {
      resincr;
      bdecr;
    }
  }

which currently compiles to 7 states. One for each of loada ( s0 ) and loadb ( s1 ), one for the generated group that updates the comb_reg containing whether b.out > 0 ( s2 ), one for each of updating the accumulator ( s3 ) and b registers ( s4 ), another to once again update comb_reg ( s5 ), and finally the done state ( s6 ).

Here's how I was imagining the equivalent with a new fsm construct, stealing inspiration from what @sampsyo posted on the Zulip thread. For this example, I decided to have only one state in which comb_reg is updated. Also, while I've listed states as integers, I'm thinking of them as symbolic states.

Within the wires section, we might have something like:

fsm myfsm {
    0: { loada; } => <*> 1, 
    1: { loadb; } => 2,
    2: { comb_cond_update; } => {
            comb_reg.out: 3, <**>
           !comb_reg.out: 5,
        },
    3: { resincr; } => 4,
    4: { bdecr; } => 2,
    5: => 0
}
* implicit loada.done guard here, plus implicit self-loop if any transition condition not met
** again, implicit guard of comb_cond_update.done, with self-loop

Note that user-defined transition conditions should be exclusive, so the FSM remains deterministic. At least in this example, it seems that we can more specifically expect cond1 xor cond2 xor ... xor cond3 to be true, meaning the only condition on a self-loop will be if the group invoked in that state has a low done signal.

When we compile these fsm constructs to Verilog, we can indeed have a case statement that matches on the current state of the fsm register. Then, within each branch of this RTL case block, we would have an if/else if statement for each condition for a transition (e.g.comb_reg.out from above), which we would guard with the done signal of the group invoked in that state. Then, the else statement (whose purpose is to encode the implicit self-loop mentioned above) would only occur if the done signal of the invoked group is not high.

We would let the assignments within a group be compiled like normal, with one key difference: the only compiler-generated condition on whether an assignment within a group can activate is the state of the FSM from which the group is invoked. Here's why that's different from how we currently do it, and why we think this modification maintains correctness:

Currently, a group with a control block has its assignments is guarded by 1) the state of the FSM fsm.out and 2) tdcc.go. Further, the number of guards increases with nested control that offloads FSMs because you'd instantiate a tdcc group for each new FSM. Based on some discussions during the calyx-opt meeting, the checks of tdcc{x}.go seem unnecessary: if an FSM is currently in the middle of execution, there is never a point at which tdcc.go becomes low. It might offload to another FSM, but the tdcc.go signal remains high. Based on this idea, with nested control and offloaded FSMs, we need only check the "leaf" FSM state, and none of the "node" (i.e. parent) FSM states and none of the tdcc{x}.go signals.

Now, our new fsm construct idea doesn't use tdcc-like generated groups. But, we are thinking of allowing one FSM to call another FSM, similar to a group invoke. We think we can take advantage of this to dramatically reduce assignment guards, and thereby reduce LUT usage!

Brief Static FSM Interlude

I agree with the idea that static fsm constructs should be distinct from dynamic ones. Given that assumption, we could create a new fsm construct for static FSMs, where implicit group.done guards are not added to transition conditions. Since each group will have a latency attached to it, we could compile our static FSMs the same way, where each state is a cycle.

For example,

staticfsm my_fsm {
    s0 : { < 3 cycle group > } => s1,
    s1 : { < 1 cycle group > } => s2,
    s2 : ... => ...
}

would compile to:

always @ (*) begin
    case (fsm_out)
       4'd0 : fsm_in = 4'd1; // s0 begins here
       4'd1 : fsm_in = 4'd2;
       4'd2 : fsm_in = 4'd3;
       4'd3 : fsm_in = 4'd4 // s1 begins here
       4'd4 : fsm_in = 3'd5 // s2 begins here
       ...

Static repeats are the interesting bit. We surely wouldn't want to make users create distinct states for the same loop body, so maybe there can be a repeat construct within the staticfsm block somehow?

Open Questions

andrewb1999 commented 2 weeks ago

Static repeats are the interesting bit. We surely wouldn't want to make users create distinct states for the same loop body, so maybe there can be a repeat construct within the staticfsm block somehow?

For static repeat we can use a common trick to avoid state explosion, create a separate binary counter register that is incremented every cycle when in a given FSM state. We then transition to the next fsm state when the counter reaches a certain value. This could look some like this (in super fake syntax, just showing the general idea):

reg counter<32>;
cmp = counter == repeat_len;
staticfsm my_fsm {
...
s0 : {incr(counter)} => s1 // start of static repeat
s1 : { loop body state 1 } => s2 // repeat body with static latency of 2
s2 : { loop body state 2 } => { cmp : s3, !cmp : s0}
s3 : { ... } => s4 // after loop body
...
}
rachitnigam commented 2 weeks ago

So, if we want to emit these case statements via the Verilog backend, it would be nice to have an IR construct within the compiler to store these transitions

I think we have a broader goal here than just emitting FSMs for Vivado to recognize. Broadly, we want to be able to optimize decisions about how a control program is implemented using the FSM representation.

implicit loada.done guard here, plus implicit self-loop if any transition condition not met

I don't like the implicit nature of these guards and transitions. I think we should be very explicit with what the guards and transitions are:

0: { loada; } => loada[done] ? 1; default: 0

The default case must be specified for each transition and should explicitly point out what the state of the FSM is going to be if no transition is valid.

Based on this idea, with nested control and offloaded FSMs, we need only check the "leaf" FSM state

I like this idea! I can't find the issue but @paili0628 was working on changing the go-done interface so we don't have to assert the go signal for the duration of the execution.

Another a possible strawperson proposal for combining static and dynamic FSMs:

staticfsm my_fsm {
    s0 : { mult } => %3 ? s1,

The %3 guard is the usual timing guard that asserts it will be true on the third cycle.

parthsarkar17 commented 2 weeks ago

I don't like the implicit nature of these guards and transitions. I think we should be very explicit with what the guards and transitions are:

My thought process here was, if we have a construct for specifying dynamic FSMs in which groups are invoke-able (and, say, arbitrary assignments are not allowed), then some redundancy may exist if we expect users to list all conditions of transitioning to a new state, including the done condition of the group(s) that they just invoked.

For example, if we wanted to invoke A, B, and C in state s_i and transitition to s_j exactly after these groups are completed, then the following syntactic sugar:

fsm my_fsm {
    ...
    s_i: { A, B, C } => s_j,
    ...

may be nicer than:

fsm my_fsm {
    ...
    s_i: { A, B, C } => {A.done && B.done && C.done ? s_j : s_i} ,
    ...
}

I do think that allowing arbitrary assignments, and not just groups, to be executed in a given state is a good reason to make sure all transition conditions are defined by the user. But, if only groups are allowed to be invoked, then would this be redundant? I guess I can't think of a case where an FSM might want to transition without the guarantee of having completed a group call (i.e. a case where implicit group.done guards would hinder the user)?

rachitnigam commented 2 weeks ago

Welp, you've managed to demonstrate why I think provide arbitrary assignments is good idea. For:

fsm my_fsm {
    ...
    s_i: { A, B, C } => s_j,

There is no guarantee that A, B, and C all end at the same time so the guard A.done & B.done & C.done is not correct. For example, if A takes two cycles and the others take one cycle, that expression above will never be true (remember: done signals are asserted for exactly one cycle).

What you wanted to express there is that A, B, and C all execute in parallel which can only be implemented using distinct FSMs for each of them when we're dealing with dynamic control programs. Now, the two choices are either those invoke FSMs for each thread or you carefully generate all the right assignments.

I guess I can't think of a case where an FSM might want to transition without the guarantee of having completed a group call

I think the intuition is reasonable but here is one case: if you have multiple static groups executing in parallel and some of them complete before others, you want the relevant states to only have particular assignments in those states.

sampsyo commented 2 weeks ago

Thanks for the very useful discussion today, @parthsarkar17! I'm sorry I had to leave abruptly at the half-hour mark, but I would be super interested to hear if y'all came up with any conclusions about the parallelism thing. I guess I see a couple of broad options:

  1. FSMs can invoke each other using the machinery of ordinary conditional assignments (how, exactly? not sure) and a parent FSM uses that to delegate to child FSMs. (The interesting part in this option is nailing down exactly how they call each other.)
  2. There is somehow a first-class notion of parallel FSM execution… more like an NFA than a DFA, in the sense that multiple states in the same machine can be active at the same time? This is probably not the way to go, because it entails a pretty elaborate lowering to real hardware implementations, unless I'm missing something.
  3. We bake in a construct for fork-join patterns in FSMs, i.e., "super-states" that trigger other FSMs. This one also seems to probably be somewhat too high-level?

Anyway, I'm guessing that a sensible near-term plan looks mostly like option 1, but I admit I don't have a strong intuition about which way to go.

rachitnigam commented 2 weeks ago

One other possibility is allowing for some kind of fork and join primitive. It might be particularly interesting if you were allowed to fork an FSM's execution and then join in a different state. Not sure if this is a good idea but this will allow hack that I was talking about where a par thread gets to execute on the parent thread:

a;
par { 
  b;
  c;
}
d

Turns into:

fsm parent {
  0: { a } => { a.done ? 0 : 1 } 
  1: { 
      b; 
      t <= fork thread; // `fork` is immediate. When `thread` is done, it writes its done signal to `t`
  } => { b.done ? 2 : 1 }
  2: { d } => { 
    d.done && t.out ? end : 
    d.done ? 3 : 2
  } action {
    t.out ? reset t : {};  // set t to 0 if thread finished computing
  }
  3: {} => { thread.done ? end : 3 } action {
    t.out ? reset t : {};
  }
}

fsm thread {
  0: { c } => { c.done ? 1 : end }
}

This is a pretty maximal design with some trade-offs:

This is a strawperson design but it highlights some trade-offs in the space. This whole business with resetting shows the the challenges of hand writing FSMs. We need to strike a balance in explicitness and how easy these will be to generate correctly.

rachitnigam commented 2 weeks ago

I just realized that the above won't work because:

  1: { 
      b; 
      t <= fork thread;
  } => { b.done ? 2 : 1 }

Would repeatedly attempt to fork thread. We need some other way to continue executing b without forking thread. The problem is that something like this also doesn't work:

  1: { 
      b; 
      t <= fork thread;
  } => { b.done ? 2 : 3 }
  2: {
    b
  } => { b.done ? 2 : 3 }

Because b might take exactly one cycle to occur in which case we'll execute b two times...

parthsarkar17 commented 1 week ago

Hey Rachit, thanks for this cool fork/ join primitive idea! In your final code snippet in the first case, I was wondering what the purpose was of transitioning to state 2 if b.done. Wouldn't you want to transition to 3 in this case? If so, then b can take one cycle and not run twice.

Does this achieve what you were thinking?

  1: { 
      b; 
      t <= fork thread;
  } => { b.done ? 3 : 2 }
  2: {
    b
  } => { b.done ? 3 : 2 }
rachitnigam commented 1 week ago

Huh, I think that would love my problem! Wonderful catch!

sampsyo commented 1 week ago

Cool idea to write out the “maximal” version with a built-in pair of fork & join operators (category 3 above). It might also be interesting to write out a more minimal alternative, where instead of fork & join, FSMs just have a basic way to trigger other FSMs and a way to access their done signals (category 1 above)? It would look different, surely, but I’m not yet sure how different…

rachitnigam commented 1 week ago

Hm, I'm pretty sure this is a minimal version because any implementation will require a mechanism to trigger several FSMs in parallel (fork) and check their done signal (join).

sampsyo commented 1 week ago

Sure, maybe it's the same thing—as long as those two constructs are literally nothing more than a control-port access.