calyxir / calyx

Intermediate Language (IL) for Hardware Accelerator Generators
https://calyxir.org
MIT License

Separate Out FSM in IR during TDCC and `compile-static` #2020

Open calebmkim opened 3 weeks ago

calebmkim commented 3 weeks ago

During the AMC/Calyx-Opt meetings, we realized that Vitis was doing a one-hot encoding for their FSMs, which probably explains their lower LUT usage. This got us thinking about a new way to represent FSMs in the IR of the Calyx Compiler (not in the actual Calyx language) that will let us better exploit different sorts of FSM optimizations. (There's an interesting old lecture discussing this here.) Going to tag @parthsarkar17 who I wrote this up with.

Idea

The idea is to have a separate struct used in the TDCC (and the compile-static) pass that is meant to specifically represent FSMs. Something like this (obviously just a sketch and not a well-thought-out implementation detail):

struct FSMObject {
  num_states: u64,
  queries: Vec<QueryType>, // for example, how many assignments need to read from fsm.out == 2? How many care about fsm.out between 3 and 7?
  transitions: Vec<(CurrentState, NextState, Guard)>, // already there
}
enum QueryType {
  LT(u64),
  GT(u64),
  EQ(u64),
  BETWEEN(u64, u64),
  Port, // if someone is just reading fsm.out without comparing it to anything
}

Then, separately, we will instantiate the FSM (i.e., actually build the registers and assignments necessary for this FSM). When we perform the instantiation, this will make it easier for us to know which optimizations to perform, based on the FSM characteristics (i.e., num_states, queries, and transitions). For example, do we want to split the FSM (or even just copy it to reduce the fanout on some of them)? Should we use a one-hot encoding? etc.
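To make this concrete, here is a minimal, standalone sketch of what the decision logic at instantiation time could look like. Everything here is hypothetical: the type names (`Encoding`, `choose_encoding`) and the thresholds are placeholders for illustration, not part of the Calyx codebase.

```rust
// Hypothetical sketch: choosing an FSM realization strategy from the
// characteristics stored in the FSMObject sketch above. The names and
// thresholds are invented placeholders.

#[derive(Debug, PartialEq)]
enum Encoding {
    /// A single log2(n)-bit state register.
    Binary,
    /// One bit per state (what Vitis appears to do).
    OneHot,
    /// Partition/duplicate the FSM across several registers to cut fan-out.
    Split(usize),
}

fn choose_encoding(num_states: u64, num_queries: usize) -> Encoding {
    if num_queries > 32 {
        // Heavy query traffic: splitting reduces fan-out on each register.
        Encoding::Split(2)
    } else if num_states <= 16 {
        // Small state count: one-hot turns equality queries into 1-bit tests.
        Encoding::OneHot
    } else {
        Encoding::Binary
    }
}
```

The point is only that `num_states`, `queries`, and `transitions` give the instantiation step enough information to branch on; the real heuristics would come from benchmarking.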

Keep in mind that this all occurs within the TDCC, so there aren't actually any changes to the language.

Also looping in @ayakayorihiro since Adrian mentioned that you were interested in something that sounds related, but for profiling purposes rather than optimizations.

Also @rachitnigam, @andrewb1999, please feel free to add on if there is any thing we missed or got wrong.

rachitnigam commented 3 weeks ago

I would not allow QueryType::Port at all because that would tie us to a particular FSM implementation strategy. For example, if we want to internally generate multiple registers to reduce fan-out, QueryType::Port would allow users to observe the different registers.

I also think we should have something like:

query: HashMap<QueryType, ir::Port> // maps query to wire.out that generates it

This allows us to memoize and reuse wires from the generated port.
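A hedged sketch of the memoization idea, using toy stand-ins for the IR types (the real code would use ir::Port and ir::Builder; `PortId` and `FsmBuilder` here are invented for illustration):

```rust
use std::collections::HashMap;

// Toy stand-ins for the IR types.
#[derive(Clone, Hash, PartialEq, Eq, Debug)]
enum QueryType {
    Eq(u64),
    Between(u64, u64),
}

/// Stand-in for a handle to a generated wire's `out` port.
type PortId = usize;

struct FsmBuilder {
    // Maps each query to the wire that computes it.
    query_cache: HashMap<QueryType, PortId>,
    next_port: PortId,
}

impl FsmBuilder {
    fn new() -> Self {
        FsmBuilder { query_cache: HashMap::new(), next_port: 0 }
    }

    /// Return the port that computes `query`, building the wire only the
    /// first time a given query is requested.
    fn port_for_query(&mut self, query: QueryType) -> PortId {
        if let Some(&port) = self.query_cache.get(&query) {
            return port; // reuse the already-generated wire
        }
        let port = self.next_port; // stand-in for building a real wire
        self.next_port += 1;
        self.query_cache.insert(query, port);
        port
    }
}
```

Two assignments guarded by the same query then share one wire instead of duplicating the comparison logic.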

rachitnigam commented 3 weeks ago

BTW, I should mention that this is super exciting overall! Thanks for writing up the issue. I think there is a lot of room for cool stuff. I'll add two pointers:

rachitnigam commented 3 weeks ago

I would go ahead and implement a prototype if you have a good sense of what to do @calebmkim and we can discuss it in a future Calyx meeting.

calebmkim commented 3 weeks ago

I would not allow QueryType::Port at all because that would tie us to a particular FSM implementation strategy. I also think we should have something like: query: HashMap<QueryType, ir::Port> // maps query to wire.out that generates it

Yeah, both points make sense.

BTW, I should mention that this is super exciting overall!

Yeah, I agree that this is a really interesting/exciting direction!

I would go ahead and implement a prototype if you have a good sense of what to do @calebmkim and we can discuss it in a future Calyx meeting.

Sounds good; I can start poking around TDCC and playing with the above^ sketch.

calebmkim commented 3 weeks ago

I'm looking at the TDCC code, and I think it already does something pretty close to this! It has the following Schedule struct:

/// Represents the dynamic execution schedule of a control program.
struct Schedule<'b, 'a: 'b> {
    /// Assignments that should be enabled in a given state.
    pub enables: HashMap<u64, Vec<ir::Assignment<Nothing>>>, 
    /// Transition from one state to another when the guard is true.
    pub transitions: Vec<(u64, u64, ir::Guard<Nothing>)>,
    /// The component builder. The reference has a shorter lifetime than the builder itself
    /// to allow multiple schedules to use the same builder.
    pub builder: &'b mut ir::Builder<'a>,
}

schedule.calculate_states() fills in all this information, and schedule.realize_schedule() actually instantiates the FSM.

There are a few interesting things we could try to do.

1) Improve realize_schedule so that it can realize the FSM differently in some cases (e.g., use a one-hot encoding)

2) Separate out static FSMs in the IR as well.
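As a hedged illustration of why the one-hot option matters, this toy sketch contrasts how an equality query and a state transition behave under the two encodings (software stand-ins only; the real pass would emit comparator/shift hardware, not evaluate values):

```rust
/// Binary encoding: the register holds the state number itself, so an
/// equality query needs a full-width comparator.
fn eq_query_binary(fsm_value: u64, state: u64) -> bool {
    fsm_value == state
}

/// One-hot encoding: bit i of the register is high exactly in state i,
/// so an equality query is a single-bit test (no comparator needed) --
/// one plausible explanation for the lower LUT usage observed in Vitis.
fn eq_query_one_hot(fsm_value: u64, state: u64) -> bool {
    (fsm_value >> state) & 1 == 1
}

/// One-hot transition from state i to state i+1 is a shift, rather than
/// the adder required in the binary case. The cost is one flip-flop per
/// state instead of log2(n) flip-flops total.
fn one_hot_next(fsm_value: u64) -> u64 {
    fsm_value << 1
}
```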

ayakayorihiro commented 2 weeks ago

Thanks for this writeup and looping me in!! This is all very cool :) Just out of curiosity, can you explain a bit about what the queries are for?

calebmkim commented 2 weeks ago

I think it's mainly to make it easier to reason about optimizations when we are instantiating the FSM.

For example, if there is an assignment: lhs = 4 <= fsm <= 6 ? rhs, it might be more efficient to implement the guard as fsm == 4 | fsm == 5 | fsm == 6, rather than using <= and >= operators. For the different queries, it might make sense to instantiate these guards in different ways.

I'm not 100% sure if this makes a difference, but it may make sense to have a single wire for each query. For example:

lhs1 = fsm == 1 ? rhs1
lhs2 = fsm == 1 ? rhs2 

to

wire.in = fsm == 1 ? 1 : 0; 
lhs1 = wire.out ? rhs1; 
lhs2 = wire.out ? rhs2; 

If we have queries as part of the struct, we can imagine a single method that goes through each query, creates a wire corresponding to it (where each query is instantiated intelligently, e.g., 4 <= fsm <= 6 might be instantiated as fsm == 4 | fsm == 5 | fsm == 6), and then replaces each query with that wire.
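For illustration, here is a tiny sketch of the range-expansion step, rendering the resulting guard as a string (the real pass would build ir::Guard values; `lower_between` is an invented helper):

```rust
/// Hypothetical helper: expand a BETWEEN(lo, hi) query (inclusive on both
/// ends) into a disjunction of equality checks on the fsm register.
fn lower_between(lo: u64, hi: u64) -> String {
    (lo..=hi)
        .map(|s| format!("fsm == {s}"))
        .collect::<Vec<_>>()
        .join(" | ")
}
```

Under a one-hot encoding, each of these equalities is a 1-bit test, so the whole range check reduces to OR-ing a few register bits.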

Also, if we see there are a lot of queries (i.e., a lot of assignments read from the FSM's current state), then the logic for the FSM is getting very complicated, so we might want to split the FSM into two separate registers when we actually instantiate it. This increases register usage, but if the FSM is on the critical path of the design, it will shorten the critical path, since each FSM now has to do less work.

Obviously, we don't need queries to be part of the struct to perform these optimizations; it just makes it easier to have access to this information when doing them.

ayakayorihiro commented 2 weeks ago

I see! Thanks a lot for explaining this, it makes a lot of sense to keep the information around for optimization purposes.

On a somewhat related note, one of my goals for the week is to get TDCC to emit a JSON file denoting the FSM info (the dump-fsm option gives a human-readable version of the info, but it's easier to process as JSON). For a first pass I'll focus on getting the state and transition info, but please let me know if there's anything else I should add!

parthsarkar17 commented 2 weeks ago

Whoa, that's really cool! That would be really awesome and may be a great addition to the performance dashboard. Not sure if we even want that kind of information on the dashboard, but I feel like it could be helpful to see how optimizations affect output FSMs. When you get to that point, would you be able to let us know? Thanks!!

ayakayorihiro commented 2 weeks ago

Sure thing!

calebmkim commented 1 week ago

Just going to write down the state of the current FSM optimizations that we have been thinking of.

Static FSMs (I'm working on these currently)

Dynamic FSMs (I think @parthsarkar17 is working on these).

Querying

Using attributes for each optimization

In general, it might be helpful to introduce attributes for each of these optimizations. I can imagine one pass inserting these attributes on the control, and then the actual compilation passes blindly following whatever the attributes say. This makes things simpler, since it separates the decision of which optimization to perform from the actual implementation. It also allows users of Calyx to control the generated FSMs themselves if they want.

rachitnigam commented 1 week ago

Super exciting stuff!

we could store the query in a register rather than checking the FSM value directly

I'm not sure what this would look like? Are you saying that instead of having a constant (7528), we'd put the value in the register?

In general, it might be helpful to introduce attributes for each of these optimizations.

This is an interesting idea but once again, the trade-off is that attributes add implicit dependencies between passes. The decoupling between the optimization and the decision to optimize is nice, but do we want passes earlier in the pipeline to be able to affect the optimization of FSMs as well?

calebmkim commented 1 week ago

I'm not sure what this would look like? Are you saying that instead of having a constant (7528), we'd put the value in the register?

I'm not sure if this is even a good idea, but ...

lhs = %[4:758] ? rhs 

To compile this, we have a one-bit register reg

reg.write_en = %3 | %757 ? 1'd1;
reg.in = %3 ? 1'd1;   // set high at %3 ...
reg.in = %757 ? 1'd0; // ... and back to low at %757.
lhs = reg.out ? rhs; 

Now, we don't have to check any ranges (i.e., we don't need a > or a <): we just check for equality twice.
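A toy cycle-by-cycle simulation of this register trick (the function name and the `start >= 1` assumption are mine, for illustration only) confirms that the register's output is high exactly during the guarded range:

```rust
/// Toy simulation of the register trick for a guard like %[start:end]
/// (active on cycles start..end, exclusive of end; assumes start >= 1).
/// Returns, for each cycle, whether the 1-bit register's output is high,
/// i.e., whether the guarded assignment would fire that cycle.
fn range_via_register(start: u64, end: u64, total: u64) -> Vec<bool> {
    let mut reg = false; // the 1-bit register
    let mut active = Vec::new();
    for cycle in 0..total {
        active.push(reg); // reg.out drives the guard this cycle
        // write_en fires on exactly two cycles: equality checks only,
        // no range comparators.
        if cycle == start - 1 {
            reg = true; // set high one cycle before the range begins
        } else if cycle == end - 1 {
            reg = false; // set low on the range's last cycle
        }
    }
    active
}
```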

This may not actually be a good idea, though. Just thought it was interesting (and useful to write down before I forget about it).

(And agreed about the attributes: I think it may be worth talking synchronously about the tradeoffs of what this decision would be at some point).

parthsarkar17 commented 1 week ago

Yep! @calebmkim hit everything we discussed, but I also wanted to give some more details on what we discussed w.r.t. @new_fsm attribute insertion in optimizing dynamic FSMs. Caleb mentioned that the number of groups (invocations?) in a control block should be the first shot at a metric for when to use the attribute. For example, if we have:

control {
    seq {G_1; G_2; ... ; G_n;}
}

the goal (if we even want to split for this value of n) would be to do:

control {
    seq {
        seq {G_1; G_2; ... ; G_{n/2};};
        seq {G_{n/2 + 1}; ...; G_n;};
    }
}

so we can build smaller FSMs for each of the sub-seq blocks. This is a pretty simple case, and things can clearly get much more complicated with while-blocks and stuff. Hopefully, the specific heuristics can be revealed by the performance dashboard, but here's the plan we settled on for now:

Any thoughts would be appreciated! Thanks!!
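The seq-splitting transformation described above can be sketched as follows, with a toy `Control` type standing in for Calyx's ir::Control (the type, function name, and threshold are all placeholders):

```rust
// Toy sketch of the seq-splitting heuristic: if a seq contains more than
// `threshold` statements, split it into two nested seqs so that each child
// can get its own, smaller FSM.

#[derive(Debug, PartialEq, Clone)]
enum Control {
    Group(String),
    Seq(Vec<Control>),
}

fn maybe_split_seq(stmts: Vec<Control>, threshold: usize) -> Control {
    if stmts.len() <= threshold {
        return Control::Seq(stmts); // small enough: leave it alone
    }
    let mid = stmts.len() / 2;
    let (first, second) = (stmts[..mid].to_vec(), stmts[mid..].to_vec());
    // The inner seqs are where a @new_fsm-style attribute would be attached.
    Control::Seq(vec![Control::Seq(first), Control::Seq(second)])
}
```

A real pass would recurse into nested control and handle while/par blocks; this only shows the flat-seq case from the example.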

rachitnigam commented 1 week ago

Interesting stuff @parthsarkar17! As always, I think it is going to be hard to judge the impact of some of these optimizations without implementing and trying them out with the benchmark suite. I recommend starting with a simple one (like the seq separation) and seeing how it alters the synthesis results!

calebmkim commented 5 days ago

I'm not sure if while-blocks are given their own FSM automatically. If they are, then maybe the pass should consider a while to be just a normal group, so we don't need to count the groups internal to the while when determining whether to split given the threshold. If not, then maybe we should count the groups internal to the while.

In dynamic Calyx, while blocks are not automatically given their own FSM. So I think something like while lt.out { seq {A; B; C; D;} } should count as 4 groups. Also, this problem of estimating how big the FSM will be for dynamic control has already come up in static-promotion, so it might be worth reusing/refactoring some of that code for your pass. https://github.com/calyxir/calyx/blob/cf8224e2f3f035050390a5f98e4939a8cef3e237/calyx-opt/src/passes/static_promotion.rs#L173

calebmkim commented 3 days ago

tl;dr: I think the @new_fsm attribute can be seen as a variant of the FSM splitting optimization I've talked about earlier.

Yesterday in AMC-CalyxOpt we discussed the @new_fsm attribute for static control.

For example, suppose we have

static seq { 
  A; B; C; D;
  @new_fsm static seq {E; F; G} 
  H; I; J; 
} 

We were discussing how to implement @new_fsm: specifically, whether the "parent" FSM should continue counting even when it hands off control to its child FSM, or it should just stay put:

static seq { // fsm0
  A; B; C; D; // fsm0 counts 0->1->2->3
  @new_fsm static seq {E; F; G} // fsm1 counts 0->1->2. Should fsm0 a) count from 3->4->5 here? or b) just stay put at 3? 
  H; I; J; // fsm0 picks back up and starts counting here again.
} 

The parent can either a) keep counting while the child is executing or b) just stay put while the child is executing (and wait for the "done" signal, i.e., wait until the child FSM finishes). Option a) is easier to implement and more composable, but b) saves FSM states (this becomes especially important if we want to do one-hot encoding, where one state = 1 bit).

However, in the static-calyx meeting today we just realized that option b) is very similar to the partitioning idea we have already talked about: FSM "splitting"

Register 1: 0->1->2->3
Register 2:          0->1->2
Register 3:                 0->1->2

vs. the @new_fsm implementation

Register 1: 0->1->2->3.    4->5->6
Register 2:          0->1->2

The only difference is that option b) hands control back over to register 1 rather than passing control to a new register (register 3).

So I think it might be worth thinking about @new_fsm, FSM splitting, and FSM duplication, all as different variants of the same optimization. I'm just writing this down so that everyone is on the same page about how we're thinking about optimizing static FSMs.

rachitnigam commented 1 day ago

Oh, this is an awesome insight! We can generalize and maybe hide away new_fsm with this kind of reasoning!