calyxir / calyx

Intermediate Language (IL) for Hardware Accelerator Generators
https://calyxir.org
MIT License

Rethinking Static Compilation #2207

Open calebmkim opened 4 months ago

calebmkim commented 4 months ago

(Some of this post is duplicated from my lab notebook entry).

Recap of Problem

Suppose we have the following control structure (assume all groups take 1 cycle):

static seq {A; B; static repeat 10 { <body with latency N> } C; D; } 

FSMs are currently implemented like this:

parent FSM: 0->1->2->3->4->...->2 + (10*N) + 1 // keeps counting through the entire repeat
child FSM:        0->1->.....->N->0->1->2.... // will repeat 10 times 

As you can see, because the parent does not stop counting when it offloads to the child, it can drastically increase the size of the parent register.
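To make "drastically increase" concrete, here is a quick back-of-the-envelope sketch (Rust; the body latencies are made up, and it assumes a binary-encoded FSM register needs ceil(log2(#states)) bits):

// The parent FSM for `static seq {A; B; static repeat 10 {<body latency N>} C; D;}`
// must count through every cycle of the repeat, i.e. 4 + 10*N states in total.
fn bits(states: u64) -> u32 { 64 - (states - 1).leading_zeros() }

fn main() {
    for n in [10u64, 100, 1000] {
        println!("N = {}: parent register needs {} bits", n, bits(4 + 10 * n));
        // N = 10 -> 7 bits, N = 100 -> 10 bits, N = 1000 -> 14 bits;
        // without the repeat, the 5 states for A, B, <offload>, C, D would fit in 3 bits.
    }
}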

I implemented FSMs so that the parent pauses when it offloads to the child:

parent FSM: 0->1->2  // parent pauses instead of counting  -> 3 -> 4 etc. 
child FSM:        0->1->.....->N->0 // child will repeat 10 times 
iteration FSM:    0->0->0->.....->1  // iteration FSM holds the number of iterations that have occurred.  

Note that, unlike with the existing compilation technique, we need an iteration FSM to count the number of iterations of the repeat loop (the existing compilation technique doesn't need to do this: it can look at the parent FSM to know when to stop).
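For concreteness, here is a minimal hand-written model (not the generated hardware; the exact state encoding is just one plausible choice) of how the three FSMs tick for the example above, with a body latency of 3:

// Cycle-by-cycle model of the proposed scheme for
// `static seq {A; B; static repeat 10 {<body latency N>} C; D;}` with N = 3.
// The parent holds its state while the repeat runs; the child FSM counts
// through the body, and the iteration FSM counts completed iterations.
fn main() {
    let (n, reps) = (3u64, 10u64);
    let (mut parent, mut child, mut iter) = (0u64, 0u64, 0u64);
    for _cycle in 0..(2 + n * reps + 2) {
        println!("parent={} child={} iter={}", parent, child, iter);
        if parent == 2 {
            // Offloaded to the repeat: only the child/iteration FSMs tick.
            if child + 1 == n {
                child = 0;
                iter += 1;
                if iter == reps { parent += 1; } // repeat done: parent resumes at C
            } else {
                child += 1;
            }
        } else {
            parent += 1; // A, B, C, D each take one parent state
        }
    }
}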

Results

Using this new technique has led to some pretty good resource-usage and worst-slack results (compare the red line with the orange bars on these graphs: https://github.com/orgs/calyxir/discussions/2202#discussioncomment-10038812). The tl;dr is that if we intelligently handle static-par blocks (see below), we can get results that are pretty clearly better than the existing compilation technique.

Challenge: compiling static par (warning: lots of low-level details)

One of the really nice things about the existing compilation technique is that static pars are really easy to compile. Suppose we have the following control structure:

static par {
  static seq { <something that takes 5 cycles> static repeat 10 {<body takes 5 cycles>} }
  static seq { <something that takes 22 cycles> static repeat 5 {<body takes 10 cycles>} } 
} 

The FSM will look like the following:

parent FSM: 0->1->2...5->6->7->.....................22->23->24->....................68->69->70
child FSM             0->1->...5->0->1->.... // repeats 10 times 
child FSM                                            0->1->...10->0->1->.... // repeats 5 times 

The nice thing is that each child FSM easily knows exactly when to start since it can refer to the parent FSM. For example, the first child needs to start on cycle 5, so it starts when the parent FSM equals 5. The second child needs to start on cycle 22, so it starts when the parent FSM equals 22.

Using the new technique:

parent FSM: 0->1->2...5... // parent stops at 5 
child FSM             0->1->...5->0->1->.... // repeats 10 times
iteration FSM         0->0->0->...1 // iteration counts the iteration number

child FSM             // I need to start on cycle 22... when do I start? 

The challenge is knowing when the second child should start (and this problem obviously generalizes to the third, fourth, etc. children if they exist too). You can't just tell the second child to start when the parent FSM == 22, since that isn't actually the 22nd cycle (because the parent FSM stops at 5 to offload to its children).

The way I see it, there are two options. At first, I thought of trying to go for option 1, but then quickly switched to option 2.

Option 1: having a "reference" thread.

(This is NOT my chosen implementation, so you should skip this if you only care about what I actually implemented and not my decisions for why I wanted to implement things the way they are.)

Perhaps the first instinct is to try a compilation technique that is most similar to the existing technique (i.e., having a single FSM that all children can read from). The group of {parent FSM, child FSM, iteration FSM} can serve the same function as the previous parent thread. So for our example, the second child FSM will be activated from cycles 22 through 70. Unlike the existing technique, in which the guard is a simple 22 <= parent_fsm < 70, the guard is going to be a bit more complicated.

We need to translate the query %[22:70] into an FSM guard. We'll start with:

parent_fsm >= 6 | <some other condition>

parent_fsm >= 6 covers cycles 55->70.
But we still need to fill out <some other condition> to cover cycles 22->55. This corresponds to when the parent FSM is offloading to the child FSM.

parent_fsm >= 6 | parent_fsm == 5 & <some condition involving the child/iterator>

Now we've covered when the parent_fsm is offloading, but we still need <some condition involving the child/iterator> to capture the start condition exactly at cycle 22 (corresponding to cycle 17 of the child). Let's add that:

parent_fsm >= 6 | parent_fsm == 5 & ((iteration_fsm == 3 && child_fsm >=2) | iteration_fsm >=4 )

Basically, we need the condition to start being true on the 2nd cycle of the 4th iteration (i.e., when iteration_fsm==3 and child_fsm==2), and continue to be true for the remaining iterations (i.e., when iteration_fsm >= 4).
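As a sanity check, here is a throwaway cycle-by-cycle model (a sketch, not compiler code) of the reference thread's three FSMs; it confirms that this guard becomes true exactly at cycle 22 and stays true for the rest of the par:

// Model of the reference thread for the running example: 5-cycle prefix, then
// `static repeat 10` of a 5-cycle body (the parent pauses at 5 during the
// repeat and resumes counting afterwards, until the whole 72-cycle par is done).
fn main() {
    let (prefix, body, iters, total) = (5u64, 5u64, 10u64, 72u64);
    for cycle in 0..total {
        let (parent, child, iteration) = if cycle < prefix {
            (cycle, 0, 0)
        } else if cycle < prefix + body * iters {
            let off = cycle - prefix;          // offloaded: parent stays at 5
            (prefix, off % body, off / body)
        } else {
            (prefix + 1 + (cycle - prefix - body * iters), 0, iters)
        };
        // The guard from above.
        let guard = parent >= 6
            || (parent == 5 && ((iteration == 3 && child >= 2) || iteration >= 4));
        assert_eq!(guard, cycle >= 22, "guard should hold exactly from cycle 22 on");
    }
    println!("guard first holds on cycle 22 and stays high through the end of the par");
}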

Obviously this type of complicated guard will hurt resource usage and probably worst slack as well. And if the loop is nested deeper than the single loop in this example, this guard can get even more complicated.

One way to mitigate this would be to have a one-bit register guard_reg that sets itself high when parent_fsm == 5 & iteration_fsm == 3 & child_fsm == 1 (child_fsm == 1 because it takes one cycle to write to the register, and we want it to become high when child_fsm == 2), and then guarding with guard_reg.out instead... but I'm not sure if this would help or not.

I ended up not implementing things using this technique (a) because I thought it would be too difficult to implement and (b) because it might not even be the best in terms of resources/worst slack. But I think it's possible that this may actually be the best choice.

Option 2: creating a new parent FSM for each thread

Basically, we create a new parent FSM for each thread in the static par block. This is similar to how we compile dynamic par blocks, where we basically fire off each thread independently and then wait for them all to finish.

Kind of like this:

parent FSM: 0->1->2...5... // parent stops at 5 
child FSM             0->1->...5->0->1->.... // repeats 10 times
iteration FSM         0->0->0->...1 // iteration counts the iteration number

parent FSM: 0->1->2->3....22... // parent stops at 22 
child FSM                  0->1->...10->0->1->.... // repeats 5 times
iteration FSM              0->0->0......1 // iteration counts the iteration number

This will obviously increase register usage, but I'm hopeful that it will decrease logic (LUT) usage since the guards should be simpler.

Also, this is easier to implement since "fire off each thread and wait for them to finish" can be implemented using the same offloading abstraction used to offload the static repeat bodies (see next section).

What IR abstractions are useful for implementing this?

This FSM generation can get tricky, so you clearly need some sort of abstraction to help you (for example, none of my examples so far have even discussed nested loops, which add another layer of complication). I settled on a tree structure (although I don't know if that's the best abstraction). This makes handling nested loops much easier (one layer of nesting in the loop <-> one layer in the tree).

Each node in the tree has a body latency and number of repeats associated with it, and each child represents an FSM that the parent FSM has offloaded execution to.

For example, if we had the following control:

static seq {<take 10 cycles>; static repeat 10 {  <body latency 5> } <take 10 cycles>; static repeat 5 {  <body latency 10> }} 

We'll get the following tree:

[Screenshot: the tree for the example control above]
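To make this concrete, here is a hypothetical Rust sketch of such a tree (the field names and per-child start offsets are illustrative, not the code on the branch):

// Hypothetical sketch of the tree abstraction.
// Each node records the latency of one iteration of its body and how many
// times the body repeats; `children` lists the sub-FSMs the node offloads to,
// tagged with the cycle (within one iteration of the body) at which each starts.
struct FsmTree {
    latency: u64,                  // latency of a single iteration of the body
    num_repeats: u64,              // 1 for a plain static seq
    children: Vec<(u64, FsmTree)>, // (start cycle within the body, offloaded child)
}

fn main() {
    // Tree for the control program above.
    let tree = FsmTree {
        latency: 10 + 10 * 5 + 10 + 5 * 10, // 120-cycle body, executed once
        num_repeats: 1,
        children: vec![
            (10, FsmTree { latency: 5, num_repeats: 10, children: vec![] }),
            (70, FsmTree { latency: 10, num_repeats: 5, children: vec![] }),
        ],
    };
    println!("root latency = {}, repeats = {}, children = {}",
             tree.latency, tree.num_repeats, tree.children.len());
}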

What abstraction should we use for static par?

The one weird thing about the abstraction is that each child can be either 1) a single tree or 2) a group of trees (the latter is for static par blocks). The specific designation of "a group of trees" is so that I can distinguish between children that need to execute at the same time and children that do not execute at the same time.

So for example if you had:

static seq {<take 10 cycles>; static par {static repeat 10 {<body latency 5> } static repeat 5 {  <body latency 10> }}}

The tree would look like:

[Screenshot: the tree for the example above, with the two repeat threads grouped as a single static par child]
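Extending the earlier sketch, one way to encode "a child is either a single tree or a group of trees" (again with hypothetical names, not the branch's actual code):

// A child is either one offloaded tree or a group of trees that must all run
// at the same time (the threads of a `static par`).
#[allow(dead_code)]
enum Child {
    Single(FsmTree),
    Par(Vec<FsmTree>),
}

#[allow(dead_code)]
struct FsmTree {
    latency: u64,
    num_repeats: u64,
    children: Vec<(u64, Child)>, // (start cycle within the body, child)
}

fn main() {
    // Tree for the control program above: a 10-cycle prefix, then one `Par`
    // child grouping the two repeat threads (each 50 cycles).
    let tree = FsmTree {
        latency: 10 + 50,
        num_repeats: 1,
        children: vec![(
            10,
            Child::Par(vec![
                FsmTree { latency: 5, num_repeats: 10, children: vec![] },
                FsmTree { latency: 10, num_repeats: 5, children: vec![] },
            ]),
        )],
    };
    println!("root latency = {}, children = {}", tree.latency, tree.children.len());
}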

Code

The code is on the static-repeat-compilation branch: https://github.com/calyxir/calyx/tree/static-repeat-compilation. It is very messy and not at all readable. Honestly I would not recommend trying to read the code. I will work on improving readability this week.

sampsyo commented 4 months ago

Awesome!! Thanks for the really helpful walkthrough, @calebmkim. I admit that, while reading, my instinct was also to think about the "reference thread" idea, and you have made a pretty clear case about why that is probably not a good idea (i.e., the relevant guards can get pretty dang complicated).

I also very much get your point about the tree representation for FSMs. You are absolutely right that there is a need to distinguish parallel children from sequential children. A plain ol' forest doesn't quite seem to cut it, because we would need to keep track of which pars are children of which parents (recursively). This seems very related to the age-old "parallel CFG" problem (in fact, maybe it's the direct analog of that, except that we are staying in the land of structured control flow instead of potentially-irreducible CFGs). And in that setting, we had exactly the same problem: you do not want to confuse sequential successors with parallel successors. So you end up either (a) coming up with "special edges" that represent parallelism, or (b) introducing "super-nodes" that encapsulate an entire parallel subgraph. It seems like those are pretty much the same options here: either add a bit to the tree's children lists to distinguish par children from every other type of parent-child relationship, or make all pars into "super-nodes" that contain a list of trees.

calebmkim commented 3 months ago

One detail I elided in this issue is that when I went for option (2) creating a new parent FSM for each thread, I made all of the threads the same length (i.e., I would increase the length of all threads to match the longest running thread). This allows for a simple par group:

par_group {
  thread_1[go] = 1'd1; 
  thread_2[go] = 1'd1; 
  thread_3[go] = 1'd1; 
  ... 
} 

This generally works fine for binary encoding: even if one thread takes 50 cycles and you increase its latency to 200 cycles, you are only adding 2 bits. However, this can get really bad for one-hot encoding: it adds 150 unnecessary bits.
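For concreteness, the bit counts for that 50-vs-200-cycle example (a quick sketch, assuming a binary counter needs ceil(log2(n)) bits and a one-hot counter needs n bits):

// Cost of padding a 50-cycle thread out to 200 cycles under the two encodings.
fn binary_bits(n: u64) -> u32 { 64 - (n - 1).leading_zeros() }

fn main() {
    println!("binary : {} -> {} bits", binary_bits(50), binary_bits(200)); // 6 -> 8 (+2)
    println!("one-hot: {} -> {} bits", 50, 200);                           // 50 -> 200 (+150)
}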

The alternative is doing the following (suppose threads 1 and 2 take 50 cycles, and thread 3 takes 200 cycles):

par_group {
  thread_1[go] =  %[0:50] ? 1'd1; 
  thread_2[go] =  %[0:50] ? 1'd1; 
  thread_3[go] =  1'd1; 
  ... 
} 

The %[0:50] guard will be realized using thread_3 as the "reference thread". These guards will be pretty complicated, but I don't think it's as bad as option (1) ("having a reference thread") above, for a couple of reasons:

  1. the query always starts at 0
  2. we will only have one of these "complicated" guards per thread, which is not as many as if we were to straight up implement option (1) ("having a reference thread").

Longer-Term Thoughts

The choice of whether to make all threads the same length happens during static inlining, but the choice of OHE vs. binary encoding happens during compile-static, a later pass. It might be nice to expose an option in the compiler so that optimization information can be passed between passes (e.g., something so that the inlining stage can say "since I know the FSMs will be one-hot, I had better not increase the length of this thread from 50 to 200").