calyxir / calyx

Intermediate Language (IL) for Hardware Accelerator Generators
https://calyxir.org
MIT License
453 stars 45 forks source link

Inliner revamp #1813

Open rachitnigam opened 5 months ago

rachitnigam commented 5 months ago

The component inlining pass has a bunch of limitations:

These come from the original implementation (https://github.com/cucapra/calyx/pull/829) which wanted to reduce the amount of IR copying. Towards the end, the copying was needed anyways and so the current implementation makes the wrong trade-off.

We should revamp it to copy things as much as needed to support the two cases above. Specifically, we should make use of the ir::Rewriter infrastructure to compute how to rewrite the groups and the control program.

@bcarlet mentioned the potential for implementing more heuristics to automate the decision of inlining more in the future as well. @andrewb1999 mentioned that one of the big missing optimizations in HLS tools is managing FSM sizes. Both of these are unlocked by reimplementing the inliner to be more general.

rachitnigam commented 3 months ago

Just as a reminder for why making inlining work more generally is a good idea, it makes all the other optimizations so much more powerful. For example, see the results from @calebmkim's paper:

image

Fully inlining + resource sharing reduces usage by half!

calebmkim commented 1 month ago

(Edited to a more tricky example) Trying to implement this, and I'm trying to think of possible edge cases.

Consider the following example:

component child (in: 32) -> (out: 32) {
  cells {
    add = std_add(32); 
  } 
  wires {
    // continuous assignments assigning to add.left and add.right
    add.left = in; 
    add.right = in; 
  } 
  ... 
} 

component main () {
  cells {
    child_instance = child(); 
  } 
  ... 
  control {
    invoke child_instance(in = 32'd2)(); 
    invoke child_instance(in = 32'd3)(); 
  } 
} 

How would we inline these invoke child_instance? This is similar to an invoke child_instance()() with comb group where comb group performs the addition (we've just moved the comb group to be continuous assignments inside of the child component).

The problem with this scenario is that we need add.left = 32'd2; add.right = 32'd2; to be active exactly for the control of the child_instance. To do this, we would need something like a generalized with statement (i.e., you can attach with to any control statement).

rachitnigam commented 1 month ago

Yes, this is tricky! If we think about the kind of circuit that is generated when everything is compiled down, we'll end up muxing the inputs to the in signal of child. Maybe the trick here is that when inlining continuous assignments from a component, we need to generate some muxing logic along with the right signaling infrastructure.

Bigger picture question: should Calyx allow groups to abstractly refer to the control signals that will eventually be used to implement the surrounding FSMs? Is there a nice and composable way of doing that?

sampsyo commented 1 month ago

Thanks for sketching this up, @calebmkim.

To take your discussion one step farther (envisioning what this would look like with the much-maligned "general with" statement)… first, correct me if I'm wrong, but don't continuous assignments have the "magical" property that they are always active, even when the component that contains them isn't running (i.e., its go hasn't yet been set to high)? If that's the case, for continuous assignments in particular, it suffices to just copy them from the callee to the caller. So the inlined version could look like this:

component main () {
  cells {
    child_instance_add = std_add(32);
    child_instance_in = std_wire(32);
  } 
  wires {
    child_instance_add.left = child_instance_in.out; 
    child_instance_add.right = child_instance_in.out; 
    comb group child_instance_invoke1 {
      child_instance_in.in = 32'd2;
    }
    comb group child_instance_invoke2 {
      child_instance_in.in = 32'd3;
    }
  }
  control {
    with child_instance_invoke1 {
      // copy of child control
    }
    with child_instance_invoke2 {
      // identical copy of child control
    }
  } 
} 

Is that right? Maybe this is exactly the same as your proposal, differing only in that I generated a std_wire for the input port to make the copying of the continuous assignments a little simpler.

Anyway, it's pretty weird to imagine how to "fake" this (skip the fictional "general with" and get the behavior we are all imagining without it). @rachitnigam's suggested approach, to somehow expose the control signals directly to the structural part of the program, could perhaps work but would require a creative idea w/r/t how to do this in general… maybe one way of looking at this is that port assignments in invoke are already working quite a bit like a generalized with and it is hard to get this functionality without invoke.

An extremely dumb idea would be to create a 1-bit register for each invocation that holds 1 when the invocation is active… this would waste a cycle to set this register, but it would maybe be a stopgap?????

calebmkim commented 1 month ago

correct me if I'm wrong, but don't continuous assignments have the "magical" property that they are always active, even when the component that contains them isn't running (i.e., its go hasn't yet been set to high)?

Yes, you're right (so it was a bit misleading for me to compare it to an invoke with comb group statement, since they're not exactly the same. I suppose my point was that this example presents a similar challenge to what inlining an invoke with comb group statement would do.)

Is that right? Maybe this is exactly the same as your proposal, differing only in that I generated a std_wire for the input port to make the copying of the continuous assignments a little simpler.

Yes, we're on the same page. Your example is pretty much exactly how I would imagine inlining this example if we were to somehow suddenly support a generalized with operator.

An extremely dumb idea would be to create a 1-bit register for each invocation that holds 1 when the invocation is active… this would waste a cycle to set this register, but it would maybe be a stopgap?????

This sounds like it should work-- but it seems like it would have to waste two cycles, since we would have to set the register to 0 after the invocation, right? Also, if we were to do this, we could use this trick to solve the problem for invoke with comb group statements too (the comb group assignments can be moved into continuous assignments that are only triggered during the invocation).

However, there's a separate question of whether we would even want to implement something like this, given this latency penalty.

Here's one proposal: we could do some really simple check before doing any inlining: if there is only ever one set of invocation arguments for a given instance (and never any invoke with comb), then we do not need to write to the register (we can just inline using the current technique). If there are multiple "sets" of invocation arguments (or comb groups), then we will use this "write to a 1-bit register" technique. Obviously, this is still a stopgap, but maybe an ok one?

Anyway, it's pretty weird to imagine how to "fake" this (skip the fictional "general with" and get the behavior we are all imagining without it).

"Faking" it sounds interesting but vague right now; I might spend some time thinking about what it would look like concretely--it may turn out that it's not possible.

sampsyo commented 1 month ago

Yeah, you're absolutely right about wasting another cycle to set the special 1-bit register to 0. And it's a good idea to special-case the one-invocation case to avoid needing this…

I fully support more thinking about how to "fake" the "general with", although I totally agree there's a good chance it's impossible. Maybe it's time to reconsider the general with again…

calebmkim commented 1 month ago

After thinking about it, the general with statement seems like it is (probably?) the most natural way to solve our problem.

Here is a summary of my thinking: 1) We need some way of being able to activate assignments for exactly some control block. 2) We run into problems with dynamic control, since the compiler is free to spend as much time as it wants in between groups, for example. We still need to activate the assignments during this "in between" time, even if no groups are currently executing. Therefore, in order for these assignments to be active, they must be continuous assignments. The problem with continuous assignments is that they are too coarse-grained to let us activate assignments for exactly a control block. The only solution I could think of is continuous assignments that are something like this:

child_instance_in.in = // magical guard that is active for a specific control stmt ? 32'd2;
child_instance_in.in = // magical guard that is active for a different specific control stmt ? 32'd3;

But it seems to me that this is basically the same thing as a generalized with operator.

This leads to the question: what are the downsides of a generalized with operator. One is that we are going to have to think about how it affects all of our existing optimizations (e.g., liveness analysis) and we will have to deal with it in future analyses. What are the other downsides?

calebmkim commented 1 month ago

Synchronous Meeting Summary

Why this problem matters

1) Inlining is a very important optimization, since it is what makes other optimizations possible (e.g., the graphs at beginning of thread). So it is worth it to try to make it really aggressive. 2) The 2-cycle latency penalty is not a viable long-term solution for inlining

Solution to this problem

The solution we have come up with is to explore the general with. The reason why is the following: we need some way for assignments to be active exactly for some control (e.g., child_instance_in.in = 32'd2 for the above design). These assignments either have to be placed in the general with or in continuous assignments (some sort of asymmetric par is also another solution, but general with seems to be a simpler solution). The main downside to a general with is that it makes analysis more difficult because it introduces a bunch of conflicts in assignments. But continuous assignments do this too, and they introduce even more conflicts! In other words, general with is actually more informative for analyses than continuous assignments.

Next Steps

1) Look at which passes/analyses have to change with a generalized with. I'll report back on if these changes are all doable. 2) If we determine it is doable, then the next step is to try an implement it. It should be simple (hopefully) to implement-- you just guard the comb group assignments with whichever fsm states the control is active (we do this in tdcc). For example, child_instance_in.in = 3 <= fsm.out <= 7 ? 32'd2 .

(Note that this entire problem is only for dynamic invokes. Static invokes are easy to inline, since you can use a static par that activates the continuous assignments for exactly the length of time you want them to).

calebmkim commented 1 month ago

After looking through the codebase for a bit, I think a generalized with would be doable. From a quick glance, the things that need to change are the following.

1) Domination Analysis: I think with comb_group { body } would be treated somewhat similarly to how we treat a par blocks. (i.e., we can almost treat it like par {comb_group; body }: neither comb_group nor anything in body dominate each other, but both are guaranteed to execute (which makes them different than something like if branches). 2) Live Range Analysis: similar to Domination Analysis, we can treat them kind of like par blocks.
3) Compaction: the function that builds the dependency graph for a given seq takes a continuous_assigns parameter, which helps conservatively insert dependencies that respect continuous assigns: for seq that exists within (possibly nested) with comb_group statements, we just have to add the assignments to comb_group as part of the continuous_assigns. 4) dead-assignment-removal will get more complicated, since we have to look at the control instead of locally reason about groups. (We can always "cheat" a little bit if we wanted to by being overly conservative, and treat all comb group assigns like they are continuous assignments).

There are probably some other minor things that need to change, but I think it's definitely doable if we want to take a shot.

rachitnigam commented 1 month ago

Thanks for the analysis @calebmkim! We should also define some optimizations that shrink the scope of with before these optimizations kick in so as to reduce its overall impact. A couple of rewrites that come to mind:

Reducing the scope of with guards is good because it also enables us to generate simpler guards for the resulting programs hopefully?

calebmkim commented 1 month ago

Yep, this all makes sense-- it seems like optimizations to reduce the scope of with would help.

If we have with c { x } where x is a group, we can inline the combinational assignments into the group.

I'm not quite sure this would necessarily help generate simpler guards (it will probably make the program easier to analyze)---wouldn't the assignments in c have the same guard either way (just guarded by the fsm state corresponding to x)?

Another Question

Another question: would we want to support a static with comb_group {body} statement?

One argument towards "no" is the following: we can just use a static par {body; comb group}. Also, during static-promotion, we can just directly promote with comb_group {body} -> static par {body; comb_group}. This allows us to avoid the extra work of supporting an extra operator.

On the other hand, there is a weird asymmetry to this, and it's probably not too much work to just add a static with operator.

rachitnigam commented 1 month ago

par<n> { group; comb group } is pretty weird. I see why this works (because you know exactly how long group runs for). Do we actually need a static with instead of just allowing with to exist in either context?

Maybe we need to redesign the data structures a bit so that we don't stratify static and non-static control ops and then just error out when non-static is nested in static.