calyxir / calyx

Intermediate Language (IL) for Hardware Accelerator Generators
https://calyxir.org
MIT License
453 stars 45 forks source link

improve `comb_prop` #1931

Closed paili0628 closed 2 months ago

paili0628 commented 3 months ago

Improve comb_prop so that it handles the case where

wire.in = **guard** ? 1'd1; // Assignment 1
a = wire.out ? b; // Assignment 2

in this case, comb_prop should be able to eliminate Assignment 1 and turn Assignment 2 into

a = **guard** ? b;

This optimization should effectively eliminate the need for instantiating wires separately for groups' go signals.

rachitnigam commented 3 months ago

How is this correct? If you have:

wire.in = cond ? p;
x.in = wire.out ? 15;

And I set cond = 1, p = 0. The optimized program will ignore the value of p and assign 15 to x.in.

paili0628 commented 3 months ago

How is this correct? If you have:

wire.in = cond ? p;
x.in = wire.out ? 15;

And I set cond = 1, p = 0. The optimized program will ignore the value of p and assign 15 to x.in.

So we have to check that p = 1'd1 in this case. If p=1'd0 then we are not able to swap wire.out to cond.

Oh ok there was a typo in my initial explanation. Namely, wire.in should be assigned to **guard** ? 1'd1. sorry for the confusion.

rachitnigam commented 3 months ago

So the optimization requires that we statically prove that p = 1'd1 right? That makes more sense.

paili0628 commented 3 months ago

Yep. We have to check that the source of the assignment is a 1'd1 constant before swapping the guard.

sampsyo commented 3 months ago

This makes sense, @paili0628! Thanks for updating the original description. I would be SUPER interested to know whether this has a measurable effect on the FPGA resources/timing!

At your leisure: where do the tests currently stand? It looks like some are failing—it could be cool to dive into whatever is going wrong, as a signal about what the pass needs to do next.

paili0628 commented 3 months ago

Thanks for asking! I'm currently having a painful time debugging, there seems to be a lot of chain reaction going on because of this optimization that need to be fixed. One I just understood and fixed is, for the code example below:

a.in = b.out ? 1'd1;
b.in = c.out ? 1'd1;
c.in = d ? 1'd1;

we need to parse that b and c are just propagating d.in and compile this code to:

a.in = d ? 1'd1;

Since we remove all the assignments matching the pattern

wire.in = **guard** ? 1'd1;

if we do not find the "source" of the wire signals, we will have buggy circuits.

And yet, there still seems to be something else wrong, so trying desperately to find out why...

sampsyo commented 3 months ago

Aha, got it! That one seems tricky, but nice work figuring it out. One strategy that can work for these kinds of issues in particular is to leave the now-dead code (the wire.in = guard ? 1 assignment) in place and let some other pass clean it up.

Always fun to track down the next wave of bugs! It could be cool to take some arbitrary failing test and reduce it…

rachitnigam commented 3 months ago

One strategy that can work for these kinds of issues in particular is to leave the now-dead code (the wire.in = guard ? 1 assignment) in place

comb-prop already has an option for this:

% cargo run -q -- pass-help comb-prop
- comb-prop: propagate unconditional continuous assignments
  * no-eliminate: mark dead assignments with @dead instead of removing them (default: false)
rachitnigam commented 2 months ago

@paili0628 Can you explain what the changes are supposed to do? For example, in the following program, we can't greedily rewrite the assignments:

w.in = c1 ? 1'd1;
r.in = w.out;
w.in = c2 ? 1'd1

Does the code handle this case?

paili0628 commented 2 months ago

No, the code does not handle this case. In fact I think for the current implementation it will just randomly replace w.out with one of c1 ? 1'd1 and c2?1'd1. So I guess what we might do to handle this case is replacing w.out with c1 | c2, or do you think there are just too many corner cases to make this a general optimization?

rachitnigam commented 2 months ago

I think we'd have to think a little more carefully about the data structures we want to use if we want to perform generalize this optimization more. The current approach of using arrays and iterating is kind of silly. We should use some other structure to actually represent the dataflow of the computations and then propagate the updates across wires.

Out of curiosity, what was the motivation for this change?

paili0628 commented 2 months ago

Initially I wanted to eliminate the need for a go signal that is continuously high throughout the execution of a group, so we could just at compile time replace the go signal with the guard which the go signal is propagating. Then through some group discussion we decided to make this a general optimization.

rachitnigam commented 2 months ago

I think the first step towards that goal is changing the semantics of components so that they don't require go signals to be asserted for the entire duration of the execution. Once we have that, we can change tdcc other lowering passes to generate control logic that doesn't use the FSM signal.

This optimization separately seems useful but instead of making it a part of this pass, I think the more general observation should be the following: "If a 1-bit port has exactly one assignment to it and it is of the form p = g ? 1, we can rewrite the assignment to p = g. Once that is done, this pass can just continue having the same logic and and leverage the previous rewrite.

The approach to that second goal should be generalizing the pass that rewrites static timing guards into a more generic guard-simplify pass that does these kinds of optimizations.

sampsyo commented 2 months ago

Starting simple in that way would be great! If we can work out the bugs with that restricted case, that would provide a platform to stand on to do some more generalization.

paili0628 commented 2 months ago

I'm gonna close this pr first, and then work on changing the semantics of the component.