cucapra / calyx-resource-eval

Resource Usage Evaluation for Calyx (& its Frontends)

Improving Systolic Array Generator Frontend #5

Closed calebmkim closed 1 year ago

calebmkim commented 1 year ago

(This is from a discussion w/ @rachitnigam about potentially improving the systolic array frontend so @rachitnigam please feel free to correct me if I'm missing something).

Part of the evaluation "story" should have benchmarks that combine static and dynamic timing (HiSparse is a good example of this).

Current

Since we have already worked on the systolic array frontend, we were thinking about how to improve it. In a sense, we already combine static and dynamic timing in our latest PR, in which the contraction dimension is a dynamic argument: the number of loop iterations is dynamic, while the body of the loop (the computation itself) is statically timed.

(Very Abstract) Plan

However, we were thinking of ways to improve the static/dynamic mix of the systolic array and make the example less trivial. One idea is to use fused kernels: kernels that perform a matrix multiplication and then some other operation on the resulting matrix. For example, we could look through the literature, or at real-life examples, to see whether the following pattern ever occurs: a matrix multiplication followed by a pointwise function applied to each element of the resulting matrix. Ideally, the pointwise function would be dynamically timed. (Edit: some examples could be ReLU, sigmoid, or tanh. Question: would the dynamic part be "choosing" between these possible functions?)
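
For concreteness, here is a minimal NumPy sketch of the kind of fused kernel being proposed (the function names are illustrative, and ReLU is just one possible pointwise function):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fused_matmul(a, b, pointwise=relu):
    # matrix multiply, then a pointwise function applied
    # before the result is ever written back out
    return pointwise(a @ b)

# e.g., a fully-connected layer followed by ReLU:
out = fused_matmul(np.random.randn(4, 8), np.random.randn(8, 3))
```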

sampsyo commented 1 year ago

Indeed, it would be really common in ML stuff to follow a matrix product (fully-connected layer) with an activation function (ReLU and friends). This is common enough that the major frameworks have fused kernels in them for this, obviating the need to (for example) write the big matrix back to memory, only to read it again to perform the pointwise activation. For example, here is the documentation for the MM primitive in Intel's DNN library, oneDNN: https://oneapi-src.github.io/oneDNN/dev_guide_matmul.html

Note the list of "post-ops" that can be enabled for the MM kernel (general elementwise function, sum, binary operation, or a generalized ReLU).

It's a little hard to think of what would be a realistic dynamic-latency operator in this kind of context… ReLU itself is not it. I wonder if a softmax kernel could somehow make sense, without getting totally crazy? Softmax would also have the benefit of being tangentially related to attention, which of course is everyone's favorite tensor computation these days.

rachitnigam commented 1 year ago

Softmax requires exponentiation which, depending on the algorithm used, could be dynamic. I don't know if our exponentiation generator is dynamic or not but we could investigate.
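
For illustration, one standard decomposition that makes exponentiation data-dependent is $e^x = e^{\lfloor x \rfloor} \cdot e^{x - \lfloor x \rfloor}$: the integer part can be handled by a loop whose trip count depends on the input, while the fractional part uses a fixed-degree polynomial. A rough Python sketch of that latency structure (an assumption about one possible algorithm, not a description of Calyx's actual generator):

```python
import math

def exp_dynamic(x):
    # split x into integer and fractional parts: x = n + f, 0 <= f < 1
    n, f = divmod(x, 1.0)
    # data-dependent loop: the number of multiplies depends on |n|,
    # which is where the dynamic latency would come from in hardware
    result = 1.0
    for _ in range(int(abs(n))):
        result *= math.e if n > 0 else 1.0 / math.e
    # fixed-degree (hence fixed-latency) Taylor polynomial for e**f
    poly = 1.0 + f + f**2 / 2 + f**3 / 6 + f**4 / 24
    return result * poly

print(exp_dynamic(2.3), math.exp(2.3))  # roughly equal
```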

calebmkim commented 1 year ago

> I don't know if our exponentiation generator is dynamic or not but we could investigate.

Our exponentiation implementation is dynamic iirc.

I also have a higher-level comment. It seems that, in our evaluation push, we have two directions:

1. Try to add a dynamic operator to the systolic array, i.e., what we discuss in this issue.
2. Push on the HiSparse work, expanding on what @paili0628 has done so far.

Thinking about the goals of the evaluation: both are meant to be non-trivial, real-life examples that combine static & dynamic timing. (2) is obviously more concrete, while (1) is more abstract. For now, I think it is probably possible for me to split time between the two, but I'm wondering if either of you has any thoughts about this.

rachitnigam commented 1 year ago

I think this is something you'll have to get a sense of by taking a look at the HiSparse stuff and seeing whether you're hopeful about making it work. I'd recommend spending a limited amount of time on it before deciding whether or not you want to pursue HiSparse debugging.

calebmkim commented 1 year ago

So I’ve been thinking about this and I have a few thoughts.

We should probably switch to fixed point (or floating point)

In softmax, the default base for exponentiation is $e$. Furthermore, since the resulting values of a softmax operation are going to be between 0 and 1, ints are really not going to be helpful. So if we want to try softmax, we should probably use fixed or floating point. I think the main difficulty in changing the current systolic array implementation would be getting a pipelined multiplier working for our data type.
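
For reference, the standard definition makes both points explicit: the base is $e$, and every output lies strictly between 0 and 1 because it is one positive term divided by a sum that includes it:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$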

We could try Leaky ReLU

Leaky ReLU needs to multiply negative values by 0.01, so this could technically be dynamic, and it would probably be easier to implement than softmax.
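
In software terms the whole operator is tiny; a one-line sketch:

```python
def leaky_relu(x, slope=0.01):
    # the multiply only happens on the negative branch,
    # which is exactly where the dynamic latency would come from
    return x if x > 0 else x * slope
```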

What would our implementations look like?

For Leaky ReLU: Before writing to memory, we would first do leaky ReLU on each input.

For Softmax: I found an interesting Google blog post (there's also a short accompanying paper) that describes a "partially fused" softmax & matrix multiplication. It describes a way to implement Y = softmax(X) * W, where X and W are 2D matrices, without wasting writes to memory for softmax(X). The blog describes this fusion in the context of GPUs, but its rationale (not wasting writes to memory) seems generally applicable.
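
Here is my rough NumPy paraphrase of that fusion idea (a sketch, not the blog's actual algorithm): process one row at a time, so the full softmax(X) matrix never gets written to memory:

```python
import numpy as np

def fused_softmax_matmul(x, w):
    # compute Y = softmax(X) @ W row by row, consuming each
    # softmax-ed row immediately instead of materializing softmax(X)
    y = np.empty((x.shape[0], w.shape[1]))
    for i, row in enumerate(x):
        e = np.exp(row - row.max())  # numerically stable exponentials
        y[i] = (e / e.sum()) @ w
    return y
```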

How it fits into the "story" of the paper

The main thing that worries me is that, to prove the point of the paper, we want a good example of mixed static/dynamic timing. In this case, the dynamism is coming from the exponential operator. However, it seems like there are ways to compute exponentiation that are static. In fact, I think there may be ways of slightly adjusting our current implementation of fixed point exponentiation to make it statically timed. In other words, exponentiation doesn't seem "fundamentally dynamic" in the same way that, for example, SpMV does. If we go with this, I think we should have a good explanation of why exponentiation should at least sometimes be dynamic.

rachitnigam commented 1 year ago

Thanks for the write-up @calebmkim! Re: the story of the paper, I have two points:

sampsyo commented 1 year ago

Awesome; all extremely good points!

On leaky ReLU: I really like this idea!! It may not exactly be rocket science ("leaky ReLU" is just x if x > 0 else x * 0.01), but it is absolutely definitely no-bones-about-it dynamic (it fundamentally depends on the input value whether we do nothing (1 cycle) or a multiplication (N cycles)). This is simple and makes it crystal clear why some computations want to be dynamic. If it's possible, I think we should go for this because it seems like a self-contained extension to the MM we already have and is sorta the minimal route to something meaningfully dynamic.

On softmax: Really neat find w/r/t the fused MM/softmax thing! This seems potentially exciting as well, but it also seems a lot more complicated than just tacking on a pointwise scalar function. So maybe we should put this on the roadmap (i.e., extend our GitHub Project thing) for some future phase, and just run with Leaky ReLU for now?

calebmkim commented 1 year ago

> So maybe we should put this on the roadmap (i.e., extend our GitHub Project thing) for some future phase, and just run with Leaky ReLU for now?

Sounds good to me! The fused softmax/MM thing is cool, but tbh I really wasn't sure how to implement it, and Leaky ReLU is arguably more "fundamentally dynamic" than softmax.

calebmkim commented 1 year ago

One thing I was just thinking of regarding LeakyReLU:

(Not sure if this is a valid counterpoint/concern.) It would be possible to statically time the Leaky ReLU stuff: you can use pipelined multipliers so that you can feed in new inputs each cycle. Then we would write into the memory either the result of the pipelined multiplier or the original value, based on whether the value is positive or negative. You could finish as fast as (and potentially faster than) the dynamically timed design, which would use non-pipelined multipliers.

The dynamically timed version would use non-pipelined multipliers and only perform multiplications when the value is less than zero. But since these multiplications can only happen one at a time, if you come across a run of negative numbers that all need multiplication, you may end up taking longer, since you cannot pipeline the multiplications.

I suppose the advantage of the dynamic version would be that it uses regular multipliers instead of pipelined ones, so it is probably cheaper (and, depending on the input, can be as fast as the static version)?
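
A back-of-the-envelope cycle model of the comparison (the 4-cycle multiplier latency is an assumption):

```python
MUL_LATENCY = 4  # assumed multiplier latency in cycles

def static_cycles(values):
    # pipelined: accept one input per cycle, plus pipeline fill/drain
    return len(values) + MUL_LATENCY

def dynamic_cycles(values):
    # non-pipelined: each negative value occupies the multiplier for
    # MUL_LATENCY cycles; positive values pass through in one cycle
    return sum(MUL_LATENCY if v < 0 else 1 for v in values)

print(static_cycles([1, -2, -3, 4]))   # 8
print(dynamic_cycles([1, -2, -3, 4]))  # 10: the negatives serialize
print(dynamic_cycles([1, 2, 3, 4]))    # 4: all positive, dynamic wins
```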

Maybe this is wishful thinking at this point, but perhaps a better way to show the advantage of a mixed static & dynamic design is when there is a loop-carried data dependency between iterations (like some sort of accumulator that gets conditionally updated); see the sketch below. It's easy to come up with made-up examples where this is true, but more difficult to find one in practice.
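
As a made-up example of the kind of loop-carried dependency I mean (the threshold and update rule are invented): whether iteration $i$ performs the long-latency multiply depends on the accumulator produced by iteration $i-1$, so the multiplies can't simply be pipelined away:

```python
def conditional_accumulate(values, threshold=10.0):
    acc = 0.0
    for v in values:
        # the branch depends on the accumulator from the previous
        # iteration, so the long-latency op sits on the critical path
        if acc < threshold:
            acc += v * v
        else:
            acc += v
    return acc
```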

Either way, I think I'll get started on the Leaky ReLU stuff. It seems like at least a pretty good example illustrating a possible use of a dynamic-static mix.

sampsyo commented 1 year ago

This is a useful observation! Indeed, there is an advantage to using pipelined multipliers here.

But I have one broad question about this thought experiment: wouldn't the "consumer" side of the multiply-or-identity unit still need some dynamic control, assuming that we are writing the result into a single-ported memory?

Like, here's a very quick sketch trying to show the data flow, where the little circles are the values flowing through it:

[image: sketch of the data flow through the multiplier and identity paths into the memory]

In the end, there has to be some kind of dynamic decision arbitrating who gets to write into the memory on every cycle. Which probably means adding "back-pressure" to avoid producing things too quickly. (Without this, there would be situations where both the multiplier path and the identity path want to write a value into the memory on the same cycle, and then there is no recourse.)

Does this make sense? The overall conclusion here would be that, while adding pipelining to your multiplier totally makes sense, it would not eliminate the dynamism.
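
A tiny model of the structural hazard (the path latencies are assumptions): with a 1-cycle identity path and a 4-cycle pipelined multiplier, an input issued at cycle 0 down the slow path and one issued at cycle 3 down the fast path both produce results at cycle 4, and a single write port can only accept one of them:

```python
ID_LATENCY, MUL_LATENCY = 1, 4  # assumed latencies of the two paths

def write_conflicts(values):
    # count cycles where the identity path and the multiplier path
    # both try to write into the single-ported memory
    ready = {}
    for t, v in enumerate(values):  # one input issued per cycle
        done = t + (MUL_LATENCY if v < 0 else ID_LATENCY)
        ready[done] = ready.get(done, 0) + 1
    return sum(n - 1 for n in ready.values() if n > 1)

print(write_conflicts([-1, 2, 3, 4]))  # 1 conflict, at cycle 4
```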

calebmkim commented 1 year ago

Yeah, what you're saying makes sense.

I guess my idea was that you can make the entire leaky-ReLU operation static by always multiplying by 0.01, whether or not the input is negative: then, once the multiply has finished, you decide which value you're going to write. But this write is scheduled statically (i.e., exactly 4 cycles after we feed the value into the multiplier).

[image: diagram of the statically scheduled leaky-ReLU pipeline]

This would keep the entire thing statically timed, and it would only increase the latency by a constant number of cycles (compared to a dynamic pipelined version), since at worst it loses only the time it takes to fill/drain the multiplier pipeline.
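
In software terms, the schedule would look something like this (a sketch with the 4-cycle multiplier latency assumed; the selection happens at a fixed offset, so every write cycle is statically known):

```python
MUL_LATENCY = 4  # assumed pipelined multiplier latency

def static_leaky_relu(values, slope=0.01):
    out = []
    for t, v in enumerate(values):
        product = v * slope               # enters the multiplier at cycle t
        result = v if v > 0 else product  # mux selects at cycle t + MUL_LATENCY
        out.append(result)                # write lands at a statically known cycle
    return out
```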

This is different from the current PR on GitHub, which uses non-pipelined multipliers; it implements a dynamic ReLU that takes either 1 or 4 cycles, depending on the input.

sampsyo commented 1 year ago

Ah, I get it now! That's a very clever idea. You're right that this is a perfectly reasonable static pipeline for this situation.

Not that this is necessary (I think the simple version you already have is also a reasonable design point), but one way to make things "even more dynamic" would be to add parallelism… like, you have $N$ multipliers, so you can potentially get $N$-way parallelism from the Leaky ReLU unit. But you want to keep the multipliers as busy as possible, and they're under-utilized whenever there are positive inputs (which bypass the multiply). I dunno, something like that could push things even more in the direction of dynamic scheduling.

calebmkim commented 1 year ago

I'm going to close this. We can continue systolic-array-based discussions in #8.