Ravenslofty / blog

The Critical Path - a rambly FPGA blog
49 stars 2 forks source link

06/02/2022: Keep Calm and Carry On #7

Open Ravenslofty opened 2 years ago

Ravenslofty commented 2 years ago

Keep Calm and Carry On

Wow, it's been a while since I last wrote a blog article. Sorry about that.

Credit where credit is due: I learned this trick from the Altera Synthesis Cookbook.

Most modern FPGAs come with a carry chain: hard logic to quickly propagate the carry bit of a sum into the next bit.

If you think about a 1-bit full adder, there are some useful properties related to carries:

Reduce-AND With A Carry Chain

In Verilog, using unary-& it is possible to AND every bit of a value together (which is 1 if every bit of the input is 1). This is called "reduce-AND", and by itself it's not very common, but it's a fantastic building block.

If you think about how reduce-AND works, it is equivalent to asking if every pair of bits in the signal propagates a carry. In other words, &x => most_significant_bit(x + 1). Using that identity, we can map a reduce-AND to a carry chain, getting the + 1 term by setting the carry-in to 1.

We can do better though. Since FPGAs are made up of K-input lookup tables (K-LUTs), we can first do a K-bit-wide reduce-AND using a LUT before going into the carry chain. On an architecture like the Lattice ECP5, this is free since the LUT is before the carry chain, so the bits would have to go through the LUT anyway. This transforms the problem into &x => &{&x[K-1:0], &x[2*K-1:K], ...} => most_significant_bit({&x[K-1:0], &x[2*K-1:K], ...} + 1).

Notably this is not free on the Lattice iCE40; because the carry chain is before the LUT you need to use another LUT to perform this. This leads to the interesting property that subtraction is more expensive on iCE40 than addition because you need a LUT for input inversion.

When I bring this idea up, I often get somebody mentioning big-O notation, since this approach would be O(n/K) compared to using a tree of K-LUTs being O(log_K(n)). However, propagating carries is really fast, so this approach is significantly faster than a tree of LUTs in practice.

Reduce-OR With A Carry Chain

Since reduce-AND is a thing, Verilog also includes reduce-OR as unary-| (which is 1 if any bit of the input is 1).

The core of the idea here is that through De Morgan's laws, one can transform |x into ~(&(~x)).

Since the K-LUTs are doing small reduce-ANDs already, we can pack the ~x terms into them by performing reduce-NOR (which doesn't exist in Verilog, but I hope makes sense) instead. This then means the carry bit is 1 if all bits of the input are zero, which we can then invert somewhere to form the final reduce-OR.

The Gowin LittleBee FPGAs strongly resemble the Lattice ECP5, but because of a feature the ECP5 has but the LittleBee does not, reduce-OR like this is less efficient than reduce-AND. It's still faster than a tree of LUTs however.

Equality With A Carry Chain

There are two ways of implementing this in practice: one way for equality between two signals, and another, more efficient way for equality between a signal and a constant.

The equality-between-two-signals approach is relatively straightforward. Chunk the input into K/2-bit equalities, and then reduce-AND them together: A == B => &{A[K/2 - 1:0] == B[K/2 - 1:0], A[K-1:K/2 - 1] == B[K-1:K/2 - 1]}.

The equality-between-signal-and-constant approach is a little more complex. Since the constant itself can be stored in the LUT, instead of needing to share LUT inputs between A and B, the inputs can be dedicated to the constant. This results in A == B => &{A[K-1:0] == B[K-1:0], A[2*K-1:K] == B[2*K-1:K]}.

The equivalent transform can be made for inequality, as well.

The General Idea

Fundamentally, what this logic is doing is expressing a function as an AND of some other logic (this is "product of sums" form if the "other logic" is just OR, but there's nothing requiring it to be OR). If the resulting function is relatively simple, it will probably map fairly efficiently to a carry chain.