0xPolygonZero / plonky2


Investigate other field options #1

Closed dlubarov closed 3 years ago

dlubarov commented 3 years ago

We're currently using p = 2^64 - 9 * 2^28 + 1, which I figured would enable a simple and fast Crandall reduction. It's reasonably fast, but seems like there's room for improvement.

We could probably get some speedup by switching to a slightly smaller field and delaying some reductions, although I expect that this would be fairly minor, since GMiMC doesn't do many additions in between multiplications. My code might also have room for improvement independent of field choice.
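
To sketch the delayed-reduction idea: with a modulus a couple of bits below 2^64, several 128-bit products can be accumulated before a single reduction. (Illustrative only; the modulus and `reduce128` below are stand-ins, not a proposal.)

```rust
const M: u64 = (1 << 62) - 57; // hypothetical ~62-bit modulus, for illustration

// Slow generic reduction, standing in for a fast special-form one.
fn reduce128(x: u128) -> u64 {
    (x % M as u128) as u64
}

// With 62-bit inputs each product is below 2^124, so up to 16 products fit
// in a u128 accumulator before any reduction is needed.
fn dot_product_lazy(xs: &[u64], ys: &[u64]) -> u64 {
    debug_assert!(xs.len() <= 16);
    let acc: u128 = xs.iter().zip(ys).map(|(&x, &y)| (x as u128) * (y as u128)).sum();
    reduce128(acc) // one reduction instead of xs.len() of them
}
```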

@unzvfu suggested looking into binary fields, which should be very fast on CPUs. There may be some complications with using FRI with binary fields, but it seems worth looking into.

A much more speculative idea is to use a field that isn't 2-adic, like p = 2^61 - 1. This would complicate LDEs and FRI. For LDEs, I suppose we could either use mixed-arity FFTs (might be expensive to do the high-arity layers), or use some other DFT algorithm with no smoothness requirement (not sure how practical they are). For FRI, we typically want pretty high-arity reductions anyway (like 4-to-1 or 8-to-1 layers), so it might not affect efficiency that much, but it would be a bit more complex.

It would be interesting to see what ~64 bit fields other projects have used, but I found very little online. AFAIK they don't come up in any widely used crypto primitives.

dlubarov commented 3 years ago

Assigning to @unzvfu as the "chief investigator", but I and/or @wborgeaud might help out if we end up exploring several options.

dlubarov commented 3 years ago

There's also the Monty option. If I'm remembering correctly, it seems like it should involve the same number of operations as our current Crandall reduction (two muls, two adds and a conditional sub), but I might be off.

unzvfu commented 3 years ago

Here is a draft summary of the field options that I've investigated so far, as well as some of their trade-offs. It'll probably remain a work in progress, and there are no doubt some minor improvements to be made at various points that I have missed and that we can update as we learn them.

The criteria for the field F of size q we're interested in are, roughly:

* fast multiplication and reduction with ~64-bit operands;
* high 2-adicity of q - 1, for FFTs and FRI;
* low-degree permutation monomials x^d (i.e. small d with gcd(d, q - 1) = 1), for GMiMC/Poseidon-style hashes.

One slightly surprising discovery is that rustc actually prefers muls to combinations of shifts and adds, while gcc does not; so the advantage of a prime with a special form for arithmetic is almost completely negated in pure Rust. I will check which of the two is actually faster and update; presumably it will depend on the machine architecture to some extent.

That led me to consider what the best form would be from the point of view of the other criteria; this suggests we use something of the form q = s*2^n + 1, with s small and not divisible by 3, 5, etc., to maximise the 2-adicity and control the permutation monomials. Then arithmetic in Montgomery form will provide about as good performance as you can expect (REDC with two muls and three adds). For example, depending on how big q and n have to be, we could pick q = 5*2^55 + 1, or q = 71*2^57 + 1, etc.

(These choices of q have mu = -q^-1 = q - 2 (mod 2^64), which is enticing, though I can't yet see how to use it to replace the multiplication-by-mu modulo 2^64 part of REDC with (say) multiplication by -2, which would be great. In some cases you can replace a mul-by-mu with shifts and adds: for example, x * (5*2^55 - 1) mod 2^64 is the same as (x << 57) + (x << 55) - x, which gcc still thinks is faster than the multiplication.)
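
For concreteness, a minimal REDC sketch for q = 5*2^55 + 1 (my own illustration, not benchmarked): two 64-bit muls, a wide add, and a conditional subtract.

```rust
const Q: u64 = 5 * (1u64 << 55) + 1;
const MU: u64 = Q - 2; // -Q^-1 mod 2^64, since (Q - 1)^2 = 25 * 2^110 ≡ 0 (mod 2^64)

/// Montgomery reduction: for x < Q * 2^64, returns x * 2^-64 mod Q.
fn redc(x: u128) -> u64 {
    let m = (x as u64).wrapping_mul(MU); // mul #1 (low 64 bits only)
    let t = ((x + (m as u128) * (Q as u128)) >> 64) as u64; // mul #2 + wide add
    if t >= Q { t - Q } else { t } // conditional subtract into [0, Q)
}
```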

Probably the fastest modular multiplication possible would be in a binary field (it costs just a couple of shifts and xors; see the pseudo-code below), though binary fields obviously have some significant disadvantages relating to the 2-adicity and permutation monomial selection criteria.

The following table summarises the "special form" options I've looked into, with varying degrees of detail (doesn't track conditional subs yet; sometimes not needed, e.g. with binary field). The "small factors" column tells us the 2-adicity (i.e. the exponent of 2) as well as which degree permutation monomials are available (namely, products of primes that don't appear in the list); all prime power factors up to and including 11 are shown.

| group | form | potential candidates | small factors of q-1 | pseudocode | implementation notes |
|---|---|---|---|---|---|
| Mersenne | 2^n - 1 | 2^61 - 1 | 2 · 3^2 · 5^2 · 7 · 11 ... | `hi,lo = MUL(a,b); (lo & ((1<<61)-1)) + (lo>>61) + (hi<<3)` | space for redundant repr. |
| pseudo-Mersenne / Crandall | 2^n - c, small c | 2^61 - 31 | 2^5 · 3 · 5 ... | `hi,lo = MUL(a,b);` then two muli's by c, three adds | same |
|  |  | 2^63 - 25 | 2 · 3^5 ... | same | same |
| binary fields | 2^n | F2[x]/(x^64 + x^4 + x^3 + x + 1) | 3 · 5 ... | `hi,lo = CLMUL(a,b);` then 3 shl, 4 xor | shl's have one immed. operand; shl's are independent, can tree the xor's; CLMUL instructions are the same speed as IMUL from Haswell on; possible improvement with lookup tables and SSE |
|  |  | F2[x]/(x^63 + x + 1) | 7^2 ... | `hi,lo = CLMUL(a,b);` then 1 shl, 1 xor, 1 and | as above; and's have one immed. operand; might be able to remove the and's? |
| Goldilocks | 2^(2m) - 2^m ± 1 | 2^64 - 2^32 + 1 | 2^32 · 3 · 5 ... | `hi,lo = MUL(a,b); hihi = hi>>32; hilo = hi & ((1<<32)-1); lo + (hilo<<32) - hilo - hihi` | note that (hilo<<32) - hilo can't underflow; each part of the reduction line may carry (resp. borrow); these could cancel, or result in a conditional add/sub of 2^32 - 1; not sure if that could cause another carry/borrow |
| Solinas | poly(2^t) | 2^64 - 9·2^28 + 1 | 2^28 · 11 ... | 1. same as for Crandall; or 2. use the structure of 9·2^28 - 1 = 2^31 + 2^28 - 1 | ... |
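
To make the binary-field rows concrete, here is a sketch (mine, and untested) of the fold-based reduction for F2[x]/(x^64 + x^4 + x^3 + x + 1), assuming the 128-bit carry-less product has already been computed (e.g. with CLMUL):

```rust
/// Carry-less multiplication by r(x) = x^4 + x^3 + x + 1 over GF(2).
fn mul_by_r(h: u64) -> u128 {
    let h = h as u128;
    (h << 4) ^ (h << 3) ^ (h << 1) ^ h
}

/// Reduce a 128-bit carry-less product modulo x^64 + x^4 + x^3 + x + 1,
/// using x^64 ≡ r(x): fold the high word down twice.
fn gf2_64_reduce(product: u128) -> u64 {
    let lo = product as u64;
    let hi = (product >> 64) as u64;
    let t = mul_by_r(hi); // may spill up to 4 bits past bit 63
    let spill = (t >> 64) as u64; // at most 4 bits, so the second fold can't spill
    lo ^ (t as u64) ^ (mul_by_r(spill) as u64)
}
```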

Please let me know if anything is unclear and if anything looks like it might be worth investigating a bit further. My next step from here is to look closer at the effect of field choice on the FFT and check whether rustc or gcc is right about muls versus shifts+adds.

dlubarov commented 3 years ago

This is great info!

> Seems larger is better for FFT, but I still need to understand the precise connection

I think it's just a matter of what size circuits we want to support. Like if we wanted to support circuits with 2^k gates, the prover will generate polynomials of degree 2^k, compute low-degree extensions of them (resulting in Reed-Solomon codewords of size 2^(k+3) or so), then commit to those LDEs.

Our current approach to computing LDEs is to run a (radix-2) IFFT to convert points to coefficients, zero-pad the coefficient list, and run a (radix-2) FFT to convert back to points. My understanding is that a radix-2 FFT of size 2^(k+3) requires a multiplicative subgroup of order 2^(k+3), so the field must have a 2-adicity of k + 3 or higher. (There might be other approaches we could consider, like using mixed-radix FFTs, or some other DFT algorithm that doesn't require a smooth subgroup. I suspect those alternatives may be slower though.)

I would estimate that the largest circuits anyone would want to use would have around 2^25 gates, since proofs for such large circuits would take a lot of time and memory. So I think it would be good to have a 2-adicity of around 28+, although this is just a rough number. We could go a few bits lower if there was a performance reason, or if we could get higher 2-adicity with no performance difference, that might be nice just in case someone wanted to use a giant circuit.
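
As a concrete illustration of how the 2-adicity gets used, here's how a 2^k-th root of unity falls out of the field structure (sketch over Goldilocks, where 7 generates the multiplicative group):

```rust
const P: u128 = 0xffff_ffff_0000_0001; // 2^64 - 2^32 + 1; P - 1 = 2^32 * (2^32 - 1)
const G: u128 = 7; // generates the full multiplicative group

fn pow_mod(mut b: u128, mut e: u128) -> u128 {
    let mut acc = 1;
    while e > 0 {
        if e & 1 == 1 { acc = acc * b % P; }
        b = b * b % P;
        e >>= 1;
    }
    acc
}

/// A primitive 2^k-th root of unity, which a radix-2 FFT of size 2^k needs;
/// it only exists for k up to the 2-adicity of P - 1 (32 here).
fn two_adic_root_of_unity(k: u32) -> u128 {
    assert!(k <= 32);
    pow_mod(G, (P - 1) >> k)
}
```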

> need to find out if it's helpful for the 2-adicity itself to have a special form, e.g. a power of 2.

As far as I'm aware, this shouldn't affect things.


That's interesting about the mul vs shifts/adds. I suppose the mul would be more compact, though that's not a priority for us.

I just noticed that the Rust toolchain supports profile-guided optimization. It would be nice if that could profile mul vs shifts/adds and select the winner for the local machine; I'm not sure it supports that sort of thing, though. The docs give me the impression that it's just used for low-level decisions like register assignment.


That's an excellent table! So to try to summarize,

dlubarov commented 3 years ago

On second thought, I'm not sure Mersenne would be a viable option since its 2-adicity is only 1. I'm not sure how practical it would be to compute DFTs and run FRI in a field like that (see the initial comment in this issue). Between that and the high exponent we'd have to use to get a permutation, it seems likely that Goldilocks would be the better fit if we used Poseidon.
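
(For reference, the permutation-monomial condition is just a gcd check; a quick sketch:)

```rust
fn gcd(a: u128, b: u128) -> u128 {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// x -> x^d permutes F_q exactly when gcd(d, q - 1) == 1.
fn is_permutation_exponent(d: u128, q: u128) -> bool {
    gcd(d, q - 1) == 1
}

fn main() {
    let mersenne = (1u128 << 61) - 1;
    let goldilocks = (1u128 << 64) - (1 << 32) + 1;
    assert!(!is_permutation_exponent(3, goldilocks)); // 3 divides q - 1
    assert!(is_permutation_exponent(7, goldilocks)); // x^7 works
    assert!(!is_permutation_exponent(7, mersenne)); // 7 divides 2^61 - 2
}
```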

dlubarov commented 3 years ago

Interestingly, Bobbin timed our field code as well as his own field code, and got around 1.3ns per field mul for both, much faster than the ~4ns per field mul I got on my laptop. No idea why mine is slower, but I think 1.3ns is believable. Assuming his i9-9980HK was Turbo Boosting to 5GHz, that's 6-7 cycles per field mul, which seems about right for the three 64-bit muls and the few additions etc.

If we can reproduce the performance that Bobbin saw, and get that kind of perf on most hardware, that should make Poseidon pretty practical.

unzvfu commented 3 years ago

Thanks heaps for the additional info @dlubarov! I agree with your summary of the table.

Regarding muls vs shift/adds, on practically all recent Intel architectures we have:

| op | latency (cycles) | throughput (instructions/cycle) |
|---|---|---|
| add/sub | 1 | 4 |
| shl/shr | 1 | 2 |
| mul | 3 | 1 |

So computing, say, (x << 57) + (x << 55) - x will require three cycles because of the data dependency: the two shifts can happen simultaneously (assuming both shift ports are free), but the add and the subtract must then run back to back. So it's actually pretty hard to beat mul, except possibly if there are several shift/add-based multiplications to do (the CPU can do two of them simultaneously if they're independent, but I don't think that's the case for us).
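
In Rust terms, the two forms the compilers are choosing between look like this (both compute multiplication by mu = 5*2^55 - 1 modulo 2^64):

```rust
const MU: u64 = 5 * (1u64 << 55) - 1; // = 2^57 + 2^55 - 1

/// Shift/add form: the two shifts can issue in parallel, but the add and
/// the subtract form a dependent chain, so ~3 cycles of latency.
fn mul_mu_shifts(x: u64) -> u64 {
    (x << 57).wrapping_add(x << 55).wrapping_sub(x)
}

/// Single-instruction form: one imul, also 3 cycles of latency, but fewer
/// µops and better throughput.
fn mul_mu_imul(x: u64) -> u64 {
    x.wrapping_mul(MU)
}
```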

Anyway, I'll add some benchmarks for some of the different field options.

dlubarov commented 3 years ago

Quick update/clarification on the timing of field ops: my previous figure of ~4ns per field mul was based on a sequential chain of field muls. Specifically, it was doing naive exponentiation by repeated multiplication. I added another benchmark (bench_field_mul_interleaved) with a more parallel dependency structure, and I'm now seeing much better numbers: ~.9ns, or ~4 cycles, per field mul.

I looked a bit at the asm of both, and didn't see anything weird or surprising like auto-vectorization. I think our current field mul code just has a very sequential dependency structure, so if we do a single field mul in a loop, the CPU can't do much else while it waits for muls to complete. Doing several independent field muls in a loop gives much better throughput.
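
The difference between the two benchmark shapes can be sketched like this (generic over a hypothetical field type, just to show the dependency structure):

```rust
use std::ops::Mul;

/// Serial shape: each mul waits on the previous one, so this measures latency.
fn pow_serial<F: Copy + Mul<Output = F>>(x: F, n: usize) -> F {
    let mut acc = x;
    for _ in 0..n {
        acc = acc * x;
    }
    acc
}

/// Interleaved shape: four independent chains keep the multiplier busy,
/// so this measures something much closer to throughput.
fn pow_interleaved<F: Copy + Mul<Output = F>>(xs: [F; 4], n: usize) -> [F; 4] {
    let mut acc = xs;
    for _ in 0..n {
        for (a, x) in acc.iter_mut().zip(xs.iter()) {
            *a = *a * *x;
        }
    }
    acc
}
```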

This might be a point in favor of Poseidon, since its dependency structure should permit more interleaving of field ops, whereas GMiMC has a very sequential dependency structure. I'm just speculating though; we'll see soon enough when we benchmark it.

unzvfu commented 3 years ago

I've implemented the three main prime/reduction algorithm pairs and timed them on the field_benchmarks branch.

Here are the timings in nanoseconds, and approximate number of cycles (assuming boost), on an i7-7700HQ @ 2.8 GHz (3.8 GHz boost). Proth-5 is 5*2^55 + 1 and Proth-29 is 29*2^57 + 1. The Serial timing measures the average cost per modular multiplication when they're done one after the other, whereas the Interleaved timing measures the average cost when several are interleaved at once.

| Prime | Serial (ns) | Serial (cyc) | Interleaved (ns) | Interleaved (cyc) |
|---|---|---|---|---|
| Crandall | 4.06 | 15.5 | 0.98 | 3.7 |
| Goldilocks | 3.97 | 15.1 | 1.22 | 4.6 |
| Proth-5 | 2.16 | 8.2 | 0.67 | 2.5 |
| Proth-29 | 2.58 | 9.8 | 0.87 | 3.3 |

The main observation is that Goldilocks doesn't perform as well as we had hoped. I spent a while trying different variations of the implementation, but it was very hard to remove the data dependencies caused by overflow/underflow on the addition/subtractions at the end. Of all the things I tried, the fastest version was:

```rust
fn reduce128(x: u128) -> GoldilocksField {
    // Write x = a0 + a1*2^64  (0 <= a0, a1 < 2^64)
    //         = b0 + b1*2^32 + b2*2^64 + b3*2^96  (0 <= b0, b1, b2, b3 < 2^32)
    const MASK_LO_32_BITS: u64 = (1u64 << 32) - 1u64;
    let (a0, a1) = split(x);   // This is a no-op
    let b3 = a1 >> 32;
    let b2 = a1 & MASK_LO_32_BITS;
    let x = (b2 << 32) - b2;  // Can't underflow
    let (y, cy) = x.overflowing_add(a0);
    let (mut z, br) = y.overflowing_sub(b3);

    if cy {
        z = z.overflowing_sub(GoldilocksField::ORDER).0;
    }
    if br {
        z = z.overflowing_add(GoldilocksField::ORDER).0;
    }
    GoldilocksField(z)
}
```

which compiles down to

```asm
reduce128:
        ; x = a0 + a1*2^64 arrives in (rdi, rsi) = (a0, a1)
        mov     rcx, rsi
        shr     rcx, 32
        mov     eax, esi
        shl     rsi, 32
        sub     rsi, rax
        lea     rax, [rsi + rdi]
        sub     rax, rcx
        mov     edx, 4294967295
        add     rdx, rax
        add     rsi, rdi
        cmovae  rdx, rax
        movabs  rax, -4294967295
        add     rax, rdx
        cmp     rsi, rcx
        cmovae  rax, rdx
        ret
```

Happy to hear about any ideas for improvements!

The other observation is that the Proth reduction is very fast; in fact the interleaved Proth-5 is almost optimal, since the 64-bit multiplication itself is 2 cycles. The caveat is that it is only a partial reduction, though it is compatible with the FFT in a nice way (cf. this comment on issue 8).

Overall, I think these timings give a pretty good idea of what is possible with different choices of prime, and the unfortunate conclusion is that there's no stand-out winner. I think it will be important to keep the code fairly generic to allow switching between different possibilities as I experiment with improving the FFT code. For example, the Proth-29 prime beats Crandall by about 10%, but has some extra bookkeeping required because of the way it changes residue class representation along the way; if the bookkeeping isn't too bad, that could be enough to swing things in its favour.

dlubarov commented 3 years ago

Neat! It's interesting that the compiler not only (ab)uses lea to compute 5 * c0 in Proth-5,

```asm
lea     rax, [rcx, +, 4*rcx]
```

but also to compute 29 * c0 as 3 * 9 * c0 + c0 + c0 in Proth-29,

```asm
lea     rax, [rcx, +, 8*rcx]
lea     rax, [rax, +, 2*rax]
add     rax, rcx
add     rax, rcx
```
dlubarov commented 3 years ago

A couple thoughts on the Proth options:

* I think it's totally fine in our case to have fast arithmetic for non-canonical encodings and a slower "to canonical" function.

* At first glance, it seems like underflow can happen, e.g. if `c0 == 0`. Do we need to conditionally add `q`? Your current code matches K-RED in the paper, but I guess the subtraction in the paper means (wrapping) field subtraction?

unzvfu commented 3 years ago

Yeah, I'm actually a bit confused about the use of lea for mult-by-29, since it has a latency of 3 when all three operands are used and is restricted to CPU port 1 (so those two leas can't be done in parallel); that's exactly the same as mul from at least Broadwell onwards AFAICT. The two adds at the end have a serial dependency too, which makes matters worse. Curious!

dlubarov commented 3 years ago

Another little thought -- since MASK_LO_55_BITS requires a movabs, I thought it might be slightly faster to use two shifts, like

```rust
fn reduce128(x: u128) -> ProthField {
    let x_shl9 = x << 9;
    let c0 = x_shl9 as u64 >> 9;
    let c1 = (x_shl9 >> 64) as u64; // free
    let (d, under) = (5 * c0).overflowing_sub(c1);
    ProthField(d)
}
```

but the compiler switches it back to an AND. I guess the compiler (and you) probably know better :)

unzvfu commented 3 years ago

> A couple thoughts on the Proth options:
>
> * I think it's totally fine in our case to have fast arithmetic for non-canonical encodings and a slower "to canonical" function.
>
> * At first glance, it seems like underflow can happen, e.g. if `c0 == 0`. Do we need to conditionally add `q`? Your current code matches K-RED in the paper, but I guess the subtraction in the paper means (wrapping) field subtraction?

Yeah, I probably shouldn't vaunt the qualities of the Proth primes too soon, since, as you noticed, I didn't do a fully correct implementation. Underflow can definitely happen, though this might be manageable by using a signed representation instead of unsigned. Otherwise, yes, we'll have to conditionally add q on underflow.
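
One possible branch-free shape for a "fully handled" version, folding twice and adding a fixed multiple of q (my own derivation, so treat with suspicion; note the result is congruent to 25x, not x, so the residue-class factor still has to be tracked, just like with plain K-RED):

```rust
const Q: u64 = 5 * (1u64 << 55) + 1;
const MASK55: u64 = (1u64 << 55) - 1;

/// Two-step K-RED for q = 5*2^55 + 1. Assumes x < 2^119 (true for a product
/// of two operands below 2^59). Returns r < 2^62 with r ≡ 25*x (mod q).
fn kred2(x: u128) -> u64 {
    // Write x = c0 + d0*2^55 + d1*2^110; since 5*2^55 ≡ -1 (mod q),
    // 25*x ≡ 25*c0 - 5*d0 + d1 (mod q).
    let c0 = (x as u64) & MASK55;         // 55 bits
    let d0 = ((x >> 55) as u64) & MASK55; // 55 bits
    let d1 = (x >> 110) as u64;           // at most 9 bits, given x < 2^119
    // 8*Q > 5*d0, so adding it keeps the result non-negative with no branch.
    25 * c0 + d1 + 8 * Q - 5 * d0
}
```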

dlubarov commented 3 years ago

Here are all the potentially suitable Proth primes whose coefficients are within 1 of a power of two. I figured these should be slower than Proth5 (but with a more desirable field size) but faster than Proth29:

More Proth fields:

* 2287828610704211969 = (2^7 - 1) * 2^54 + 1; mul group factors: 2^54 * 127; size ~61 bits
* 2308094809027379201 = (2^10 + 1) * 2^51 + 1; mul group factors: 2^51 * 5^2 * 41; size ~61 bits
* 4611615649683210241 = (2^16 - 1) * 2^46 + 1; mul group factors: 2^46 * 3 * 5 * 17 * 257; size ~62 bits
* 9223336852482686977 = (2^18 - 1) * 2^45 + 1; mul group factors: 2^45 * 3^3 * 7 * 19 * 73; size ~63 bits
* 1152923703630102529 = (2^19 + 1) * 2^41 + 1; mul group factors: 2^41 * 3 * 174763; size ~60 bits
* 9223369837831520257 = (2^22 - 1) * 2^41 + 1; mul group factors: 2^41 * 3 * 23 * 89 * 683; size ~63 bits
* 1152921513196781569 = (2^27 + 1) * 2^33 + 1; mul group factors: 2^33 * 3^4 * 19 * 87211; size ~60 bits
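
A quick sanity check of one of these (illustrative only):

```rust
fn main() {
    let q: u64 = ((1u64 << 16) - 1) * (1 << 46) + 1;
    assert_eq!(q, 4611615649683210241);
    assert_eq!((q - 1).trailing_zeros(), 46); // 2-adicity
    assert_eq!((q - 1) >> 46, 65535); // = 3 * 5 * 17 * 257
}
```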

I added (2^16 - 1) * 2^46 + 1 along with Proth29 to the benchmark. Here are the results on my M1:

Benchmark results:

```
Field: Crandall
Serial: average field mul: 4.51690975ns
Interleaved: average field mul: 0.9451840138333333ns

Field: Goldilocks
Serial: average field mul: 4.212927083ns
Interleaved: average field mul: 1.1893468193333334ns

Field: Proth5
Serial: average field mul: 3.10666875ns
Interleaved: average field mul: 0.8130153401666667ns

Field: Proth29
Serial: average field mul: 3.190831083ns
Interleaved: average field mul: 0.7355682568333334ns

Field: ProthField65535
Serial: average field mul: 2.996222ns
Interleaved: average field mul: 0.8194518958333333ns
```

And the assembly for each:

ProthField5 mul:

```asm
umulh x8, x1, x0
mul x9, x1, x0
and x10, x9, #0x7fffffffffffff
extr x8, x8, x9, #55
add x9, x10, x10, lsl #2
subs x8, x9, x8
mov x9, #1
movk x9, #640, lsl #48
csel x9, x9, xzr, lo
add x0, x9, x8
ret
```

ProthField29 mul:

```asm
umulh x8, x1, x0
mul x9, x1, x0
and x10, x9, #0x1ffffffffffffff
extr x8, x8, x9, #57
mov w9, #29
mul x9, x10, x9
subs x8, x9, x8
mov x9, #1
movk x9, #640, lsl #48
csel x9, x9, xzr, lo
add x0, x9, x8
ret
```

ProthField65535 mul:

```asm
umulh x8, x1, x0
mul x9, x1, x0
and x10, x9, #0x3fffffffffff
extr x8, x8, x9, #46
lsl x9, x10, #16
sub x9, x9, x10
subs x8, x9, x8
mov x9, #1
movk x9, #640, lsl #48
csel x9, x9, xzr, lo
add x0, x9, x8
ret
```

I'm no ARM expert, but the ASM is pretty much as expected: Proth5 uses an add to multiply by 5, Proth29 uses a mov+mul (since there's no great way to multiply by 29), and Proth65535 uses a lsl+sub.

The weird thing is just that Proth29 is somehow faster for me, despite the extra mul.

My (minor) code changes are in field_benchmarks_daniel.

dlubarov commented 3 years ago

Going back to this,

> A much more speculative idea is to use a field that isn't 2-adic, like p = 2^61 - 1.

A recent paper by Ben-Sasson et al. may make this more practical. The authors are planning another paper to talk about adapting their techniques to arbitrary fields.

I haven't had much time to read it yet, but it would be interesting to see if concrete performance is really comparable to traditional FFTs.

nbgl commented 3 years ago

This was a really interesting thread to read!

The problem you bumped into with Goldilocks is that LLVM does not emit fast assembly for modular addition (i.e. the thing where we do `let (z, carry) = x.overflowing_add(y); if carry { z.overflowing_sub(FIELD_ORDER).0 } else { z }`). Crandall has the same problem, but it only does one such addition, whereas Goldilocks does two.

The code

```rust
fn reduce128(x: u128) -> GoldilocksField {
    const EPSILON: u64 = (1u64 << 32) - 1u64; // 2^64 - ORDER
    let (x_lo, x_hi) = split(x);

    let x_hi_hi = x_hi >> 32;
    let (mut t0, borrow) = x_lo.overflowing_sub(x_hi_hi);
    if borrow {
        t0 = t0.overflowing_sub(EPSILON).0;
    }

    let x_hi_lo = x_hi & EPSILON;
    let t1 = x_hi_lo * EPSILON;

    let (mut t2, carry) = t1.overflowing_add(t0);
    if carry {
        t2 = t2.overflowing_add(EPSILON).0;
    }
    GoldilocksField(t2)
}
```

indeed benchmarks slower than Crandall. Weirdly, writing

```rust
t0 = t0.overflowing_sub(EPSILON * (borrow as u64)).0;
t2 = t2.overflowing_add(EPSILON * (carry as u64)).0;
```

improves the throughput by a lot, although it’s still marginally slower than Crandall.

The latter version generates the following assembly:

```asm
mov                rdx, rbx
mulx               r10, rbx, [rbp - 192]
mov                edx, r10d
shr                r10, 32
sub                rbx, r10
mov                ecx, 0
cmovb              rcx, r12
add                rcx, rbx
imul               rdx, r13
add                rdx, rcx
mov                ebx, 0
cmovb              rbx, r13
add                rbx, rdx
```

The bottleneck here is the number of fused-domain µops, at 14. A Skylake processor can only execute 4 per cycle, limiting the throughput to 3.5 cycles per multiplication.

We can do better by doing the compiler’s job for it with inline asm:

```rust
fn reduce128(x: u128) -> GoldilocksField {
    const EPSILON: u64 = (1u64 << 32) - 1u64; // 2^64 - ORDER
    let (x_lo, x_hi) = split(x);

    let x_hi_hi = x_hi >> 32;
    let mut t0 = x_lo;
    let mut s0 = x_hi_hi;
    unsafe {
        asm!(
            "sub {0}, {1}",
            "sbb {1:e}, {1:e}",
            inlateout(reg) t0,
            inlateout(reg) s0,
            options(pure, nomem, nostack)
        );
    }
    t0 = t0.overflowing_sub(s0).0;

    let x_hi_lo = x_hi & EPSILON;
    let t1 = x_hi_lo * EPSILON;

    let mut t2 = t1;
    let mut s2 = t0;
    unsafe {
        asm!(
            "add {0}, {1}",
            "sbb {1:e}, {1:e}",
            inlateout(reg) t2,
            inlateout(reg) s2,
            options(pure, nomem, nostack)
        );
    }
    t2 = t2.overflowing_add(s2).0;

    GoldilocksField(t2)
}
```

This compiles to:

```asm
mov                rdx, r8
mulx               r10, r8, [rbp - 176]
mov                edx, r10d
shr                r10, 32
sub                r8, r10
sbb                r10d, r10d
sub                r8, r10
imul               rdx, r12
add                rdx, r8
sbb                r8d, r8d
add                r8, rdx
```

This assembly has a Goldilocks-specific trick, which is that sbb r10d, r10d sets r10 to 2^32 − 1 if the previous instruction generated a carry and to 0 otherwise. Since it so happens that EPSILON = 2^32 − 1, we can use this to conditionally add or subtract EPSILON.
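
The plain-Rust equivalent of the sbb trick is a branchless mask (sketch, for a Goldilocks-style wraparound adjustment; the result may still be >= ORDER, i.e. it's partially reduced):

```rust
const EPSILON: u64 = (1u64 << 32) - 1;

fn add_mod_order(x: u64, y: u64) -> u64 {
    let (sum, carry) = x.overflowing_add(y);
    // All-ones in the low 32 bits exactly when `carry` is set -- the same
    // value that sbb r10d, r10d produces.
    let adj = 0u32.wrapping_sub(carry as u32) as u64;
    debug_assert_eq!(adj, EPSILON * carry as u64);
    sum.wrapping_add(adj)
}
```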

This assembly has 12 fused-domain µops. The execution ports are 2×p1, 1×p5, 3×p06, 4×p0156. As there is no pressure on any particular port, the bottleneck is still the number of fused-domain µops, limiting the theoretical throughput to 3 cycles per multiplication on Skylake. Ice Lake can execute 5 fused-domain µops per cycle, so on those processors the theoretical throughput is 2.5 cycles per multiplication.

For fair comparison, here’s my best Crandall reduction code:

```rust
fn reduce128(x: u128) -> CrandallField {
    const EPSILON: u64 = 9 * (1u64 << 28) - 1; // 2^64 - ORDER
    let (lo_1, hi_1) = split(x);
    let (lo_2, hi_2) = split((EPSILON as u128) * (hi_1 as u128) + (lo_1 as u128));
    let lo_3 = hi_2 * EPSILON;

    let mut x = lo_2;
    let mut z = 0u64;
    unsafe {
        asm!(
            "add {0}, {1}",
            "cmovc {2}, {3}",
            inout(reg) x,
            in(reg) lo_3,
            inlateout(reg) z,
            in(reg) EPSILON,
            options(pure, nomem, nostack)
        );
    }
    x = x.overflowing_add(z).0;
    CrandallField(x)
}
```

It compiles to

```asm
mov                rdx, rax
mulx               rdx, rax, [rbp - 176]
mulx               rsi, rcx, r12
add                rcx, rax
adc                rsi, 0
imul               rsi, r12
xor                eax, eax
add                rcx, rsi
cmovb              rax, r12
add                rax, rcx
```

It also has 12 fused-domain µops. The execution ports are 3×p1, 2×p5, 2×p06, 3×p0156. Again, we’re limited to 3 cycles per multiplication by the fused-domain µops on Skylake. We’re also limited to 3 cycles per multiplication by the capacity of port 1. On Ice Lake, the latter does not change so our theoretical throughput still maxes out at 3 cycles per multiplication.

The stats are (TP = theoretical throughput, in cycles per multiplication):

| Method | TP / Bench (Skylake) | TP (Ice Lake) | TP (Zen 3) | TP / Bench (M1) |
|---|---|---|---|---|
| Crandall (100% Rust) | 3 / .90ns | 3 | 3 | 2.5 / .78ns |
| Crandall (inline ASM) | 3 / .87ns | 3 | 3 | - |
| Crandall (vector) | 3 / .97ns | 2.06 | 2.25 |  |
| Goldilocks (100% Rust) | 3.5 / .93ns | 3 | 2.75 | 2 / .71ns |
| Goldilocks (inline ASM) | 3 / .85ns | 2.5 | 2.25 | 1.67 / .69ns |
| Goldilocks (vector) | 2.5 / .79ns | 1.62 | 1.88 |  |

Interestingly, Goldilocks is consistently a little bit faster than Crandall on my Skylake even though they have the same theoretical throughput. I'd speculate it's because the Crandall throughput calculation assumes that port 1 is only doing multiplication, but in practice scheduling isn't perfect.

nbgl commented 3 years ago

I’ve updated the above table with throughput calculations for AMD Zen 3 and vectorized versions of both Crandall and Goldilocks.

Skylake and Zen 3 support AVX2 (256-bit vectors), whereas Ice Lake has AVX-512 (512-bit).

The main bottleneck in all the vectorized methods is the fact that AVX does not support full 64x64->128-bit multiplication, so we have to cobble it together from 32x32->64-bit multiplications. AVX-512 does have 64x64->64-bit multiplication, but that's not useful because we need the carry to form the high word. The AVX2 version of multiplication looks like this:

```rust
unsafe fn shift(x: __m256i) -> __m256i {
    _mm256_xor_si256(x, sign_bit())
}

unsafe fn mul64_64_s(x: __m256i, y: __m256i) -> (__m256i, __m256i) {
    let x_hi = _mm256_srli_epi64(x, 32);
    let y_hi = _mm256_srli_epi64(y, 32);
    let mul_ll = _mm256_mul_epu32(x, y);
    let mul_lh = _mm256_mul_epu32(x, y_hi);
    let mul_hl = _mm256_mul_epu32(x_hi, y);
    let mul_hh = _mm256_mul_epu32(x_hi, y_hi);

    let res_lo0_s = shift(mul_ll);
    let res_hi0 = mul_hh;

    let res_lo1_s = _mm256_add_epi32(res_lo0_s, _mm256_slli_epi64(mul_lh, 32));
    let res_hi1 = _mm256_sub_epi64(res_hi0, _mm256_cmpgt_epi64(res_lo0_s, res_lo1_s)); // Carry.

    let res_lo2_s = _mm256_add_epi32(res_lo1_s, _mm256_slli_epi64(mul_hl, 32));
    let res_hi2 = _mm256_sub_epi64(res_hi1, _mm256_cmpgt_epi64(res_lo1_s, res_lo2_s)); // Carry.

    let res_hi3 = _mm256_add_epi64(res_hi2, _mm256_srli_epi64(mul_lh, 32));
    let res_hi4 = _mm256_add_epi64(res_hi3, _mm256_srli_epi64(mul_hl, 32));

    (res_hi4, res_lo2_s)
}
```

There’s a few tricks to note here:

  1. We don’t need to clear the high doubleword before performing multiplication, because it gets ignored anyway.
  2. _mm256_cmpgt_epi64 returns -1 on true and 0 otherwise; hence the carry is subtracted, not added to the high word.
  3. AVX2 does not have unsigned compares for 64-bit integers, so we use signed compares on numbers that have been shifted by 2^63 (a trick from Hacker's Delight; see the scalar sketch after this list). `shift` adds 2^63, which is then propagated through the adds so we don't have to do it multiple times, and undone at the end. AVX-512 does have unsigned 64-bit compares, so we don't need this trick and we save 2 instructions.
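
In scalar terms, trick 3 is just (my illustration):

```rust
/// Unsigned x < y is the same as a signed comparison after biasing both
/// operands by 2^63 -- the Hacker's Delight trick the vector code uses.
fn lt_unsigned_via_signed(x: u64, y: u64) -> bool {
    const BIAS: u64 = 1 << 63;
    ((x ^ BIAS) as i64) < ((y ^ BIAS) as i64)
}

fn main() {
    assert!(lt_unsigned_via_signed(1, u64::MAX));
    assert!(!lt_unsigned_via_signed(u64::MAX, 1));
}
```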

Goldilocks multiplication is then implemented (nb: this code is not tested) as

```rust
unsafe fn sub_modulo_64s_64_s(x_s: __m256i, y: __m256i) -> __m256i {
    let t0_s = _mm256_sub_epi64(x_s, y);
    let carry_mask = _mm256_cmpgt_epi64(t0_s, x_s);
    let adj = _mm256_and_si256(carry_mask, epsilon());
    let t1_s = _mm256_sub_epi64(t0_s, adj);
    t1_s
}

unsafe fn add_modulo_64s_64_s(x_s: __m256i, y: __m256i) -> __m256i {
    let t0_s = _mm256_add_epi64(x_s, y);
    let carry_mask = _mm256_cmpgt_epi64(x_s, t0_s);
    let adj = _mm256_and_si256(carry_mask, epsilon());
    let t1_s = _mm256_add_epi64(t0_s, adj);
    t1_s
}

unsafe fn reduce128s_s(x_s: (__m256i, __m256i)) -> __m256i {
    let (hi0, lo0_s) = x_s;
    let hi_hi0 = _mm256_srli_epi64(hi0, 32);
    let lo1_s = sub_modulo_64s_64_s(lo0_s, hi_hi0);
    let t1 = _mm256_mul_epu32(hi0, epsilon());
    let lo2_s = add_modulo_64s_64_s(lo1_s, t1);
    lo2_s
}

impl GoldilocksFieldVec {
    pub unsafe fn mul(x: __m256i, y: __m256i) -> __m256i {
        shift(reduce128s_s(mul64_64_s(x, y)))
    }
}
```

It involves one more 32x32->64-bit multiplication because it’s cheaper than shift + subtract. But it does exploit Goldilocks’ nice properties.

Crandall is a bit longer:

```rust
unsafe fn add_with_carry128_64s_s(x: (__m256i, __m256i), y_s: __m256i) -> (__m256i, __m256i) {
    let (x_hi, x_lo) = x;
    let res_lo_s = _mm256_add_epi64(x_lo, y_s);
    let carry = _mm256_cmpgt_epi64(y_s, res_lo_s);
    let res_hi = _mm256_sub_epi64(x_hi, carry);
    (res_hi, res_lo_s)
}

unsafe fn add_with_carry128s_64_s(x_s: (__m256i, __m256i), y: __m256i) -> (__m256i, __m256i) {
    let (x_hi, x_lo_s) = x_s;
    let res_lo_s = _mm256_add_epi64(x_lo_s, y);
    let carry = _mm256_cmpgt_epi64(x_lo_s, res_lo_s);
    let res_hi = _mm256_sub_epi64(x_hi, carry);
    (res_hi, res_lo_s)
}

unsafe fn fmadd_64_32_64s_s(x: __m256i, y: __m256i, z_s: __m256i) -> (__m256i, __m256i) {
    let x_hi = _mm256_srli_epi64(x, 32);
    let mul_lo = _mm256_mul_epu32(x, y);
    let mul_hi = _mm256_mul_epu32(x_hi, y);
    let tmp_s = add_with_carry128_64s_s(
        (_mm256_srli_epi64(mul_hi, 32), _mm256_slli_epi64(mul_hi, 32)),
        z_s,
    );
    add_with_carry128s_64_s(tmp_s, mul_lo)
}

unsafe fn reduce128s_s(x_s: (__m256i, __m256i)) -> __m256i {
    let (hi0, lo0_s) = x_s;
    let (hi1, lo1_s) = fmadd_64_32_64s_s(hi0, epsilon(), lo0_s);
    let lo2 = _mm256_mul_epu32(hi1, epsilon());
    let res_wrapped_s = _mm256_add_epi64(lo1_s, lo2);
    let carry_mask = _mm256_cmpgt_epi64(lo1_s, res_wrapped_s); // all 1 if overflow
    let res_s = _mm256_add_epi64(res_wrapped_s, _mm256_and_si256(carry_mask, epsilon()));
    res_s
}

impl CrandallFieldVec {
    pub unsafe fn mul(x: __m256i, y: __m256i) -> __m256i {
        shift(reduce128s_s(mul64_64_s(x, y)))
    }
}
```

The full Goldilocks multiplication consists of 30 instructions. With 3 AVX2 execution ports, Skylake runs it at 10 cycles per vector or 2.5 cycles per word. The Crandall multiplication is 36 instructions for a throughput of 12 cycles per vector, or 3 per word.

The AVX-512 versions are analogous. We save two instructions by not having to shift the low words by 2^63 to perform comparisons. We can also use masked addition to save an instruction every time we perform addition modulo the prime order (i.e. avoiding the and in `_mm256_add_epi64(res_wrapped_s, _mm256_and_si256(carry_mask, epsilon()))`).
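
For illustration, the masked-add version of the modular addition might look like this (untested sketch with AVX-512F intrinsics; operands are unshifted, since unsigned compares are available):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Add modulo the Goldilocks order: on lanes that wrapped mod 2^64, add
/// EPSILON = 2^32 - 1 back in, using a mask register instead of an and.
#[cfg(target_arch = "x86_64")]
unsafe fn add_modulo_64_64_avx512(x: __m512i, y: __m512i) -> __m512i {
    let epsilon = _mm512_set1_epi64((1i64 << 32) - 1);
    let t0 = _mm512_add_epi64(x, y);
    let wrapped = _mm512_cmplt_epu64_mask(t0, x); // unsigned compare, no bias trick
    _mm512_mask_add_epi64(t0, wrapped, t0, epsilon)
}
```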

The AVX-512 version of Goldilocks has 26 instructions; Ice Lake has 2 AVX-512 execution ports for a throughput of 13 cycles per vector or 1.625 cycles per word. Crandall has 33 instructions, meaning 16.5 cycles per vector or 2.06 per word.

In all those cases, full multiplication is responsible for 18 out of the 30/36/26/33 instructions.

Lastly, I’m convinced that the very fastest way to implement multiplication is neither purely-scalar nor purely-vector. x86 scalar multiplication is significantly faster than anything I can do in AVX2 (although reduction is faster in vector). I would speculate that the fastest implementation tuned for Skylake would do the full 64x64->128-bit multiply in scalar and then the reduction in vector. This holds to a lesser degree on Ice Lake as well: AVX-512 64-bit full multiply is only marginally slower than scalar, but they also compete only partially, so performing some full multiplications in scalar should increase parallelism and yield more throughput. (And yes, combining domains like that is insanely ugly and means we’re optimizing not just for a specific architecture, but a specific CPU family.)

dlubarov commented 3 years ago

Thanks Jakub - very useful info. Is the data based on theoretical or observed throughput? (Or is AVX performance predictable enough that the difference isn't important?)

As Jakub pointed out, the Crandall reduction could also work with the Goldilocks field, with the same number of steps. So Goldilocks should be at least as fast on any hardware. The only downside is that Goldilocks doesn't have many options for permutation monomials (for example, x^3 is not a permutation in Goldilocks, since p - 1 has a factor of 3). But I think that's fine, since we're planning to use Poseidon with x^7 anyway, which is a permutation in both fields.

The mixed scalar and vector approach sounds interesting. Might be good to benchmark at some point, so we could see if the improvement is significant enough to justify the weird code. But in the meantime, those AVX numbers sound really good already.

Btw, I wonder how we should select among implementations for a given CPU? We could use compile-time guards like #[target_feature(enable = "avx2")], but I don't think there's an option for AVX-512. We could use is_x86_feature_detected to check for AVX-512, but that's a runtime check. We could always add one or more features to control it, though it's not ideal since we wouldn't be inferring the best option automatically.
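
One possible shape for the dispatch (sketch; `mul_avx512`, `mul_avx2`, and `mul_scalar` are hypothetical backends), with the runtime check amortized over a whole batch:

```rust
fn batch_mul(out: &mut [u64], xs: &[u64], ys: &[u64]) {
    // mul_avx512 / mul_avx2 / mul_scalar are hypothetical backends.
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f") {
            return unsafe { mul_avx512(out, xs, ys) };
        }
        if is_x86_feature_detected!("avx2") {
            return unsafe { mul_avx2(out, xs, ys) };
        }
    }
    mul_scalar(out, xs, ys)
}
```

The feature check caches its result, so the per-call overhead is small as long as the dispatch sits outside the inner loop.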

nbgl commented 3 years ago

> Is the data based on theoretical or observed throughput? (Or is AVX performance predictable enough that the difference isn't important?)

All throughput numbers are theoretical, but the timings (only for Skylake for now) are observed.

The theoretical throughput numbers are highly indicative of real-world performance, although the correlation isn’t 100%. Also, they measure slightly different things: multiplications per cycle ≠ multiplications per unit time because the instructions that are running can affect things like turbo boost.

nbgl commented 3 years ago

I've added throughput calculations for and benchmarks for the M1. I used these instruction tables to derive the theoretical throughput numbers. They do not fully match the benchmarks, so take them with a grain of salt. (Also, while the M1 has a lot of throughput, its latencies are the same, so more unrolling is necessary to max out the speeds. Eventually you run out of registers…)

The one ASM modification I made to speed up Goldilocks is using a mul instead of a mask, shift, and subtract.

dlubarov commented 3 years ago

I think we're happy with Goldilocks so I'm going to close this for now.

My only hesitation is that at some point, we might want to have lower-degree constraints, like a STARK with degree-3 constraints (enabling constraint evaluation over a much smaller subgroup, of size 2n rather than 8n). That wouldn't work well with Goldilocks since its smallest permutation monomial is x^7.

But that's not in our immediate plans, more of a possible future exploration.