arkworks-rs / snark

Interfaces for Relations and SNARKs for these relations
https://www.arkworks.rs
Apache License 2.0

Goff for mul_assign + field implementation macro + cheaper into_repr #163

Closed · jon-chuang closed this 4 years ago

jon-chuang commented 4 years ago

Added goff (for mul_assign only) + macros to simplify the code. I have not added the requirements on the prime fields, nor an alternate Montgomery multiplication for primes not satisfying those requirements (https://hackmd.io/@zkteam/modular_multiplication).

jon-chuang commented 4 years ago

I can't fix the current bug; it seems to be related to trying to unwrap a value that evaluates to None.

Edit: the bug I was referring to was the DPC test failure.

jon-chuang commented 4 years ago

Are the rustfmt error (Travis) and the DPC integration test error (local) to be ignored? I get the following locally:

```
running 1 test
test dpc::plain_dpc::test::test_execute_constraint_systems ... FAILED

failures:

---- dpc::plain_dpc::test::test_execute_constraint_systems stdout ----
thread 'dpc::plain_dpc::test::test_execute_constraint_systems' panicked at 'called `Option::unwrap()` on a `None` value', /rustc/5e1a799842ba6ed4a57e91f7ab9435947482f7d8/src/libcore/macros/mod.rs:15:40
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
```

Pratyush commented 4 years ago

Hmm I'll investigate this test in more detail.

Pratyush commented 4 years ago

That test is failing also on master, so it's not an issue with this PR.

jon-chuang commented 4 years ago

Ok, will it be merged then?

Pratyush commented 4 years ago

Yeah, I'll fix the test in another PR. Adding another couple of comments here in a second.

Pratyush commented 4 years ago

I ran some benchmarks comparing this branch to master, and it seems that the differences are mostly noise:

Benchmarks

```
name  zexe.bench ns/iter  goff.bench ns/iter  diff ns/iter  diff %  speedup
bls12_377::fq::bench_fq_from_repr 63 61 -2 -3.17% x 1.03
bls12_377::fr::bench_fr_add_assign 7 7 0 0.00% x 1.00
bls12_377::fr::bench_fr_double 6 6 0 0.00% x 1.00
bls12_377::fr::bench_fr_from_repr 33 33 0 0.00% x 1.00
bls12_377::fr::bench_fr_into_repr 20 19 -1 -5.00% x 1.05
bls12_377::fr::bench_fr_inverse 5,109 5,178 69 1.35% x 0.99
bls12_377::fr::bench_fr_mul_assign 33 32 -1 -3.03% x 1.03
bls12_377::fr::bench_fr_negate 8 8 0 0.00% x 1.00
bls12_377::fr::bench_fr_repr_add_nocarry 7 7 0 0.00% x 1.00
bls12_377::fr::bench_fr_repr_div2 4 4 0 0.00% x 1.00
bls12_377::fr::bench_fr_repr_mul2 8 8 0 0.00% x 1.00
bls12_377::fr::bench_fr_repr_num_bits 4 3 -1 -25.00% x 1.33
bls12_377::fr::bench_fr_repr_sub_noborrow 8 8 0 0.00% x 1.00
bls12_377::fr::bench_fr_sqrt 31,352 33,268 1,916 6.11% x 0.94
bls12_377::fr::bench_fr_square 32 32 0 0.00% x 1.00
bls12_377::fr::bench_fr_sub_assign 7 7 0 0.00% x 1.00
bls12_381::fq::bench_fq_from_repr 64 61 -3 -4.69% x 1.05
bls12_381::fr::bench_fr_add_assign 7 7 0 0.00% x 1.00
bls12_381::fr::bench_fr_double 6 6 0 0.00% x 1.00
bls12_381::fr::bench_fr_from_repr 33 34 1 3.03% x 0.97
bls12_381::fr::bench_fr_into_repr 21 19 -2 -9.52% x 1.11
bls12_381::fr::bench_fr_inverse 5,216 5,159 -57 -1.09% x 1.01
bls12_381::fr::bench_fr_mul_assign 34 33 -1 -2.94% x 1.03
bls12_381::fr::bench_fr_negate 8 8 0 0.00% x 1.00
bls12_381::fr::bench_fr_repr_add_nocarry 7 7 0 0.00% x 1.00
bls12_381::fr::bench_fr_repr_div2 4 4 0 0.00% x 1.00
bls12_381::fr::bench_fr_repr_mul2 8 8 0 0.00% x 1.00
bls12_381::fr::bench_fr_repr_num_bits 4 3 -1 -25.00% x 1.33
bls12_381::fr::bench_fr_repr_sub_noborrow 8 8 0 0.00% x 1.00
bls12_381::fr::bench_fr_sqrt 29,472 28,841 -631 -2.14% x 1.02
bls12_381::fr::bench_fr_square 34 34 0 0.00% x 1.00
bls12_381::fr::bench_fr_sub_assign 7 8 1 14.29% x 0.88
sw6::fq::bench_fq_from_repr 289 304 15 5.19% x 0.95
sw6::fr::bench_fr_add_assign 11 10 -1 -9.09% x 1.10
sw6::fr::bench_fr_double 14 14 0 0.00% x 1.00
sw6::fr::bench_fr_from_repr 63 61 -2 -3.17% x 1.03
sw6::fr::bench_fr_into_repr 35 32 -3 -8.57% x 1.09
sw6::fr::bench_fr_inverse 17,665 17,592 -73 -0.41% x 1.00
sw6::fr::bench_fr_mul_assign 62 60 -2 -3.23% x 1.03
sw6::fr::bench_fr_negate 14 14 0 0.00% x 1.00
sw6::fr::bench_fr_repr_add_nocarry 10 10 0 0.00% x 1.00
sw6::fr::bench_fr_repr_div2 6 6 0 0.00% x 1.00
sw6::fr::bench_fr_repr_mul2 13 13 0 0.00% x 1.00
sw6::fr::bench_fr_repr_num_bits 4 4 0 0.00% x 1.00
sw6::fr::bench_fr_repr_sub_noborrow 11 11 0 0.00% x 1.00
sw6::fr::bench_fr_sqrt 79,210 83,442 4,232 5.34% x 0.95
sw6::fr::bench_fr_square 55 59 4 7.27% x 0.93
sw6::fr::bench_fr_sub_assign 15 15 0 0.00% x 1.00
```

Some of the difference could stem from the fact that the actual implementation by the goff folks has special cases for the bottom-most and top-most limbs, and also for the cases where some carries are zero: https://github.com/ConsenSys/goff/blob/master/cmd/internal/template/element/mul_nocarry.go
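
For concreteness, a minimal sketch (illustrative only; the helper names are mine, and this is neither the goff code nor what this PR generates) of the kind of special-casing meant above: the generic multiply-accumulate step versus the cheaper step usable when the accumulator limb is known to be zero, e.g. on the bottom-most limb.

```rust
/// Generic step: a + b * c + carry, returned as (low limb, new carry).
/// The sum always fits in 128 bits, so no overflow is possible here.
#[inline(always)]
fn mac_with_carry(a: u64, b: u64, c: u64, carry: u64) -> (u64, u64) {
    let wide = (a as u128) + (b as u128) * (c as u128) + (carry as u128);
    (wide as u64, (wide >> 64) as u64)
}

/// Special case for an iteration where the accumulator limb is known to be
/// zero (e.g. the bottom-most limb): one 128-bit addition is saved.
#[inline(always)]
fn mul_with_carry(b: u64, c: u64, carry: u64) -> (u64, u64) {
    let wide = (b as u128) * (c as u128) + (carry as u128);
    (wide as u64, (wide >> 64) as u64)
}
```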

jon-chuang commented 4 years ago

I did try to implement such special cases but did not get any significant difference in performance. My thought was that these special cases, where particular values are 0, were already being optimised away by the compiler. However, it is clear that with the optimisations I applied, into_repr is about 1.1x faster.

I may try again. However, before that, I'd like to implement some macros to generate the 1000-fold iter benchmarks I need.

Furthermore, it's not clear that the goff optimisation would have produced faster code, since the original code was SOS (I think), while goff improved upon CIOS.

What is the command to run the comparative benchmark?

Pratyush commented 4 years ago

Hmm it also seems that for the case of sw6::Fq, there is a ~10% slowdown in performance:

sw6::Fq

```
sw6::fq::bench_fq_double 21 21 0 0.00% x 1.00
sw6::fq::bench_fq_from_repr 273 304 31 11.36% x 0.90
sw6::fq::bench_fq_into_repr 151 136 -15 -9.93% x 1.11
sw6::fq::bench_fq_inverse 40,401 39,690 -711 -1.76% x 1.02
sw6::fq::bench_fq_mul_assign 270 304 34 12.59% x 0.89
sw6::fq::bench_fq_negate 28 28 0 0.00% x 1.00
sw6::fq::bench_fq_repr_add_nocarry 17 17 0 0.00% x 1.00
sw6::fq::bench_fq_repr_div2 7 7 0 0.00% x 1.00
sw6::fq::bench_fq_repr_mul2 13 13 0 0.00% x 1.00
sw6::fq::bench_fq_repr_num_bits 3 3 0 0.00% x 1.00
sw6::fq::bench_fq_repr_sub_noborrow 19 19 0 0.00% x 1.00
sw6::fq::bench_fq_sqrt 545,876 598,329 52,453 9.61% x 0.91
sw6::fq::bench_fq_square 229 254 25 10.92% x 0.90
sw6::fq::bench_fq_sub_assign 21 21 0 0.00% x 1.00
```

The blog post indicates that FIPS/CIOS is actually faster when the number of limbs is > 11. It seems that in the Go implementation, they conditionally use the new impl only for fields where num_limbs < 11: https://github.com/ConsenSys/goff/blob/master/cmd/internal/template/element/mul.go

We can achieve something similar as follows:

```rust
macro_rules! impl_field_mul_in_place {
    ($limbs:expr, OPT) => {
        // goff impl
    };
    ($limbs:expr, FIPS) => {
        // old impl
    };
}
```

And then we use the appropriate option for each field size.
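
For example (the invocations and limb counts here are illustrative, not taken from the PR), usage could then look like:

```rust
impl_field_mul_in_place!(4, OPT);   // 4-limb fields such as bls12_381::Fr
impl_field_mul_in_place!(6, OPT);   // 6-limb fields such as bls12_377::Fq
impl_field_mul_in_place!(13, FIPS); // the 13-limb sw6::Fq keeps the old impl
```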

Pratyush commented 4 years ago

Thanks for the thorough work on this =)

jon-chuang commented 4 years ago

Dear Pratyush, contrary to your results, I obtained a consistent 6-8% improvement on all mul_assigns. Here I have selected for diffs greater than 5% in magnitude; all six mul_assigns have a diff below -5.7%. This is consistent with the results I reported previously.

I am running an Intel i7-7700HQ (2.8 GHz base, 3.8 GHz Turbo).

Benchmark

```
name  zexe-old ns/iter  variable ns/iter  diff ns/iter  diff %  speedup
bls12_377::fq::bench_fq_add_assign 11 10 -1 -9.09% x 1.10
bls12_377::fq::bench_fq_double 15 14 -1 -6.67% x 1.07
bls12_377::fq::bench_fq_from_repr 66 59 -7 -10.61% x 1.12
bls12_377::fq::bench_fq_into_repr 37 34 -3 -8.11% x 1.09
bls12_377::fq::bench_fq_inverse 18,851 20,212 1,361 7.22% x 0.93
bls12_377::fq::bench_fq_mul_assign 65 60 -5 -7.69% x 1.08
bls12_377::fq::bench_fq_repr_add_nocarry 10 11 1 10.00% x 0.91
bls12_377::fq::bench_fq_repr_mul2 12 13 1 8.33% x 0.92
bls12_377::fq::bench_fq_sqrt 83,874 89,156 5,282 6.30% x 0.94
bls12_377::fq::bench_fq_square 57 60 3 5.26% x 0.95
bls12_377::fq::bench_fq_sub_assign 12 11 -1 -8.33% x 1.09
bls12_377::fr::bench_fr_from_repr 35 37 2 5.71% x 0.95
bls12_377::fr::bench_fr_mul_assign 35 33 -2 -5.71% x 1.06
bls12_377::fr::bench_fr_negate 8 9 1 12.50% x 0.89
bls12_377::fr::bench_fr_repr_num_bits 3 4 1 33.33% x 0.75
bls12_377::fr::bench_fr_sqrt 32,746 35,342 2,596 7.93% x 0.93
bls12_381::fq::bench_fq_add_assign 11 10 -1 -9.09% x 1.10
bls12_381::fq::bench_fq_double 15 16 1 6.67% x 0.94
bls12_381::fq::bench_fq_from_repr 67 61 -6 -8.96% x 1.10
bls12_381::fq::bench_fq_into_repr 37 32 -5 -13.51% x 1.16
bls12_381::fq::bench_fq_mul_assign 68 64 -4 -5.88% x 1.06
bls12_381::fq::bench_fq_negate 14 15 1 7.14% x 0.93
bls12_381::fq::bench_fq_repr_add_nocarry 10 11 1 10.00% x 0.91
bls12_381::fq::bench_fq_repr_div2 7 6 -1 -14.29% x 1.17
bls12_381::fq::bench_fq_square 56 62 6 10.71% x 0.90
bls12_381::fq::bench_fq_sub_assign 12 11 -1 -8.33% x 1.09
bls12_381::fr::bench_fr_mul_assign 35 33 -2 -5.71% x 1.06
bls12_381::fr::bench_fr_negate 8 7 -1 -12.50% x 1.14
bls12_381::fr::bench_fr_repr_num_bits 3 4 1 33.33% x 0.75
sw6::fq::bench_fq_add_assign 20 22 2 10.00% x 0.91
sw6::fq::bench_fq_into_repr 152 134 -18 -11.84% x 1.13
sw6::fq::bench_fq_mul_assign 274 251 -23 -8.39% x 1.09
sw6::fq::bench_fq_repr_sub_noborrow 18 17 -1 -5.56% x 1.06
sw6::fq::bench_fq_square 230 249 19 8.26% x 0.92
sw6::fq::bench_fq_sub_assign 20 19 -1 -5.00% x 1.05
sw6::fr::bench_fr_add_assign 11 10 -1 -9.09% x 1.10
sw6::fr::bench_fr_into_repr 36 32 -4 -11.11% x 1.12
sw6::fr::bench_fr_mul_assign 62 57 -5 -8.06% x 1.09
sw6::fr::bench_fr_square 57 63 6 10.53% x 0.90
```

jon-chuang commented 4 years ago

I checked with cargo-expand. It turns out that the way the macro was written was preventing `unroll` from unrolling some of the loops in the squaring. After making some changes, it's now faster.
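
For context, a small sketch of the mechanism involved (assuming the `unroll` crate's `#[unroll_for_loops]` attribute is what's used here; it only unrolls `for` loops whose bounds are integer literals after expansion, which is exactly what cargo-expand lets you verify):

```rust
use unroll::unroll_for_loops;

// The loop bound must be a literal for the attribute to unroll it; if the
// surrounding macro expands to a non-literal bound, the loop stays rolled.
#[unroll_for_loops]
fn sum_limbs(x: &[u64; 6]) -> u64 {
    let mut acc = 0u64;
    for i in 0..6 {
        acc = acc.wrapping_add(x[i]);
    }
    acc
}
```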

Pratyush commented 4 years ago

This looks great! Thanks for investigating! Quick question: do you think it makes sense to implement the goff faster squaring code in this PR too? Or do you think it makes sense to push that off to a future PR?

I think we can have an implementation supporting all moduli as follows:

```rust
fn square_in_place(&mut self) -> &mut Self {
    // I believe this is the correct condition for the goff algorithm.
    if $FpParameters::MODULUS.0[$limbs - 1] >> 62 == 0 {
        // Fast goff impl
    } else {
        // Existing impl
    }
    self
}
```

The compiler should be able to optimize out the unused branch.

jon-chuang commented 4 years ago

My naive attempt at writing the goff squaring resulted in a much slower implementation, and I don't think I got it to pass the tests either. Besides, I believe our current squarings are already much faster than goff's to begin with.

Further, what goff found for more than 11 limbs does not quite apply to our implementation.

Personally, I am satisfied with what has been done for this PR.

Pratyush commented 4 years ago

Alright, I'll merge this PR then. Thanks!

jon-chuang commented 4 years ago

Hi Pratyush, in hindsight, I decided to add the modulus checks (the check is whether the most significant bit is 1, or, failing that, whether the remaining bits are all set). I benchmarked it and it's working tip-top.
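
For reference, a minimal sketch of that check as I read it (the name and shape are illustrative, not the code in this PR):

```rust
/// Returns true if the no-carry Montgomery multiplication can be used:
/// the modulus' most significant bit must be 0, and the remaining bits
/// must not all be set.
fn no_carry_optimisation_applies(modulus: &[u64]) -> bool {
    let top = modulus[modulus.len() - 1];
    let top_bit_is_zero = top >> 63 == 0;
    let mut all_remaining_bits_set = top == u64::MAX >> 1;
    for limb in &modulus[..modulus.len() - 1] {
        all_remaining_bits_set &= *limb == u64::MAX;
    }
    top_bit_is_zero && !all_remaining_bits_set
}
```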

I just haven't written any tests to check that it works for primes that cannot be optimised with the no-carry trick. I think I should try it on a pseudo-Mersenne prime. But I've done a few sanity checks to ensure it's doing the right thing.

By the way, is there any functionality to do primality checking at compile time? If not, I don't mind adding it using elliptic-curve primality testing. Relatedly, is there a framework for generating primes?

It's just additional functionality, since the elliptic-curve framework is already there. I am not sure, however, how to do a compile-time check that does not bloat the binary.