jon-chuang closed this 4 years ago
I can't fix the current bug; it seems to be related to trying to unwrap a value that evaluates to `None`.
Edit: the bug I was referring to was the DPC test failure.
Are the rustfmt error (on Travis) and the DPC integration test error (local) to be ignored? I get the following locally:
```
running 1 test
test dpc::plain_dpc::test::test_execute_constraint_systems ... FAILED

failures:

---- dpc::plain_dpc::test::test_execute_constraint_systems stdout ----
thread 'dpc::plain_dpc::test::test_execute_constraint_systems' panicked at 'called `Option::unwrap()` on a `None` value', /rustc/5e1a799842ba6ed4a57e91f7ab9435947482f7d8/src/libcore/macros/mod.rs:15:40
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
```
Hmm I'll investigate this test in more detail.
That test is failing also on master, so it's not an issue with this PR.
Ok, will it be merged then?
Yeah, I'll fix the test in another PR. Adding another couple of comments here in a second.
I ran some benchmarks comparing this branch to master, and it seems that the differences are mostly noise:
Some of the difference could stem from the fact that the actual implementation by the goff folks has special cases for the bottom-most and top-most limbs, and also for the cases where some carries are zero: https://github.com/ConsenSys/goff/blob/master/cmd/internal/template/element/mul_nocarry.go
I did try implementing such special cases but did not get any significant difference in performance. My thinking was that the special cases where particular values are 0 were already being optimised away by the compiler. It is clear, however, that with the optimisations I applied, `into_repr` is about 1.1x faster.
I may try again. However, before that, I'd like to implement some macros to generate the 1000-fold iter benchmarks I need.
Furthermore, it's not clear that the goff optimisation would have produced faster code, since the original code was SOS (I think), while goff improved upon CIOS.
What is the command to run the comparative benchmark?
Hmm, it also seems that for `sw6::Fq` there is a ~10% slowdown in performance:
The blog post indicates that FIPS/CIOS is actually faster when the number of limbs is greater than 11. It seems that in the Go implementation, they conditionally use the new implementation only for fields where num_limbs < 11: https://github.com/ConsenSys/goff/blob/master/cmd/internal/template/element/mul.go
We can achieve something similar as follows:
```rust
macro_rules! impl_field_mul_in_place {
    ($limbs:expr, OPT) => {
        // goff impl
    };
    ($limbs:expr, FIPS) => {
        // old impl
    };
}
```
And then we use the appropriate option for each field size.
Thanks for the thorough work on this =)
Dear Pratyush, contrary to your results, I obtained a consistent 6-8% improvement on all `mul_assign`s. Here I selected for diffs greater than 5%; all 6 `mul_assign`s have a negative diff below -5.7%. This is consistent with the results I reported previously.
I am running an Intel i7-7700HQ (2.8 GHz base, 3.8 GHz Turbo).
```
name                                      zexe-old ns/iter  variable ns/iter  diff ns/iter  diff %   speedup
bls12_377::fq::bench_fq_add_assign        11                10                -1            -9.09%   x 1.10
bls12_377::fq::bench_fq_double            15                14                -1            -6.67%   x 1.07
bls12_377::fq::bench_fq_from_repr         66                59                -7            -10.61%  x 1.12
bls12_377::fq::bench_fq_into_repr         37                34                -3            -8.11%   x 1.09
bls12_377::fq::bench_fq_inverse           18,851            20,212            1,361         7.22%    x 0.93
bls12_377::fq::bench_fq_mul_assign        65                60                -5            -7.69%   x 1.08
bls12_377::fq::bench_fq_repr_add_nocarry  10                11                1             10.00%   x 0.91
bls12_377::fq::bench_fq_repr_mul2         12                13                1             8.33%    x 0.92
bls12_377::fq::bench_fq_sqrt              83,874            89,156            5,282         6.30%    x 0.94
bls12_377::fq::bench_fq_square            57                60                3             5.26%    x 0.95
bls12_377::fq::bench_fq_sub_assign        12                11                -1            -8.33%   x 1.09
bls12_377::fr::bench_fr_from_repr         35                37                2             5.71%    x 0.95
bls12_377::fr::bench_fr_mul_assign        35                33                -2            -5.71%   x 1.06
bls12_377::fr::bench_fr_negate            8                 9                 1             12.50%   x 0.89
bls12_377::fr::bench_fr_repr_num_bits     3                 4                 1             33.33%   x 0.75
bls12_377::fr::bench_fr_sqrt              32,746            35,342            2,596         7.93%    x 0.93
bls12_381::fq::bench_fq_add_assign        11                10                -1            -9.09%   x 1.10
bls12_381::fq::bench_fq_double            15                16                1             6.67%    x 0.94
bls12_381::fq::bench_fq_from_repr         67                61                -6            -8.96%   x 1.10
bls12_381::fq::bench_fq_into_repr         37                32                -5            -13.51%  x 1.16
bls12_381::fq::bench_fq_mul_assign        68                64                -4            -5.88%   x 1.06
bls12_381::fq::bench_fq_negate            14                15                1             7.14%    x 0.93
bls12_381::fq::bench_fq_repr_add_nocarry  10                11                1             10.00%   x 0.91
bls12_381::fq::bench_fq_repr_div2         7                 6                 -1            -14.29%  x 1.17
bls12_381::fq::bench_fq_square            56                62                6             10.71%   x 0.90
bls12_381::fq::bench_fq_sub_assign        12                11                -1            -8.33%   x 1.09
bls12_381::fr::bench_fr_mul_assign        35                33                -2            -5.71%   x 1.06
bls12_381::fr::bench_fr_negate            8                 7                 -1            -12.50%  x 1.14
bls12_381::fr::bench_fr_repr_num_bits     3                 4                 1             33.33%   x 0.75
sw6::fq::bench_fq_add_assign              20                22                2             10.00%   x 0.91
sw6::fq::bench_fq_into_repr               152               134               -18           -11.84%  x 1.13
sw6::fq::bench_fq_mul_assign              274               251               -23           -8.39%   x 1.09
sw6::fq::bench_fq_repr_sub_noborrow       18                17                -1            -5.56%   x 1.06
sw6::fq::bench_fq_square                  230               249               19            8.26%    x 0.92
sw6::fq::bench_fq_sub_assign              20                19                -1            -5.00%   x 1.05
sw6::fr::bench_fr_add_assign              11                10                -1            -9.09%   x 1.10
sw6::fr::bench_fr_into_repr               36                32                -4            -11.11%  x 1.12
sw6::fr::bench_fr_mul_assign              62                57                -5            -8.06%   x 1.09
sw6::fr::bench_fr_square                  57                63                6             10.53%   x 0.90
```
I checked with cargo-expand. It turns out that the way the macro was written was preventing `unroll` from unrolling some of the loops for the squaring. After making some changes, it's now faster.
This looks great! Thanks for investigating! Quick question: do you think it makes sense to implement the faster goff squaring code in this PR too, or should that be pushed off to a future PR?
I think we can have an implementation supporting all moduli as follows:
```rust
fn square_in_place(&mut self) -> &mut Self {
    // I believe this is the correct condition for the goff algorithm
    if $FpParameters::MODULUS.0[$limbs - 1] >> 62 == 0 {
        // fast goff impl
    } else {
        // existing impl
    }
}
```
The compiler should be able to optimize out the unused branch.
My naive attempt at writing the goff squaring resulted in a much slower implementation, and I don't think I got it to pass the tests either. In the first place, I believe our current squarings are already much faster than goff's.
Further, what goff found for more than 11 limbs does not quite apply to our implementation.
Personally, I am satisfied with what has been done for this PR.
Alright, I'll merge this PR then. Thanks!
Hi Pratyush, in hindsight, I decided to add modulus checks (it checks whether the last bit is 1, or, failing that, whether the remaining bits are all set). I benchmarked it and it's working tip-top.
I just haven't written any tests checking that it works for primes that cannot be optimised with no-carry; I think I should try it on a pseudo-Mersenne prime. But I've done a few sanity checks that ensure it's doing the right thing.
By the way, is there any functionality to do primality checking at compile time? If not, I don't mind adding it using elliptic-curve primality testing. Also, is there a framework for generating primes?
It's just additional functionality, since the elliptic-curve framework is already there. I'm not sure how to do a compile-time check that doesn't bloat the binary, however.
Added goff (for `mul_assign` only) + macros to simplify the code. I have not added requirements on the prime fields / an alternate Montgomery multiplication for primes not satisfying the requirements (https://hackmd.io/@zkteam/modular_multiplication).