arkworks-rs / algebra

Libraries for finite field, elliptic curve, and polynomial arithmetic
https://arkworks.rs
Apache License 2.0

Implement ASM for field ops/Arithmetic optimisations #78

Open jon-chuang opened 4 years ago

jon-chuang commented 4 years ago

It seems unacceptable that a primitive that could be accelerated (for a ~1.5x speedup) is not.

Hence, I suggest looking into how one should add assembly for field ops, to make use of vectorised instructions etc. I also suggest looking into modes of multiplication besides schoolbook, such as Karatsuba multiplication. This may not lead to huge speedups on x86_64, but it might on other platforms. Further, one should look into optimisations for faster extension field arithmetic.
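To illustrate the Karatsuba idea (a standalone sketch with hypothetical names, not arkworks code): a 64×64→128-bit multiply can be done with three 32×32 products instead of schoolbook's four, at the cost of a few extra additions. The same recursion is what pays off for wider operands.

```rust
/// Karatsuba 64x64 -> 128-bit multiply on 32-bit halves (illustrative sketch).
/// With x = x1*2^32 + x0 and y = y1*2^32 + y0:
///   x*y = z2*2^64 + z1*2^32 + z0, where
///   z0 = x0*y0, z2 = x1*y1, z1 = (x0+x1)*(y0+y1) - z0 - z2.
fn karatsuba_mul_64(x: u64, y: u64) -> u128 {
    let (x1, x0) = (x >> 32, x & 0xFFFF_FFFF);
    let (y1, y0) = (y >> 32, y & 0xFFFF_FFFF);
    let z0 = (x0 as u128) * (y0 as u128); // low halves
    let z2 = (x1 as u128) * (y1 as u128); // high halves
    // The middle term reuses z0 and z2, saving the fourth multiplication.
    let z1 = ((x0 + x1) as u128) * ((y0 + y1) as u128) - z0 - z2;
    (z2 << 64) + (z1 << 32) + z0
}

fn main() {
    assert_eq!(
        karatsuba_mul_64(u64::MAX, u64::MAX),
        (u64::MAX as u128) * (u64::MAX as u128)
    );
    println!("ok");
}
```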

Finally, I propose looking into how much of this code can be generated generically for fields of arbitrary size, using either macros or function calls to wrapped assembly, with target-dependent compilation.

The current pairing timings are about 2-2.5ms for the BLS12 curves on a 3+GHz machine. I believe this can be improved closer to the theoretical limit of roughly 1.1-1.3ms. As a target, I would set 1.5ms.

Pratyush commented 4 years ago

If you'd like to implement this, please go ahead, but I'd prefer to merge it behind a feature flag if we're going to be using unsafe code.

jon-chuang commented 4 years ago

Sure, I can abide by that. By the way, there seems to be a discrepancy between the full pairing timings for BLS12-381 (~2ms) and BLS12-377 (~2.5ms); I have replicated it 3 times. This does not quite stand to reason, as the underlying fields are the same size. Do you know of possible reasons?

Pratyush commented 4 years ago

The curves are of different types (M-type vs D-type), and so have different final exponentiation algorithms.

Pratyush commented 4 years ago

btw, you don't necessarily need to drop down to assembly to utilise SIMD instructions; they are available as intrinsics, for example here: https://doc.rust-lang.org/1.29.1/core/arch/x86_64/index.html
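For instance (a hedged sketch, not arkworks code; the function name is made up): a four-lane 64-bit add via the AVX2 intrinsics in `std::arch`, with runtime feature detection and a scalar fallback.

```rust
/// Add four u64 lanes at once using AVX2 intrinsics when available.
/// Illustrative only: this is lane-wise wrapping addition with no carry
/// propagation between limbs, which is exactly what makes SIMD awkward
/// for a saturated bignum representation.
fn add4_lanes(a: [u64; 4], b: [u64; 4]) -> [u64; 4] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            unsafe {
                use core::arch::x86_64::*;
                let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
                let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
                let mut out = [0u64; 4];
                _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, _mm256_add_epi64(va, vb));
                return out;
            }
        }
    }
    // Scalar fallback for non-x86_64 targets or CPUs without AVX2.
    let mut out = [0u64; 4];
    for i in 0..4 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}

fn main() {
    assert_eq!(add4_lanes([1, 2, 3, 4], [10, 20, 30, 40]), [11, 22, 33, 44]);
    println!("ok");
}
```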

jon-chuang commented 4 years ago

I see, I will try to study this. I noticed the Miller loop for 377 is slower too. Is there a reason why Zexe uses the 377 curve rather than the 381 one? Is it to have a large power-of-2 subgroup in both p − 1 and r − 1? Perhaps I should read the Zexe paper more carefully.

I will test out using the SIMD instructions provided first. From what I gather, it's still unsafe rust.

jon-chuang commented 4 years ago

I've looked into things: SIMD isn't efficient for small numbers unless one uses an unsaturated representation, which does not work for general moduli. Nonetheless, one could try rewriting the point addition formulas to take advantage of SIMD by parallelising at the data level rather than within the representation (adding 4 field elements at a time, etc.), rearranging ops to avoid data dependencies that stall the pipeline; but this is a lot of manual work, and wide SIMD can also cause clock speeds to throttle. There is some work on 32-bit SIMD which seems performant that I will review more carefully. In particular, I will look at whether strategies used to target NEON can also be used for WASM SIMD.

The Rust `std::arch` intrinsics do not currently expose Intel's ADX multiprecision add-with-carry (`adcx`/`adox`) instructions, so we will have to resort to writing assembly. Since it is difficult to maintain such assembly for all the supported field sizes, I will try to do code generation with macros, and focus on schoolbook Montgomery multiplication with goff's optimisation. I will aim for performance similar to or better than goff's.
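For reference, here is what the schoolbook Montgomery multiplication being targeted looks like in portable Rust, shrunk to a single 64-bit limb (a hypothetical sketch with made-up names; a real Fp256 version runs the same reduction over 4 limbs, which is where the assembly carry chains come in, and goff's trick additionally elides some carry handling when the modulus allows):

```rust
/// -n^{-1} mod 2^64 via Newton-Hensel iteration; each step doubles
/// the number of correct low bits, so 6 steps suffice for 64 bits.
fn mont_ninv(n: u64) -> u64 {
    let mut inv: u64 = 1; // correct to 1 bit since n is odd
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(n.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// Single-limb Montgomery multiplication: a * b * R^{-1} mod n with
/// R = 2^64, for odd n < 2^63. One REDC step: t = a*b;
/// m = t * (-n^{-1}) mod 2^64; u = (t + m*n) / 2^64 < 2n; subtract once.
fn mont_mul(a: u64, b: u64, n: u64, ninv: u64) -> u64 {
    let t = (a as u128) * (b as u128);
    let m = (t as u64).wrapping_mul(ninv);
    let u = ((t + (m as u128) * (n as u128)) >> 64) as u64;
    if u >= n { u - n } else { u }
}

fn main() {
    const N: u64 = (1 << 61) - 1; // Mersenne prime 2^61 - 1, demo modulus only
    let ninv = mont_ninv(N);
    let r = ((1u128 << 64) % (N as u128)) as u64;
    let r2 = ((r as u128 * r as u128) % (N as u128)) as u64; // R^2 mod N
    let (a, b) = (123_456_789u64, 987_654_321u64);
    // Into Montgomery form, multiply, then back out of Montgomery form.
    let (am, bm) = (mont_mul(a, r2, N, ninv), mont_mul(b, r2, N, ninv));
    let prod = mont_mul(mont_mul(am, bm, N, ninv), 1, N, ninv);
    assert_eq!(prod as u128, (a as u128 * b as u128) % (N as u128));
    println!("ok");
}
```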

Karatsuba might be reserved for curves over larger base fields, like the SW and BW curves, and will be prioritised later.

Pratyush commented 4 years ago

Thanks, this sounds like a good plan!

jon-chuang commented 4 years ago

As an update: I managed to integrate Fp256 assembly, leading to about a 25% speedup in Fp256 multiplications. But writing generic assembly code is proving harder than expected, and the inline assembly module doesn't play nicely with catching errors or doing what it's supposed to.

I am going to try to simplify my code by writing smaller chunks of inline assembly that are easier to verify piecemeal, and that hopefully still allow the Rust compiler to reason about memory fetches better than I can. Then I can tweak both the Rust and assembly sides of the code with careful benchmarking.
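One shape such a small, independently testable chunk could take (a sketch, not my actual code; on x86_64 the `_addcarry_u64` intrinsic compiles down to `adc`, and the portable fallback doubles as a reference to check the fast path against):

```rust
/// One limb of an add-with-carry chain. Returns (sum, carry_out).
/// Small enough to verify in isolation before composing into a full
/// multi-limb field addition.
#[allow(unreachable_code)]
fn limb_adc(a: u64, b: u64, carry_in: bool) -> (u64, bool) {
    #[cfg(target_arch = "x86_64")]
    {
        unsafe {
            use core::arch::x86_64::_addcarry_u64;
            let mut out = 0u64;
            let c = _addcarry_u64(carry_in as u8, a, b, &mut out);
            return (out, c != 0);
        }
    }
    // Portable fallback / reference implementation for other targets.
    let (s1, c1) = a.overflowing_add(b);
    let (s2, c2) = s1.overflowing_add(carry_in as u64);
    (s2, c1 || c2)
}

fn main() {
    // u64::MAX + 1 wraps to 0 with a carry out; the carry feeds the next limb.
    assert_eq!(limb_adc(u64::MAX, 1, false), (0, true));
    assert_eq!(limb_adc(0, 0, true), (1, false));
    println!("ok");
}
```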

Also, I should use the following tool for benchmarking clock cycles: https://git.zx2c4.com/kbench9000/about/. But it may only work with C? In any case, it seems better practice to benchmark clock cycles rather than wall-clock time, because clock speeds vary considerably over time, which makes timings noisy. I will look into what Rust tooling exists.
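In the meantime, a crude tick count is available directly from Rust (a sketch with hypothetical names; `_rdtsc` only exists on x86/x86_64, so this falls back to nanoseconds elsewhere, and serious measurement still wants pinned frequency, core affinity, and medians over many runs, as kbench9000 does):

```rust
use std::time::Instant;

/// Run `f` and return (result, tick count). On x86_64 the ticks are TSC
/// reads via `_rdtsc`; elsewhere they are nanoseconds from `Instant`.
/// Rough sketch only: no serialisation beyond `_mm_lfence`, no warm-up,
/// no statistics over repeated runs.
#[allow(unreachable_code)]
fn ticks<R>(f: impl FnOnce() -> R) -> (R, u64) {
    #[cfg(target_arch = "x86_64")]
    {
        use core::arch::x86_64::{_mm_lfence, _rdtsc};
        unsafe {
            _mm_lfence(); // keep earlier instructions from drifting past the read
            let start = _rdtsc();
            let out = f();
            _mm_lfence();
            let end = _rdtsc();
            return (out, end.wrapping_sub(start));
        }
    }
    // Fallback for non-x86 targets: nanoseconds, not cycles.
    let start = Instant::now();
    let out = f();
    (out, start.elapsed().as_nanos() as u64)
}

fn main() {
    let (sum, t) = ticks(|| (0u64..1000).sum::<u64>());
    assert_eq!(sum, 499_500);
    println!("~{} ticks", t);
}
```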

Pratyush commented 4 years ago

Re: benchmarking in terms of cycles, one way could be to look into how perf.rust-lang.org counts number of instructions instead of wall-clock time.

Pratyush commented 3 years ago

@jon-chuang what's the status of this? Should this be closed?