Implementation of Windowed NAF multiplication

Also includes an initial re-implementation of Montgomery form operations (for affine points) based on the rust zcash code, however this requires field inversions so isn't very performant - the idea is to combine wNAF with projective montgomery coordinates to see if the number of operations can be reduced even further.

If projective montgomery coordinates don't work very well in practice this will still provide a general speedup which can be implemented in the Solidity code to reduce gas cost.

We can benchmark the number of FQ operations with different windows:

>>> for i in list(range(2,8)): 
...     print("Window size =", i) 
...     FQ._reset_counts() 
...     mult_naf_lut(Point.generator().as_etec(), x, width=i) 
...     FQ._print_counts()                                                                  
Window size = 2
add = 731
mul = 3340
sub = 1306

Window size = 3
add = 687
mul = 3184
sub = 1254

Window size = 4
add = 676
mul = 3181
sub = 1266

Window size = 5
add = 746
mul = 3515
sub = 1414

Window size = 6
add = 993
mul = 4634
sub = 1882

Window size = 7
add = 1601
mul = 7370
sub = 3010

Compared to affine multiplication (with inversion):

add = 712
inv = 712
mul = 4628
sub = 712

Compared to ETEC multiplication (without inversion):

add = 819
mul = 3660
sub = 1420

Compared to NAF multiplication (without a window, just simple negation):

add = 729
mul = 3330
sub = 1378

This shows the most efficient window is 4-bits wide, which provides a ~14% reduction in the number of field operations. However, is the cost of calculating the wNAF form in Solidity/EVM low enough to justify the extra complexity, and how much will the overall gas reduction be?

Benchmarking these different methods in Python, where results are in seconds per operation, with variables:

a = Affine
b = Projective
c = extended projective
d = Montgomery affine

a * x 0.13991
b * x 0.01393
c * x 0.01457
d * x 0.07344
mult_naf(a, x) 0.12074
mult_naf(b, x) 0.01289
mult_naf(c, x) 0.01215
mult_naf(d, x) 0.06434
mult_naf_lut(a, x, 2) 0.12352
mult_naf_lut(b, x, 2) 0.01331
mult_naf_lut(c, x, 2) 0.01247
mult_naf_lut(d, x, 2) 0.06731
mult_naf_lut(a, x, 3) 0.11875
mult_naf_lut(b, x, 3) 0.01259
mult_naf_lut(c, x, 3) 0.01117
mult_naf_lut(d, x, 3) 0.06319
mult_naf_lut(a, x, 4) 0.11229
mult_naf_lut(b, x, 4) 0.0118
mult_naf_lut(c, x, 4) 0.01122
mult_naf_lut(d, x, 4) 0.06054
mult_naf_lut(a, x, 5) 0.11454
mult_naf_lut(b, x, 5) 0.01138
mult_naf_lut(c, x, 5) 0.01098
mult_naf_lut(d, x, 5) 0.06135
mult_naf_lut(a, x, 6) 0.11214
mult_naf_lut(b, x, 6) 0.01138
mult_naf_lut(c, x, 6) 0.01166
mult_naf_lut(d, x, 6) 0.06127
mult_naf_lut(a, x, 7) 0.11996
mult_naf_lut(b, x, 7) 0.01189
mult_naf_lut(c, x, 7) 0.01192
mult_naf_lut(d, x, 7) 0.06264
mult_naf_lut(a, x, 8) 0.12416
mult_naf_lut(b, x, 8) 0.01317
mult_naf_lut(c, x, 8) 0.01285
mult_naf_lut(d, x, 8) 0.06889
mult_naf_lut(a, x, 9) 0.15301
mult_naf_lut(b, x, 9) 0.0176
mult_naf_lut(c, x, 9) 0.01642
mult_naf_lut(d, x, 9) 0.08172

We can see a lookup table of 5 is the most efficient overall (in wall-clock time), and ETEC points in combination with a 5-bit w-NAF is the most efficient variant - being about 25% faster than affine coordinates without the wNAF method.

However, the trade-off between simplicity and speed between NAF and wNAF is very small, meaning wNAF only becomes faster than NAF when the windows size is ~4 or 5 bits... Otherwise NAF without the window is simpler and generally faster.

HarryR / ethsnarks

Implementation of Windowed NAF multiplication #123