for a 6 word modulus (bls377, bls381):
cycles count: 121 (vs 125 for herumi/mcl)
benchmark old ns/op new ns/op delta
BenchmarkMulAssignELEMENT-8 53.9 40.7 -24.49%
BenchmarkMulAssignELEMENT-8 54.0 40.9 -24.26%
BenchmarkMulAssignELEMENT-8 54.0 40.3 -25.37%
internal/template/asm/mul.go contains code generation for the amd64 multiplication. It checks for availability of ADDX and MULX and jump to the standard version if not present.
// for i=0 to N-1
// for j=0 to N-1
// (A,t[j]) := t[j] + a[j]*b[i] + A
// m := t[0]*q'[0] mod W
// C,_ := t[0] + m*q[0]
// for j=1 to N-1
// (C,t[j-1]) := t[j] + m*q[j] + C
// t[N-1] = C + A
As described in: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
This new version is constant time.
Implemented only when nbWords <= 6; need to manage register spilling into memory for larger moduli.
Squaring has an ASM implementation in this branch that's not plugged, as it is currently slower than MUL
Preliminary benchmarks: for a 4 word modulus (bn256):
for a 6 word modulus (bls377, bls381): cycles count: 121 (vs 125 for
herumi/mcl
)internal/template/asm/mul.go
contains code generation for the amd64 multiplication. It checks for availability of ADDX and MULX and jump to the standard version if not present.To benefit from the ADCX and ADOX carry chains we split the inner loops in 2 (https://hackmd.io/@zkteam/modular_multiplication) :