element.Mul now uses ADOX ADCX and MULX when available

As described in: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
This new version is constant time.
Implemented only when nbWords <= 6; need to manage register spilling into memory for larger moduli.
Squaring has an ASM implementation in this branch that's not plugged, as it is currently slower than MUL

Preliminary benchmarks: for a 4 word modulus (bn256):

bn256 [develop] $ benchcmp old.txt new.txt 
benchmark                       old ns/op     new ns/op     delta
BenchmarkMulAssignELEMENT-8     28.1          21.8          -22.42%
BenchmarkMulAssignELEMENT-8     28.8          21.8          -24.31%
BenchmarkMulAssignELEMENT-8     28.9          21.8          -24.57%

for a 6 word modulus (bls377, bls381): cycles count: 121 (vs 125 for herumi/mcl)

benchmark                       old ns/op     new ns/op     delta
BenchmarkMulAssignELEMENT-8     53.9          40.7          -24.49%
BenchmarkMulAssignELEMENT-8     54.0          40.9          -24.26%
BenchmarkMulAssignELEMENT-8     54.0          40.3          -25.37%

internal/template/asm/mul.go contains code generation for the amd64 multiplication. It checks for availability of ADDX and MULX and jump to the standard version if not present.

To benefit from the ADCX and ADOX carry chains we split the inner loops in 2 (https://hackmd.io/@zkteam/modular_multiplication) :

    // for i=0 to N-1
    //      for j=0 to N-1
    //          (A,t[j])  := t[j] + a[j]*b[i] + A
    //      m := t[0]*q'[0] mod W
    //      C,_ := t[0] + m*q[0]
    //      for j=1 to N-1
    //          (C,t[j-1]) := t[j] + m*q[j] + C
    //      t[N-1] = C + A

Consensys / goff

element.Mul now uses ADOX ADCX and MULX when available #13