AztecProtocol / aztec-2.0

C++ elliptic curve library

Research notes, SIMD implementation, faster multiplication and `GF(2^m)` #1

Open HarryR opened 5 years ago

HarryR commented 5 years ago

I also found this really useful: https://www.iacr.org/workshops/ches/ches2013/presentations/CHES2013_Session6_1.pdf (although it focuses a lot on binary curves rather than prime-field ones)

However, it seems like it can't beat the MULX, ADOX, ADCX sequence as described by: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf

And this covers squaring using MULX, ADOX and ADCX sequences: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/large-integer-squaring-ia-paper.pdf

This repository, together with https://github.com/cloudflare/sidh/blob/master/p503/arith_amd64.s, seems to be enough to adapt libff with the MULX + ADOX/ADCX optimisations: batching the loads, then a batch of interleaved ADOX/ADCX for best pipeline usage (roughly as sketched below).
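To make the pattern concrete, here is a rough sketch of one partial-product row using the compiler intrinsics rather than hand-written assembly: one MULX per limb product, then two independent carry chains (ADCX on CF and ADOX on OF at the assembly level) so the adds can be interleaved without fighting over the flags. This is only my own illustration of the idea from those papers (the function name and layout are made up, not from libff or barretenberg), and it needs -mbmi2 -madx:

#include <immintrin.h>
#include <stdint.h>

// Sketch only: acc[0..4] += a[0..3] * b, i.e. one row of a schoolbook multiply.
// Returns the carry out of the top limb so a caller could fold it onwards.
static inline unsigned char mul_add_row(unsigned long long acc[5],
                                        const unsigned long long a[4],
                                        unsigned long long b)
{
    unsigned long long lo[4], hi[4];

    // Batch the MULX products first (MULX does not touch the flags).
    lo[0] = _mulx_u64(a[0], b, &hi[0]);
    lo[1] = _mulx_u64(a[1], b, &hi[1]);
    lo[2] = _mulx_u64(a[2], b, &hi[2]);
    lo[3] = _mulx_u64(a[3], b, &hi[3]);

    // Carry chain 1 (intended to map to ADCX, carried on CF): add the low halves.
    unsigned char cf = 0;
    cf = _addcarryx_u64(cf, acc[0], lo[0], &acc[0]);
    cf = _addcarryx_u64(cf, acc[1], lo[1], &acc[1]);
    cf = _addcarryx_u64(cf, acc[2], lo[2], &acc[2]);
    cf = _addcarryx_u64(cf, acc[3], lo[3], &acc[3]);
    cf = _addcarryx_u64(cf, acc[4], 0, &acc[4]);

    // Carry chain 2 (intended to map to ADOX, carried on OF): add the high
    // halves shifted up by one limb. Independent of chain 1, so they interleave.
    unsigned char of = 0;
    of = _addcarryx_u64(of, acc[1], hi[0], &acc[1]);
    of = _addcarryx_u64(of, acc[2], hi[1], &acc[2]);
    of = _addcarryx_u64(of, acc[3], hi[2], &acc[3]);
    of = _addcarryx_u64(of, acc[4], hi[3], &acc[4]);

    return (unsigned char)(cf + of);
}

Across a full 4x4 multiply that is 16 MULX products and two independent add chains per row, which is where the pipeline win over a plain MUL/ADC chain comes from.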

From https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/polynomial-multiplication-instructions-paper.pdf (pages 13-15): the CLMUL instruction looks unlikely to outperform the above without a way to represent the GF(p) field over GF(2^n), and I can't find an appropriate irreducible polynomial... I don't think an arbitrary prime field can easily be represented over a binary field.

You would assume that an optimising compiler could produce something similar to the following for fq_impl_int128:

Karatsuba Multiplication 128-bit x 128-bit = 256-bit

#include <immintrin.h>  // _mm_clmulepi64_si128 (PCLMUL) plus SSE2 intrinsics

// c1:c0 = a * b (carry-less GF(2)[x] multiply, Karatsuba on the 64-bit halves)
void GF2m_mul_2x2_xmm(__m128i *c1, __m128i *c0, __m128i b, __m128i a)
{
    __m128i t1, t2;
    *c0 = _mm_clmulepi64_si128(a, b, 0x00);   // a0 * b0
    *c1 = _mm_clmulepi64_si128(a, b, 0x11);   // a1 * b1
    t1 = _mm_shuffle_epi32(a, 0xEE);          // broadcast high 64-bit half of a
    t1 = _mm_xor_si128(a, t1);                // a0 ^ a1 in the low half
    t2 = _mm_shuffle_epi32(b, 0xEE);          // broadcast high 64-bit half of b
    t2 = _mm_xor_si128(b, t2);                // b0 ^ b1 in the low half
    t1 = _mm_clmulepi64_si128(t1, t2, 0x00);  // (a0^a1) * (b0^b1)
    t1 = _mm_xor_si128(*c0, t1);
    t1 = _mm_xor_si128(*c1, t1);              // middle term
    t2 = t1;
    t1 = _mm_slli_si128(t1, 8);               // low 64 bits of middle -> upper half of c0
    t2 = _mm_srli_si128(t2, 8);               // high 64 bits of middle -> lower half of c1
    *c0 = _mm_xor_si128(*c0, t1);
    *c1 = _mm_xor_si128(*c1, t2);
}

Karatsuba Multiplication 256-bit x 256-bit = 512-bit

// c3:c2:c1:c0 = a1:a0 * b1:b0 (carry-less, Karatsuba on the 128-bit halves)
// a, b: 256-bit operands stored as four 64-bit limbs d[0..3]
{
    __m128i a0, a1, b0, b1, c0, c1, c2, c3, c4, c5;
    a1 = _mm_set_epi64x(a->d[3], a->d[2]);
    a0 = _mm_set_epi64x(a->d[1], a->d[0]);
    b1 = _mm_set_epi64x(b->d[3], b->d[2]);
    b0 = _mm_set_epi64x(b->d[1], b->d[0]);
    GF2m_mul_2x2_xmm(&c1, &c0, a0, b0);       // low 256-bit product a0 * b0
    GF2m_mul_2x2_xmm(&c3, &c2, a1, b1);       // high 256-bit product a1 * b1
    a0 = _mm_xor_si128(a0, a1);               // a0 ^ a1
    b0 = _mm_xor_si128(b0, b1);               // b0 ^ b1
    GF2m_mul_2x2_xmm(&c5, &c4, a0, b0);       // (a0^a1) * (b0^b1)
    c4 = _mm_xor_si128(c4, c0);
    c4 = _mm_xor_si128(c4, c2);
    c5 = _mm_xor_si128(c5, c1);
    c5 = _mm_xor_si128(c5, c3);               // c5:c4 = middle term
    c1 = _mm_xor_si128(c1, c4);               // fold the middle term in
    c2 = _mm_xor_si128(c2, c5);               //   at a 128-bit offset
}

Quick Squaring (256-bit)^2 = 512-bit

// c3:c2:c1:c0 = (a1:a0)^2 (carry-less squaring: each 64-bit limb squared independently)
{
    c0 = _mm_clmulepi64_si128(a0, a0, 0x00);  // low limb of a0 squared
    c1 = _mm_clmulepi64_si128(a0, a0, 0x11);  // high limb of a0 squared
    c2 = _mm_clmulepi64_si128(a1, a1, 0x00);  // low limb of a1 squared
    c3 = _mm_clmulepi64_si128(a1, a1, 0x11);  // high limb of a1 squared
}
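
(Squaring needs no cross terms at all: in GF(2)[x] the mixed terms have coefficient 2 = 0, so (a1*x^64 + a0)^2 = a1^2*x^128 + a0^2, and the 256-bit square is just the four 64-bit limbs squared and concatenated.)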
zac-williamson commented 5 years ago

Heya, this is awesome research, thanks for this. That Intel paper is really handy.

Out of curiosity, are you thinking of refactoring the libff subroutines to use the ADCX/ADOX/MULX multiplication chain, or linking the subroutines used in barretenberg?

HarryR commented 5 years ago

I'm thinking of refactoring the libff subroutines to use ADCX/ADOX/MULX (with load buffering, then interleaving), via a compile-time flag like HAVE_ADCX.
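
Roughly the shape I mean for the flag (sketch only; HAVE_ADCX is a placeholder macro, not an existing libff option, and the surrounding names are made up):

#include <stdint.h>
#if defined(HAVE_ADCX)
#include <immintrin.h>   // MULX / ADCX / ADOX intrinsics, needs -mbmi2 -madx
#endif

// 64x64 -> 128-bit multiply primitive, selected at compile time; the full
// interleaved ADCX/ADOX row routine would sit behind the same guard.
static inline uint64_t mul64_wide(uint64_t a, uint64_t b, uint64_t *hi)
{
#if defined(HAVE_ADCX)
    unsigned long long h;
    uint64_t lo = _mulx_u64(a, b, &h);               // BMI2 MULX, leaves the flags alone
    *hi = h;
    return lo;
#else
    unsigned __int128 p = (unsigned __int128)a * b;  // portable fallback path
    *hi = (uint64_t)(p >> 64);
    return (uint64_t)p;
#endif
}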

I need some time, and to finish a few other things, before I get onto that task, but it looks like a reasonable speedup.

Also, I think I found the slowdown with GMP on WASM: the fallback method uses naïve modular exponentiation rather than Montgomery (and also hits a lot of bounds checks etc.), so I should be able to get that sorted out too, which should reduce my WASM proving time.
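
For context, the point of Montgomery form is that the per-step division disappears; a minimal single-word sketch of the reduction (purely illustrative, not GMP's or libff's code, and it assumes n < 2^63 so the 128-bit sum cannot overflow):

#include <stdint.h>

// Montgomery reduction with R = 2^64: given t < n*R, n odd, and
// n_inv = -n^{-1} mod 2^64, returns t * R^{-1} mod n.
static uint64_t redc(unsigned __int128 t, uint64_t n, uint64_t n_inv)
{
    uint64_t m = (uint64_t)t * n_inv;                            // m = (t mod R) * n' mod R
    unsigned __int128 u = (t + (unsigned __int128)m * n) >> 64;  // t + m*n is divisible by R
    return (u >= n) ? (uint64_t)(u - n) : (uint64_t)u;
}

// a and b in Montgomery form (x*R mod n); the result is a*b*R^{-1} mod n,
// so the product stays in Montgomery form and exponentiation never needs
// the generic modulo operation the naïve fallback keeps paying for.
static uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t n, uint64_t n_inv)
{
    return redc((unsigned __int128)a * b, n, n_inv);
}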

I will let you know when I start making progress.