Open HarryR opened 5 years ago
Heya, this is awesome research, thanks for this. That intel paper is really handy.
Out of curiosity, are you thinking of refactoring the libff subroutines to use the ADCX/ADOX/MULX multiplication chain, or linking the subroutines used in barretenberg?
I'm thinking of refactoring the libff
subroutines to use ADCX/ADOX/MULX (with load buffering, then interleaving), via a compile-time flag like HAVE_ADCX
.
I need some time and to complete some things before I get onto the task though, but it looks like a reasonable speedup.
Also, I think I found the slowdown with GMP on WASM, the fallback method uses the naïve modulo powering method rather than montgomery (and also hits a lot of bounds-checks etc.), so should be able to get that sorted out too which should reduce my WASM proving time.
I will let you know when I start making progress.
I also found this really useful too: https://www.iacr.org/workshops/ches/ches2013/presentations/CHES2013_Session6_1.pdf (although it focuses a lot on binary curves, rather than prime)
However, it seems like it can't beat the MULX, ADOX, ADCX sequence as described by: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
And this covers squaring using MULX, ADOX and ADCX sequences: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/large-integer-squaring-ia-paper.pdf
This repository, in addition to https://github.com/cloudflare/sidh/blob/master/p503/arith_amd64.s seems to be enough to adapt
libff
with the MULX + ADOX/ADCX optimisations, batching loads and then a batch of interleaved adox/adcx for best pipeline usage.From https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/polynomial-multiplication-instructions-paper.pdf (pages 13-15) - The CLMUL instruction looks like it's unlikely to be able to outperform the above without being able to represent the
G(p)
field overG(2^n)
(I can't find an appropriate irreducible polynomial,... I don't think that's possible to easily represent some arbitrary prime over a binary field)You would assume that an optimising compiler could produce something similar to the following for
fq_impl_int128
:Karatsuba Multiplication 128-bit x 128-bit = 256-bit
Karatsuba Multiplication 256-bit x 256-bit = 512-bit
Quick Squaring (256-bit)^2 = 512-bit