Closed mrmr1993 closed 4 years ago
One can use prefetching to improve the multiexp cpu util. In my experience, hyperthreading on also helps. However, one does see better gains for greater number of integer limbs.
I am also thinking that the main bottleneck to the algorithm is the indexing. Aztec solves this using a in-place radix sort. Hopefully I will have time to rewrite my implementation again. The first time I rewrote it to be more in-place, I got a sizeable speed bump. I think radix sort might be good due to cache locality.
Managed to use the radix sort and got a bump over the non-affine version :) Stay tuned for updates on improving hardware prefetching by writing the data in a nice way.
Looks like we've achieved a decent 1.3-4x bump for BN254 running asm, at 8.2s for 2^24 elems, or about 0.5s for 2^20 elems on 8 cores with a relatively small L3 cache: https://github.com/celo-org/zexe/pull/4. For larger curves with longer field arithmetic, we get quite close to the theoretical 1.83x, at 1.69x
This branch contains the changes necessary to work with the coda repo and the updated marlin at o1-labs/marlin#49.