flintlib / flint

FLINT (Fast Library for Number Theory)
http://www.flintlib.org
GNU Lesser General Public License v3.0

Assembly for Arm v8.5-A ISA #1806

Open albinahlback opened 9 months ago

albinahlback commented 9 months ago

I'm sure everyone has noticed that Apple's M-series chips are roughly as fast as state-of-the-art x86 processors (see GMP's benchmark results). I therefore think we should implement assembly routines for them as well.

These are the current routines that should be implemented:

Useful links:

  1. https://dougallj.github.io/applecpu/firestorm.html
  2. https://dougallj.github.io/applecpu/firestorm-int.html
  3. https://dougallj.github.io/applecpu/firestorm-simd.html
  4. https://developer.arm.com/architectures/instruction-sets/intrinsics/
  5. https://developer.arm.com/documentation/ddi0602/2023-12?lang=en
  6. https://github.com/corsix/amx
  7. https://stackoverflow.com/questions/70717360/how-to-load-vector-registers-from-integer-registers-in-arm64-m1

albinahlback commented 8 months ago

Currently on my arm_assembly branch:

mpn_mul vs flint_mpn_mul, ratio for an m-limb by n-limb product (row m, columns n = 1, ..., m):

n:          1    2    3    4    5    6    7    8    9   10   11   12
m =   1: 4.67
m =   2: 4.68 3.61
m =   3: 4.01 3.30 3.04
m =   4: 2.89 2.39 2.27 2.18
m =   5: 3.03 2.21 1.95 2.02 2.04
m =   6: 2.64 1.97 1.82 1.89 2.18 2.05
m =   7: 2.32 1.79 1.99 1.68 1.76 1.79 1.83
m =   8: 2.13 1.69 1.61 1.59 1.70 1.74 1.81 1.79
m =   9: 1.96 1.63 1.57 1.53 1.63 1.64 1.64 1.71 1.77
m =  10: 1.81 1.49 1.48 1.47 1.51 1.63 1.60 1.69 1.73 1.75
m =  11: 1.75 1.50 1.45 1.46 1.45 1.48 1.51 1.53 1.57 1.56 1.58
m =  12: 1.63 1.37 1.39 1.47 1.51 1.57 1.69 1.78 1.67 1.58 1.58 1.61

Tested on cfarm103 (Apple M1)
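For context on what the table measures: GMP's mpn_mul and flint_mpn_mul both compute the full (m + n)-limb product of an m-limb operand and an n-limb operand, with m ≥ n ≥ 1. The simplest way to compute that product is schoolbook (basecase) multiplication. The following is a minimal portable C sketch of it, not FLINT's or GMP's actual code; the names ref_mul and limb_t are hypothetical, and it assumes a 64-bit GCC or Clang target where unsigned __int128 is available. A reference like this could be used to check hand-written AArch64 routines, where each 64 × 64 → 128-bit product is typically compiled to a MUL/UMULH pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limb type: one 64-bit machine word, as on AArch64. */
typedef uint64_t limb_t;

/* Hypothetical reference routine (not a FLINT/GMP API):
   rp[0 .. m+n-1] = {ap, m} * {bp, n}, limbs stored least significant first.
   Requires m >= n >= 1 and rp not overlapping ap or bp. */
static void ref_mul(limb_t *rp, const limb_t *ap, size_t m,
                    const limb_t *bp, size_t n)
{
    assert(m >= n && n >= 1);

    for (size_t i = 0; i < m + n; i++)
        rp[i] = 0;

    /* One row per limb of b: accumulate ap * bp[j] into rp at offset j. */
    for (size_t j = 0; j < n; j++)
    {
        limb_t carry = 0;
        for (size_t i = 0; i < m; i++)
        {
            /* 64x64 -> 128-bit product plus two 64-bit addends; the sum is
               at most 2^128 - 1, so it cannot overflow. */
            unsigned __int128 t = (unsigned __int128) ap[i] * bp[j]
                                  + rp[i + j] + carry;
            rp[i + j] = (limb_t) t;          /* low half (MUL on AArch64) */
            carry = (limb_t) (t >> 64);      /* high half (UMULH on AArch64) */
        }
        rp[j + m] = carry;                   /* top limb of this row */
    }
}
```

For example, multiplying {UINT64_MAX, UINT64_MAX} by {UINT64_MAX}, i.e. (2^128 - 1)(2^64 - 1), gives the limbs {1, UINT64_MAX, UINT64_MAX - 1}, least significant first. Comparing an assembly routine's output with such a reference on random operands for every m ≥ n up to 12 is one way to validate it before timing it.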