Assembly for Arm v8.5-A ISA

I'm sure everyone has noticed that Apple's M-series chips are basically as fast as state-of-the-art x86 processors (see GMP's benchmark results). Therefore, I think we should implement assembly routines for them as well.

These are the current routines that should be implemented:

[x] Hard(ish)coded multiplication (treated in #1808, works as a full replacement for mpn_mul_basecase)
[x] Hardcoded squaring (treated in #1912)
[x] Hardcoded high multiplication (treated in #1912)
[x] Hardcoded high squaring (treated in #1912)
[x] High multiplication, basecase (treated in #1912)
[ ] High squaring, basecase
[ ] Hardcoded low multiplication
[ ] Hardcoded low squaring
[ ] Low multiplication, basecase
[ ] Low squaring, basecase
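
For anyone picking up one of the open items, here is a rough portable C sketch of what the full, low and high products compute, which could serve as a reference when testing an assembly routine. All names (`limb_t`, `ref_mul_basecase`, `ref_mullow`, `ref_mulhigh`) are made up for illustration and are not FLINT or GMP functions. It assumes 64-bit limbs stored least significant first and a compiler with `unsigned __int128` (GCC or Clang). The high product here is the exact upper half; an optimized high multiplication would typically skip the low partial products and may return only an approximation, so the exact contract should be checked against the PRs linked above.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t; /* assumption: 64-bit limbs, least significant first */

/* Schoolbook product: rp[0..n+m-1] = ap[0..n-1] * bp[0..m-1].
   rp must not overlap ap or bp. */
static void ref_mul_basecase(limb_t *rp, const limb_t *ap, size_t n,
                             const limb_t *bp, size_t m)
{
    for (size_t i = 0; i < n + m; i++)
        rp[i] = 0;

    for (size_t j = 0; j < m; j++)
    {
        limb_t carry = 0;
        for (size_t i = 0; i < n; i++)
        {
            /* 64x64 -> 128 bit multiply-accumulate; cannot overflow 128 bits */
            unsigned __int128 t = (unsigned __int128) ap[i] * bp[j]
                                  + rp[i + j] + carry;
            rp[i + j] = (limb_t) t;
            carry = (limb_t) (t >> 64);
        }
        rp[j + n] = carry;
    }
}

/* Low product: rp[0..n-1] = (ap * bp) mod 2^(64 n), both inputs n limbs.
   Partial products that only affect limbs >= n are never computed. */
static void ref_mullow(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    for (size_t i = 0; i < n; i++)
        rp[i] = 0;

    for (size_t j = 0; j < n; j++)
    {
        limb_t carry = 0;
        for (size_t i = 0; i + j < n; i++)
        {
            unsigned __int128 t = (unsigned __int128) ap[i] * bp[j]
                                  + rp[i + j] + carry;
            rp[i + j] = (limb_t) t;
            carry = (limb_t) (t >> 64);
        }
        /* the carry out of limb n - 1 is discarded on purpose */
    }
}

/* Exact high product: rp[0..n-1] = floor(ap * bp / 2^(64 n)).
   scratch must have room for 2 n limbs. This is the slow reference;
   it computes the full product and keeps the upper half. */
static void ref_mulhigh(limb_t *rp, const limb_t *ap, const limb_t *bp,
                        size_t n, limb_t *scratch)
{
    ref_mul_basecase(scratch, ap, n, bp, n);
    for (size_t i = 0; i < n; i++)
        rp[i] = scratch[n + i];
}
```

The squaring variants are the same operations with `bp == ap`; a dedicated routine can exploit the symmetry of the cross terms, which this sketch does not attempt.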

Useful links:

https://dougallj.github.io/applecpu/firestorm.html
https://dougallj.github.io/applecpu/firestorm-int.html
https://dougallj.github.io/applecpu/firestorm-simd.html
https://developer.arm.com/architectures/instruction-sets/intrinsics/
https://developer.arm.com/documentation/ddi0602/2023-12?lang=en
https://github.com/corsix/amx
https://stackoverflow.com/questions/70717360/how-to-load-vector-registers-from-integer-registers-in-arm64-m1

flintlib/flint, issue #1806