OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Tuning for Apple M1 AMX2 coprocessor #3789

Open zinphi opened 2 years ago

zinphi commented 2 years ago

I know there is already a similar discussion on general tuning for Apple M1 chips (see https://github.com/xianyi/OpenBLAS/issues/2814), but I wanted to revive this topic a little, focusing on Apple's AMX2 extension. Regarding plain M1 ARMv8 support, OpenBLAS already works pretty well, and I think you did a very good job on that! Nevertheless, we know by now that Apple's AMX2 extension (e.g. through vecLib) probably allows even higher performance while simultaneously achieving dramatically better efficiency. If I understood correctly, the main obstacle to adopting AMX2 instructions so far has been the missing documentation from Apple. However, even without Apple's help, those instructions appear to have been reverse-engineered (and unofficially documented) nearly completely over the last two years (at least to my understanding; e.g., see https://github.com/corsix/amx#readme). Thus my question: with this knowledge now available, would it be possible to implement appropriate (Apple M1) AMX2 kernels in OpenBLAS so that we can achieve the same or even better performance/efficiency than vecLib?

martin-frbg commented 2 years ago

Same legal minefield as Dougall Johnson's reverse engineering work that was already suggested in JuliaLang/julia#2814 (and appears in the references/links document of corsix), and it does not appear as if anybody else (e.g. gcc) is basing M1 code on this currently either. So not really keen to touch this.

zinphi commented 2 years ago

Idk about the legal aspects of using undocumented opcodes. From my perspective as a layman, I would guess you are allowed to run any kind of code on a PC you bought. And I think all open-source licenses include a legal disclaimer which should release you from liability if something goes south. Could anybody perhaps ask an expert about this legal question? I know that corsix references the gist that was already suggested. It is surely no 'official' documentation, but, in contrast to the previously suggested gist, corsix provides documentation which is probably as complete and extensive as it can be. Furthermore, corsix provides test routines for all instructions which prove their functionality.

brada4 commented 2 years ago

You have the Accelerate framework from Apple, which uses the secret co-processor. Not even Xcode's clang has any support for those instructions. I don't see a disassembler anywhere either.

zinphi commented 2 years ago

The thing is that the Accelerate framework provides only an LP64 BLAS/LAPACK, and some software relies on an ILP64 implementation. E.g., Julia (see https://github.com/JuliaLang/LinearAlgebra.jl/issues/869) uses OpenBLAS by default on all platforms but also allows plugging in other BLAS/LAPACK implementations as long as they provide ILP64 routines. This basically prevents the use of vecLib in Julia. On the other hand, if OpenBLAS had the same performance/efficiency as vecLib, there would be no need to switch to the Accelerate framework in the first place - this was also my motivation for posting the idea here, since it would be the most elegant way to resolve this issue IMHO (from a Julia perspective)...

However, you seem doubtful about using the AMX extension directly, and I understand your point of view from a legal perspective. Since you mentioned the Accelerate framework, here is an alternative idea: would it be possible (or reasonable - Idk the structure and paradigms of OpenBLAS) to simply write a wrapper that calls those few BLAS level 3 routines which strongly benefit from AMX code directly from vecLib BLAS, preserving the ILP64 interface and all other OpenBLAS conventions?

brada4 commented 2 years ago

Currently the AMX co-processor is undocumented and uninstrumented. It is quite legal to tinker, though nothing prevents the OEM from microcoding your toys away. A wrapper can be written at the ILP64->LP64 level - just check whether the int64 arguments would fit in int32 and call Accelerate; otherwise call the int64 OpenBLAS. That switchover point is an int32-sized bunch of floats, i.e. 16GB and larger matrices, impractical on current M1 systems. For all practical purposes you double the speed on current M1 systems while maintaining the purity of int64 internals in Julia, and bridge the gap until AMX is either instrumented or obsolete.

zinphi commented 2 years ago

Sure, the vecLib BLAS wrapper solution would accelerate at least matrix multiplications. But, if I understood correctly, OpenBLAS internally uses its optimized kernels to accelerate things such as some LAPACK routines and other operations. Obviously, with BLAS wrappers one would not have that benefit there. Additionally, the BLAS version/behavior might differ in the end, as the vecLib and OpenBLAS implementations might diverge at some point. Thus, I think, in the end, the only 'clean' solution would be to have those optimized kernels in OpenBLAS which benefit from Apple's AMX extension...

Recently, I stumbled over Apple's simd API, which is also part of the Accelerate framework: https://developer.apple.com/documentation/accelerate/simd. It looks to me as if those SIMD functions have much the same functionality as OpenBLAS kernels (e.g., simd_mul() on 4x4 double matrices; please correct me if I'm wrong). Hence, another idea: wouldn't it be possible to interface these SIMD functions to OpenBLAS kernels? I strongly assume these SIMD functions use the AMX backend. If this strategy were feasible, OpenBLAS could be optimized for several (future) Apple hardware products without larger modifications, since Apple will probably optimize these functions for its individual platforms.

brada4 commented 2 years ago

Actually, the dubiously named co-processor is not mentioned anywhere in that abstraction API. Or by Apple.

Though you can try for yourself whether a 4x4 * 4x1 multiply generates any non-disassemblable instructions or calls deep into the Accelerate libraries.