Complex (cgemm, zgemm):
Use a different pack layout for complex microkernels, which puts real and imaginary parts in separate rows. This enables much better autovectorization for the fallback kernels.
Also enable an AVX2 + FMA autovectorized kernel.
Performance improves as a result (all cgemm and zgemm kernels are autovectorized at this time).
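
To illustrate why the split layout helps, here is a minimal sketch, not the crate's actual code. The names `MR`, `pack_split` and `axpy_split` are hypothetical, and the panel height of 4 is an assumption. With real and imaginary parts in separate rows, the inner loop consists of plain elementwise multiply-adds over contiguous `f32` lanes, which compilers tend to vectorize without any shuffles.

```rust
/// Hypothetical panel height (number of complex values per packed column).
const MR: usize = 4;

/// Pack `k` columns of `MR` interleaved complex values (`[re, im]` pairs)
/// so that each packed column holds `MR` real parts followed by `MR`
/// imaginary parts.
fn pack_split(src: &[[f32; 2]], k: usize) -> Vec<f32> {
    assert_eq!(src.len(), MR * k);
    let mut out = vec![0.0; 2 * MR * k];
    for (col, chunk) in src.chunks_exact(MR).enumerate() {
        let base = 2 * MR * col;
        for (i, c) in chunk.iter().enumerate() {
            out[base + i] = c[0]; // row of real parts
            out[base + MR + i] = c[1]; // row of imaginary parts
        }
    }
    out
}

/// Accumulate one packed column of A times one complex scalar `b`
/// into separate real and imaginary accumulators.
fn axpy_split(acc_re: &mut [f32; MR], acc_im: &mut [f32; MR], a_col: &[f32], b: [f32; 2]) {
    let (a_re, a_im) = a_col.split_at(MR);
    for i in 0..MR {
        // (a_re + i*a_im) * (b_re + i*b_im), with both parts computed elementwise
        acc_re[i] += a_re[i] * b[0] - a_im[i] * b[1];
        acc_im[i] += a_re[i] * b[1] + a_im[i] * b[0];
    }
}
```

With an interleaved layout the same loop would have to separate neighbouring lanes into real and imaginary parts before multiplying, which is harder for the autovectorizer to handle well.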
Float (sgemm, dgemm):
Now that the kernels can select their own packing functions, instantiate an AVX2 version of the general packing function for sgemm and dgemm.
Packing performance matters most for small matrix multiplications; for bigger sizes it is a vanishingly small part of the runtime.
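
One way a kernel could select its own packing function is sketched below. This is an assumption about the general shape, not the crate's real API: the trait `Kernel`, the types `Fallback` and `Avx2`, and the functions `pack_general` and `pack_avx2` are all hypothetical names. The idea is that an `#[inline(always)]` generic routine, when called from a function marked `#[target_feature(enable = "avx2")]`, is compiled again with AVX2 enabled.

```rust
/// General packing routine (hypothetical): copy a `rows x cols` strided
/// matrix into a contiguous, column-by-column panel.
#[inline(always)]
fn pack_general(dst: &mut [f32], src: &[f32], rows: usize, cols: usize, rs: usize, cs: usize) {
    assert!(dst.len() >= rows * cols);
    for j in 0..cols {
        for i in 0..rows {
            dst[j * rows + i] = src[i * rs + j * cs];
        }
    }
}

/// The same routine instantiated with AVX2 enabled; the inlined body
/// is compiled for that target feature.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn pack_avx2(dst: &mut [f32], src: &[f32], rows: usize, cols: usize, rs: usize, cs: usize) {
    pack_general(dst, src, rows, cols, rs, cs)
}

/// Hypothetical kernel trait: each kernel may override the packing function.
trait Kernel {
    fn pack(dst: &mut [f32], src: &[f32], rows: usize, cols: usize, rs: usize, cs: usize) {
        pack_general(dst, src, rows, cols, rs, cs)
    }
}

/// Fallback kernel: keeps the default packing function.
struct Fallback;
impl Kernel for Fallback {}

/// AVX2 kernel: uses the AVX2 instantiation when the CPU supports it.
struct Avx2;
impl Kernel for Avx2 {
    fn pack(dst: &mut [f32], src: &[f32], rows: usize, cols: usize, rs: usize, cs: usize) {
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
        {
            if is_x86_feature_detected!("avx2") {
                // Safe to call: the feature was detected at runtime.
                unsafe { pack_avx2(dst, src, rows, cols, rs, cs) };
                return;
            }
        }
        pack_general(dst, src, rows, cols, rs, cs)
    }
}
```

Both kernels produce identical packed panels; only the code generated for the copy loop differs.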