In the sgemm avx kernel, transpose if we can match C's layout

The sgemm avx kernel computes A B → C and prefers C to be row major. If C is column major instead (that is, neither row major nor a custom strided layout), we can simply compute the transposed product B^T A^T → C^T.
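As a rough illustration of the idea (not the actual kernel code), here is a minimal sketch with a naive strided product standing in for the avx kernel. The names `gemm_naive` and `gemm_match_c_layout` are hypothetical, and the column major test (`rsc == 1 && csc != 1`) is an assumption about how the check could be written. Transposing a strided matrix only swaps its row and column strides, so no data is moved.

```rust
/// Naive strided product C = A B (hypothetical stand-in for the real kernel).
/// A is m x k, B is k x n, C is m x n; rs*/cs* are row/column strides in elements.
fn gemm_naive(
    m: usize, k: usize, n: usize,
    a: &[f32], rsa: usize, csa: usize,
    b: &[f32], rsb: usize, csb: usize,
    c: &mut [f32], rsc: usize, csc: usize,
) {
    for i in 0..m {
        for j in 0..n {
            let mut sum = 0.0;
            for p in 0..k {
                sum += a[i * rsa + p * csa] * b[p * rsb + j * csb];
            }
            c[i * rsc + j * csc] = sum;
        }
    }
}

/// If C is column major, compute C^T = B^T A^T so that the kernel
/// sees a row major output. Otherwise call the kernel unchanged.
fn gemm_match_c_layout(
    m: usize, k: usize, n: usize,
    a: &[f32], rsa: usize, csa: usize,
    b: &[f32], rsb: usize, csb: usize,
    c: &mut [f32], rsc: usize, csc: usize,
) {
    // Assumed check: unit row stride and non-unit column stride.
    if rsc == 1 && csc != 1 {
        // Swap the roles of A and B, swap m and n, and swap every
        // row stride with the matching column stride.
        gemm_naive(n, k, m, b, csb, rsb, a, csa, rsa, c, csc, rsc);
    } else {
        gemm_naive(m, k, n, a, rsa, csa, b, rsb, csb, c, rsc, csc);
    }
}
```

With a 2×3 row major A, a 3×2 row major B and a column major C (`rsc = 1`, `csc = 2`), the transposed call writes the same values into C that the direct call would, only the traversal order seen by the kernel changes.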

This removes the row major bias in the layout benchmarks (sgemm avx), and we are equally fast on f and c outputs:

(The notation we use is "c" for row major and "f" for column major. For example, "fcf" means that A is column major, B is row major, and C is column major in the matrix multiplication A B → C.)

 name                 63 ns/iter  62 ns/iter  diff ns/iter   diff %
 layout_f32_032::ccc  2,032       2,033                  1    0.05%
 layout_f32_032::ccf  2,279       2,026               -253  -11.10%
 layout_f32_032::cfc  2,275       2,291                 16    0.70%
 layout_f32_032::cff  2,532       2,288               -244   -9.64%
 layout_f32_032::fcc  1,783       1,778                 -5   -0.28%
 layout_f32_032::fcf  2,046       1,787               -259  -12.66%
 layout_f32_032::ffc  2,020       2,029                  9    0.45%
 layout_f32_032::fff  2,301       2,035               -266  -11.56%

Basically, this just takes up the offer to transpose that was already in the code.

Regular (row major only) benchmarks show no change or only a minuscule one:

 mat_mul_f32::m004  227         215                  -12  -5.29% 
 mat_mul_f32::m006  256         243                  -13  -5.08% 
 mat_mul_f32::m008  201         203                    2   1.00% 
 mat_mul_f32::m012  568         521                  -47  -8.27% 
 mat_mul_f32::m016  462         469                    7   1.52% 
 mat_mul_f32::m127  95,092      95,351               259   0.27%

The uneven cases even improve because the masked kernel is column major by default (yet another thing to tweak).

bluss / matrixmultiply