The sgemm avx kernel prefers A B → C with C row major by default, but if
C is column major (and not row major or a custom strided layout), we
can simply compute the transpose Bᵀ Aᵀ → Cᵀ instead.
This removes the row major bias in the layout benchmarks (sgemm avx), and we are
equally fast on f and c outputs:
(The notation is "c" for row major and "f" for column major.
For example, "fcf" means that A is column major, B is row major, and C is column major in the matrix multiplication A B → C.)
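
A minimal sketch of the transpose trick, not the actual kernel code: `Layout`, `gemm_ref` and `gemm` are hypothetical names, and `gemm_ref` is a naive triple loop standing in for the real avx kernel. It assumes element (i, j) of a matrix lives at `i * rs + j * cs`, so transposing a matrix is just swapping its two strides, and no data is moved.

```rust
/// Hypothetical strided layout: element (i, j) is stored at i * rs + j * cs.
#[derive(Clone, Copy)]
struct Layout {
    rs: usize, // row stride
    cs: usize, // column stride
}

/// Transposing a strided matrix only swaps its strides.
fn transposed(l: Layout) -> Layout {
    Layout { rs: l.cs, cs: l.rs }
}

/// Naive stand-in for the real kernel: C = A B, where A is m × k,
/// B is k × n and C is m × n. Assume it is fastest when C is row major.
fn gemm_ref(
    m: usize, k: usize, n: usize,
    a: &[f32], la: Layout,
    b: &[f32], lb: Layout,
    c: &mut [f32], lc: Layout,
) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * la.rs + p * la.cs] * b[p * lb.rs + j * lb.cs];
            }
            c[i * lc.rs + j * lc.cs] = acc;
        }
    }
}

/// Dispatch: if C is column major, compute Cᵀ = Bᵀ Aᵀ instead.
/// Cᵀ is n × m and row major in the same memory, so the kernel
/// sees its preferred output layout.
fn gemm(
    m: usize, k: usize, n: usize,
    a: &[f32], la: Layout,
    b: &[f32], lb: Layout,
    c: &mut [f32], lc: Layout,
) {
    if lc.rs == 1 && lc.cs != 1 {
        // Swap the operands and the outer dimensions, transpose every layout.
        gemm_ref(n, k, m, b, transposed(lb), a, transposed(la), c, transposed(lc));
    } else {
        gemm_ref(m, k, n, a, la, b, lb, c, lc);
    }
}
```

With this dispatch, a "cff" call (A row major, B and C column major) and a "cfc" call write the same product, each in the layout requested for C.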
Basically, this just takes up the offer to transpose that was already in the code.
Regular (row major only) benchmarks show no or minuscule change:
The uneven cases even improve because the masked kernel is column major by default (yet another thing to tweak).