bluss / matrixmultiply

General matrix multiplication of f32 and f64 matrices in Rust. Supports matrices with general strides.
https://docs.rs/matrixmultiply/
Apache License 2.0

Allow operations on transposed matrices, i.e. Op(A) and Op(B), and DSYRK #25

Open SuperFluffy opened 6 years ago

SuperFluffy commented 6 years ago

I wanted to use the integer gemm code from 0430cf0, and realized that there currently is no way of performing an operation on transposed matrices, while I wanted to compute A^T A. In the BLAS context, the transpose or complex conjugate of a matrix is usually expressed as Op(A), where Op is specified through the parameter TRANSA, given by the character 'N', 'T', or 'C'.

I realize that since we actually have dimensions and strides as part of our matrix ArrayBase structures, we can just circumvent the issue by transposing the matrix view via fn t(mut self). The questions are:

While addressing this issue, it's probably also worth investigating how DSYRK, for the specific case of A^T A, is implemented differently in the BLIS library.

I have a hard time understanding how BLIS defines its kernels, specifically how the different cases of Op(A) Op(B) are implemented. I am happy to dig in and write a benchmark comparing the current ndarray approach to writing specific kernels. Can you point me to the right spot to look at?

bluss commented 6 years ago

If you want A^T A (a square result), you can transpose A simply by swapping rsa and csa, i.e., the strides.
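The stride-swap trick can be illustrated without the crate itself. Below is a naive strided matmul (a hypothetical helper, not matrixmultiply's API, but following its (rs, cs) stride convention): passing A's strides in the order (csa, rsa) for the left operand yields A^T A with no data movement.

```rust
// Naive strided matrix multiply: C = A · B, with A m×k and B k×n read
// through arbitrary (row, col) strides; C is stored row major for brevity.
// This only mirrors matrixmultiply's stride convention; it is a sketch.
fn strided_gemm(
    m: usize, k: usize, n: usize,
    a: &[f64], rsa: isize, csa: isize,
    b: &[f64], rsb: isize, csb: isize,
    c: &mut [f64],
) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0;
            for p in 0..k {
                let ai = (i as isize * rsa + p as isize * csa) as usize;
                let bi = (p as isize * rsb + j as isize * csb) as usize;
                acc += a[ai] * b[bi];
            }
            c[i * n + j] = acc;
        }
    }
}

fn main() {
    // A is 2×3, row major: strides (rsa, csa) = (3, 1).
    let a = [1.0, 2.0, 3.0,
             4.0, 5.0, 6.0];
    // Aᵀ A is 3×3. Express Aᵀ by passing A with swapped strides (1, 3):
    let mut c = [0.0; 9];
    strided_gemm(3, 2, 3, &a, 1, 3, &a, 3, 1, &mut c);
    // Entry (0,0) is column 0 of A dotted with itself: 1·1 + 4·4 = 17.
    assert_eq!(c[0], 17.0);
    println!("{:?}", c);
}
```

The same swap works on the real sgemm/dgemm calls, since they take rsa and csa as plain parameters.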

The kernel doesn't care about the layout of A and B; that is handled in packing. The kernel does write into C directly, though, but that's not as time critical.

bluss commented 6 years ago

I'll be back for the other questions.

bluss commented 6 years ago

The existing packing code is extremely simple but not bad, and we can look at whether it can be improved for specific input layouts.

It's responsible for copying the active parts of A and B into a dense buffer; we copy strips of A and B that correspond to the native width/height of the kernel.

Packing should be 5-10% of the running time, maybe a bit more for tiny problems.
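A minimal sketch of that packing step (hypothetical names; MR chosen arbitrarily, real kernels use their native width): copy MR-row strips of A, column by column within each strip, into a contiguous buffer the kernel can stream through, zero-padding the edge strip.

```rust
// Pack an m×k matrix A (arbitrary strides rsa, csa) into panels of MR rows.
// Inside each panel the elements are stored column by column, which is the
// fixed layout the microkernel reads. Rows past the edge are zero-padded.
const MR: usize = 4; // illustrative panel height

fn pack_a(m: usize, k: usize, a: &[f64], rsa: isize, csa: isize) -> Vec<f64> {
    let panels = (m + MR - 1) / MR;
    let mut buf = vec![0.0; panels * MR * k];
    for panel in 0..panels {
        let row0 = panel * MR;
        for j in 0..k {
            for r in 0..MR {
                let i = row0 + r;
                if i < m {
                    let src = (i as isize * rsa + j as isize * csa) as usize;
                    buf[panel * MR * k + j * MR + r] = a[src];
                }
            }
        }
    }
    buf
}

fn main() {
    // 2×3 row-major A; with MR = 4 it fits in one zero-padded panel.
    let a = [1.0, 2.0, 3.0,
             4.0, 5.0, 6.0];
    let packed = pack_a(2, 3, &a, 3, 1);
    // Column 0 of A comes first (1, 4), then padding, then column 1, ...
    assert_eq!(&packed[..4], &[1.0, 4.0, 0.0, 0.0]);
}
```

Whatever the strides of A, the kernel then always reads the same dense panel layout, which is why the input layouts only affect the cost of this copy.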

bluss commented 6 years ago

One packing improvement in #26, but there should be more to gain.

Given this, at the moment one should find that for a column- or row-major matrix A, one of A^T A and A A^T is going to be faster than the other: the one that puts the column-major matrix on the left-hand side.

bluss commented 6 years ago

Layout benchmarks from #26.

The notation we use is "C" for row major and "F" for column major.

For example, "fcf" means that A is column major, B is row major, and C is column major in the matrix multiplication A B → C.
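For reference, the letters map to strides like this (a hypothetical helper, following the crate's (rs, cs) stride convention):

```rust
// Map a layout letter to (row stride, column stride) for a rows×cols matrix:
// 'c' (row major): stepping down a row jumps `cols` elements, along a row 1;
// 'f' (column major): the opposite.
fn strides(layout: char, rows: usize, cols: usize) -> (isize, isize) {
    match layout {
        'c' => (cols as isize, 1),
        'f' => (1, rows as isize),
        _ => panic!("layout must be 'c' or 'f'"),
    }
}

fn main() {
    // "fcf" for 32×32 matrices: A is column major, B row major.
    assert_eq!(strides('f', 32, 32), (1, 32));
    assert_eq!(strides('c', 32, 32), (32, 1));
}
```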

test layout_f32::ccc   ... bench:       2,032 ns/iter (+/- 92)
test layout_f32::ccf   ... bench:       2,279 ns/iter (+/- 61)
test layout_f32::cfc   ... bench:       2,275 ns/iter (+/- 30)
test layout_f32::cff   ... bench:       2,532 ns/iter (+/- 102)
test layout_f32::fcc   ... bench:       1,783 ns/iter (+/- 60)
test layout_f32::fcf   ... bench:       2,046 ns/iter (+/- 58)
test layout_f32::ffc   ... bench:       2,020 ns/iter (+/- 28)
test layout_f32::fff   ... bench:       2,301 ns/iter (+/- 111)

This shows that the sgemm kernel is biased towards a C matrix in c layout, and towards A and B matrices using layouts f and c respectively. So fcc is the fastest.

So once we have the benchmarks we can try to improve it.

bluss commented 6 years ago

The sgemm kernel is 8×8, so it's simple to just flip it inside the kernel, and doing that fixes the bias in the destination matrix's layout:

 name                 63 ns/iter  62 ns/iter  diff ns/iter   diff % 
 layout_f32_032::ccc  2,032       2,033                  1    0.05% 
 layout_f32_032::ccf  2,279       2,026               -253  -11.10% 
 layout_f32_032::cfc  2,275       2,291                 16    0.70% 
 layout_f32_032::cff  2,532       2,288               -244   -9.64% 
 layout_f32_032::fcc  1,783       1,778                 -5   -0.28% 
 layout_f32_032::fcf  2,046       1,787               -259  -12.66% 
 layout_f32_032::ffc  2,020       2,029                  9    0.45% 
 layout_f32_032::fff  2,301       2,035               -266  -11.56% 

Fixed in #27

Now the F/C layout of the destination doesn't matter. The f32 kernel is the only one I've recently rewritten, so it's the one I know best. Next is to figure out whether the f64 4×8 kernel can be made this flexible.

bluss commented 6 years ago

The dgemm kernel doesn't show a clear bias in the destination at the moment, only the same packing bias in the inputs a and b. Unlike the sgemm AVX kernel, the dgemm kernel is just autovectorized right now.

test layout_f64_032::ccc ... bench:       4,329 ns/iter (+/- 966)
test layout_f64_032::ccf ... bench:       4,315 ns/iter (+/- 419)
test layout_f64_032::cfc ... bench:       4,327 ns/iter (+/- 126)
test layout_f64_032::cff ... bench:       4,229 ns/iter (+/- 149)
test layout_f64_032::fcc ... bench:       4,209 ns/iter (+/- 158)
test layout_f64_032::fcf ... bench:       4,206 ns/iter (+/- 233)
test layout_f64_032::ffc ... bench:       4,364 ns/iter (+/- 169)
test layout_f64_032::fff ... bench:       4,125 ns/iter (+/- 218)

bluss commented 6 years ago

@SuperFluffy At least some small improvements were inspired by the question :fireworks:

I think that we don't need the transpose arguments, because we support custom strides. So this crate can just provide the bare bones API that it does now. As we have noted though, matrixmultiply can benefit from adapting to the layout of the inputs and outputs in some places.

The microkernel uses a fixed memory layout for inputs A, B and that is arranged by the matrix packing step. Packing copies the inputs using their custom layout into the packing buffer's fixed layout. In this sense the microkernel doesn't care about the layouts.

The only time packing is redundant for an 8×8 microkernel is when we are computing an 8×K times K×8 matrix multiplication where the LHS is column major and the RHS row major. Maybe that illustrates what packing does.

The packing step is also used by, for example, BLIS to do precomputation: complex conjugation, zero padding at matrix edges and across the diagonal for a symmetric input, and inverting (1/x) the diagonal when implementing trsm.

The microkernel is responsible for writing directly to the output C, but only in exactly kernel-sized chunks. The exception is when the "masked" kernel is used: when an uneven part of C is left over (for example a border of 1-7 elements that doesn't fit into an 8×8 block), for those blocks only, we redirect the kernel to write to an internal buffer and then copy the needed parts from the buffer into the real C.
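The masking step can be sketched like this (hypothetical names; `kernel` stands in for the actual microkernel, and the output is row major for brevity):

```rust
const MR: usize = 8;
const NR: usize = 8;

// Run the kernel into a full MR×NR scratch buffer, then copy only the
// valid m×n corner into the real output C.
fn masked_kernel<F>(m: usize, n: usize, kernel: F, c: &mut [f64], rsc: usize)
where
    F: Fn(&mut [f64; MR * NR]),
{
    let mut scratch = [0.0f64; MR * NR];
    kernel(&mut scratch);
    for i in 0..m {
        for j in 0..n {
            c[i * rsc + j] = scratch[i * NR + j];
        }
    }
}

fn main() {
    // A 3×5 edge block of C: only 15 of the 64 scratch values are copied out.
    let mut c = vec![0.0; 3 * 5];
    masked_kernel(3, 5, |buf| {
        // Stand-in "kernel": fill the scratch with its own flat index.
        for (idx, x) in buf.iter_mut().enumerate() {
            *x = idx as f64;
        }
    }, &mut c, 5);
    assert_eq!(c[0], 0.0); // scratch (0,0)
    assert_eq!(c[5], 8.0); // scratch (1,0): scratch row 1 starts at NR = 8
}
```

This keeps the fast path branch-free: the kernel itself never has to handle partial tiles.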

Anyway, the microkernel could prefer an output in either C or F layout, or just adapt and handle both. The AVX sgemm kernel now handles C and F equally well, but it has a slower fallback when the output has custom strides that don't match a contiguous layout.

BLIS says that it needs only 2 or 3 microkernels (gemm, trsm, gemmtrsm) to implement all level-3 algorithms. https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#level-3 In that sense we have what it takes for dsyrk: we have the kernel, just not the adaptation of packing, kernel masking, and the loops around the kernel needed to implement the operation. The benefit of dsyrk over dgemm should be computing only those blocks that cover a triangular part of C.
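That saving can be quantified with a small block-counting sketch (hypothetical, assuming an 8×8 kernel tile over C): a syrk only needs to visit tiles that touch the lower triangle.

```rust
// Count kernel tiles of C an n×n syrk must visit (those on or below the
// diagonal, by tile index) versus a full gemm, for a given tile size.
fn syrk_vs_gemm_tiles(n: usize, tile: usize) -> (usize, usize) {
    let nt = (n + tile - 1) / tile; // tiles per dimension, rounded up
    let mut syrk = 0;
    for bi in 0..nt {
        for bj in 0..nt {
            // Tile (bi, bj) intersects the lower triangle iff bj <= bi.
            if bj <= bi {
                syrk += 1;
            }
        }
    }
    (syrk, nt * nt)
}

fn main() {
    let (syrk, gemm) = syrk_vs_gemm_tiles(256, 8);
    // 32×32 tiles: gemm computes 1024 tiles, syrk only the 528 on or
    // below the diagonal, a bit over half the work.
    assert_eq!((syrk, gemm), (528, 1024));
}
```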

The BLIS gemm kernel doesn't take more relevant parameters than we use, because its auxinfo parameter is used for prefetching, and cntx can apparently be safely ignored.

https://github.com/flame/blis/blob/master/docs/KernelsHowTo.md#level-3-micro-kernels

The loops around the kernel and the packing, as pictured by BLIS. We implement exactly this:

[figure: BLIS diagram of the loops around the microkernel, with the packing of A and B]

From this page in the BLIS FAQ.