bluss / matrixmultiply

General matrix multiplication of f32 and f64 matrices in Rust. Supports matrices with general strides.
https://docs.rs/matrixmultiply/
Apache License 2.0

A faster multiplication with more ymm; also make the mask kernel more generic #1

Closed: bluss closed this 8 years ago

bluss commented 8 years ago

(However, we cheat and use an array on the stack for now, with the correct size. This can be fixed when we get more than 4x4 kernels. I did check that the compiled code is the same when using a vector there instead of an array.)

Summary of improvements! With this, we halve the distance to OpenBLAS's f32 benchmark:

name                        before ns/iter  after ns/iter    diff ns/iter   diff %
mat_mul_f32::m127           290,740         208,380               -82,360  -28.33%
mat_mul_f32::mix10000       23,430,259      16,291,680         -7,138,579  -30.47%
mat_mul_f64::m127           433,429         343,568               -89,861  -20.73%
mat_mul_f64::mix10000       36,255,323      27,606,065         -8,649,258  -23.86%
nonative_mat_mul_f32::m127  375,725         291,936               -83,789  -22.30%
nonative_mat_mul_f64::m127  503,924         478,248               -25,676   -5.10%
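
For illustration, the "cheat" mentioned above amounts to something like this; a minimal sketch with made-up names, where MR and NR stand in for the 4x4 kernel dimensions and are not the crate's actual identifiers:

```rust
// Hypothetical stand-ins for the current 4x4 kernel dimensions.
const MR: usize = 4;
const NR: usize = 4;

fn main() {
    // The cheat: a stack array with exactly the size the 4x4 kernel needs,
    // instead of a buffer allocated up front. A Vec here reportedly compiles
    // to the same code, but the fixed size only works while every kernel
    // fits in an MR x NR tile.
    let mut mask_buf = [0.0_f32; MR * NR];

    // ... mask_buf would be passed by pointer into the masked kernel ...
    mask_buf[0] = 1.0;
    assert_eq!(mask_buf.len(), MR * NR);
}
```
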
bluss commented 8 years ago

Optimizing compilers are so mercurial; this can break at any time...

bluss commented 8 years ago

I think the lesson is that we managed to take a simple microkernel and make it simpler, by removing all these factors: α, β, and generic strides for C.
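
As a rough picture of what that simplification means (a sketch with hypothetical names and a plain scalar loop, not the crate's actual kernel): the simplified microkernel always computes A*B into a contiguous MR x NR tile, i.e. α=1, β=0 and fixed output strides.

```rust
// Sketch only: a scalar stand-in for the simplified microkernel, with
// hypothetical names. It always computes C = A * B for one MR x NR tile;
// alpha, beta and the general destination strides are gone from the signature.
const MR: usize = 4;
const NR: usize = 4;

fn kernel_simple(k: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    // A is packed as k panels of MR values, B as k panels of NR values.
    let mut ab = [[0.0_f32; NR]; MR];
    for l in 0..k {
        for i in 0..MR {
            for j in 0..NR {
                ab[i][j] += a[l * MR + i] * b[l * NR + j];
            }
        }
    }
    // The finished tile goes to a contiguous, row-major MR x NR buffer.
    for i in 0..MR {
        for j in 0..NR {
            c[i * NR + j] = ab[i][j];
        }
    }
}

fn main() {
    let k = 2;
    let a: Vec<f32> = (0..k * MR).map(|x| x as f32).collect();
    let b: Vec<f32> = (0..k * NR).map(|x| x as f32).collect();
    let mut c = vec![0.0_f32; MR * NR];
    kernel_simple(k, &a, &b, &mut c);
    println!("{:?}", c);
}
```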

vbarrielle commented 8 years ago

Cool optimization! Funny how sensitive the optimizer can be.

A few questions:

(However, we cheat and use an array on the stack for now, with the correct size. This can be fixed when we get more than 4x4 kernels. I did check that the compiled code is the same when using a vector there instead of an array.)

What prevents the stack array from being inside the masked kernel as it was before?

The first difference is that masked_kernel writes into a local buffer with known row/column strides. After that it copies from the buffer to C, the destination. The second difference is that the kernel is then always called with fixed α=1, β=0.

Sounds like the optimizer gets better in a simpler situation, maybe the extra parameters in the non-masked kernel could be removed?

I noticed that masked_kernel had its own codegen of the gemm kernel, and that it produced better code! It was using wide AVX registers (ymm1, ymm2, etc.) instead of just the xmm registers. I don't really understand why this happens. So... if I change all kernel uses to go through masked_kernel, performance improves!

Wild speculation, but I think this is really related to writing to a contiguous buffer (I was tempted to say "on the stack", but the fact that inline(never) is required argues against that). With a contiguous buffer as output there are more opportunities for vectorization.
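
A sketch of the pattern being discussed (hypothetical code, not the crate's implementation): the kernel writes a full tile into a contiguous row-major buffer with α=1, β=0, and a separate copy-out applies the real α and β, follows C's general row/column strides, and masks off the rows and columns that fall outside the matrix.

```rust
const MR: usize = 4;
const NR: usize = 4;

// Sketch of the masked copy-out (hypothetical code). `buf` is the contiguous
// row-major MR x NR tile the kernel wrote with alpha = 1 and beta = 0; this
// step applies the real alpha/beta, follows the general row/column strides of
// C, and only touches the `rows` x `cols` part that is inside the matrix.
unsafe fn copy_masked(
    buf: &[f32; MR * NR],
    alpha: f32,
    beta: f32,
    c: *mut f32,
    rsc: isize,  // row stride of C
    csc: isize,  // column stride of C
    rows: usize, // rows <= MR at the bottom edge
    cols: usize, // cols <= NR at the right edge
) {
    for i in 0..rows {
        for j in 0..cols {
            let cij = c.offset(i as isize * rsc + j as isize * csc);
            *cij = alpha * buf[i * NR + j] + beta * *cij;
        }
    }
}

fn main() {
    let buf = [1.0_f32; MR * NR];
    // A 3 x 2 edge tile of a row-major matrix: row stride 2, column stride 1.
    let mut c = vec![0.0_f32; 3 * 2];
    unsafe { copy_masked(&buf, 1.0, 0.0, c.as_mut_ptr(), 2, 1, 3, 2) };
    assert!(c.iter().all(|&x| x == 1.0));
}
```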

Sorry for the long comment, just wanted to share my thoughts, I'm super interested in the work you're doing right now :)

bluss commented 8 years ago

What prevents the stack array from being inside the masked kernel as it was before?

You're right, it could stay there for now. I guess it could also be an 8x8 array on the stack, and then it supports all kernels we can imagine at this point?

The reason I put it outside is that I was experimenting with putting in the BLIS project's sgemm kernel. It worked pretty well: for the mat_mul_f32::m127 benchmark it clocked around 150,000 ns/iter. Because it uses real AVX aligned loads, it needs that temporary array to be 32-byte aligned. That's not something we can guarantee for a plain array, so I used a Vec that you'd allocate up front.

So in the long run, for this reason, I expect the mask buffer to be something that's allocated up front and passed in by pointer.
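
One way to get such an up-front, 32-byte-aligned buffer, sketched here only as an illustration of the idea (not the crate's code), is to over-allocate a Vec and offset into it to the next 32-byte boundary:

```rust
use std::mem::size_of;

const ALIGN: usize = 32; // AVX aligned loads want 32-byte alignment

// Sketch: allocate the mask buffer up front, over-sized by one alignment
// unit, and hand out a pointer offset to the next 32-byte boundary. The Vec
// must stay alive for as long as the aligned pointer is in use.
fn aligned_buffer(len: usize) -> (Vec<f32>, *mut f32) {
    let extra = ALIGN / size_of::<f32>(); // 8 extra f32 elements
    let mut v = vec![0.0_f32; len + extra];
    let misalign = v.as_ptr() as usize % ALIGN;
    let skip = if misalign == 0 {
        0
    } else {
        (ALIGN - misalign) / size_of::<f32>()
    };
    let ptr = unsafe { v.as_mut_ptr().add(skip) };
    (v, ptr)
}

fn main() {
    // Big enough for an 8x8 kernel tile, as discussed above.
    let (_storage, buf) = aligned_buffer(8 * 8);
    assert_eq!(buf as usize % ALIGN, 0);
}
```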

Sounds like the optimizer gets better in a simpler situation, maybe the extra parameters in the non-masked kernel could be removed?

Yes, good point. I don't think the extra unused parameters are a problem. The kernel is inlined directly into the masked_kernel function, so everything unused should just be removed.
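
A toy illustration of that point (hypothetical code, not the actual source): because the inner function is forced inline into its caller, arguments that are passed as constants fold away, so the extra generality costs nothing.

```rust
// Toy sketch of the inlining arrangement discussed above (not the crate's
// code). `kernel` is always inlined into its caller, so the constant
// alpha = 1.0 and beta = 0.0 it is called with can be folded away, while
// #[inline(never)] keeps masked_kernel as a separate function.
#[inline(always)]
fn kernel(alpha: f32, beta: f32, ab: f32, c: f32) -> f32 {
    alpha * ab + beta * c
}

#[inline(never)]
pub fn masked_kernel(ab: f32) -> f32 {
    // alpha and beta are fixed here, so after inlining they disappear.
    kernel(1.0, 0.0, ab, 0.0)
}

fn main() {
    println!("{}", masked_kernel(3.0)); // prints 3
}
```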

Right now I don't know whether the simplified microkernel without α, β, and strides is actually a better design in general, or whether it's going to be better to stick with BLIS's design.

bluss commented 8 years ago

The inline(never) probably just happens to split the function off at a good point. I really don't know. The output to a contiguous temporary array is probably an important factor.