Optimizing compilers, so mercurial. So this can break at any time.
I think the lesson is that we managed to take a simple microkernel and make it simpler, by removing all these factors: α, β, and generic strides for C.
Cool optimization! Funny how sensible the optimizer can be.
A few questions:
(However, we cheat and use an array on the stack for now, with the correct size. This can be fixed when we get more than 4x4 kernels. I did check that the compiled code is the same when using a vector there instead of an array.)
What prevents the stack array from being inside the masked kernel as it was before?
The first difference is that masked_kernel writes into a local buffer with known row/col stride. After that, it copies from the buffer to C, the destination. The second difference is that the kernel is then always called with fixed α=1, β=0.
Sounds like the optimizer does better in a simpler situation; maybe the extra parameters in the non-masked kernel could be removed?
I noticed that masked_kernel had its own codegen of the gemm kernel, and that it produced better code! It was using wide AVX registers (ymm1, ymm2, etc.) instead of just the xmm registers. I don't really understand why this happens. So, if I change all kernel use to go through masked_kernel, performance improves!
Wild speculation, but I think this is really related to writing to a contiguous buffer (I was tempted to say "on the stack", but the fact that inline(never) is required argues against that). With a contiguous buffer as output, there are more opportunities for vectorization.
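To make the speculation a bit more concrete, this is roughly the shape I have in mind (a hypothetical, naive 4x4 sketch with made-up names, not the actual kernel code):

```rust
const MR: usize = 4; // kernel rows
const NR: usize = 4; // kernel columns

// The plain kernel only ever sees alpha = 1, beta = 0 and a contiguous
// MR x NR destination, so the inner loop is as simple as it gets.
#[inline(always)]
unsafe fn kernel_4x4(k: usize, a: *const f32, b: *const f32, c: *mut f32) {
    for l in 0..k {
        for i in 0..MR {
            for j in 0..NR {
                *c.add(i * NR + j) += *a.add(l * MR + i) * *b.add(l * NR + j);
            }
        }
    }
}

// The masked wrapper owns the contiguous buffer and handles general strides
// and partial (masked) tiles when copying out to C.
#[inline(never)]
unsafe fn masked_kernel(
    k: usize, a: *const f32, b: *const f32,
    c: *mut f32, rsc: isize, csc: isize,
    rows: usize, cols: usize,
) {
    let mut buf = [0.0_f32; MR * NR]; // zeroed, so beta = 0 falls out for free
    kernel_4x4(k, a, b, buf.as_mut_ptr());
    for i in 0..rows {
        for j in 0..cols {
            *c.offset(i as isize * rsc + j as isize * csc) = buf[i * NR + j];
        }
    }
}
```

If it really is the contiguous output that matters, then in this shape the vectorizer only ever has to reason about stores into buf, with strides known at compile time, and never about C.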
Sorry for the long comment, just wanted to share my thoughts, I'm super interested in the work you're doing right now :)
What prevents the stack array from being inside the masked kernel as it was before?
You're right, it could stay there for now. I guess it could also be an 8x8 array on the stack, and then it supports all kernels we can imagine at this point?
The reason I put it outside is that I was experimenting with plugging in the BLIS project's sgemm kernel. It worked pretty well; for the mat_mul_f32::m127 benchmark it clocked around 150,000 ns/iter. Because it uses real aligned AVX loads, it needs that temporary array to be 32-byte aligned. That can't be guaranteed for a plain array, so I used a Vec that you'd allocate up front.
So in the long run, I expect the mask buffer to be something that's allocated up front and passed in by pointer, for this reason.
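For illustration, one way to get such a 32-byte aligned, allocated-up-front buffer (just a sketch with made-up names, not what the code does today) is a Vec of an over-aligned element type:

```rust
// A 32-byte aligned chunk of 8 f32s; a Vec of these gives 32-byte aligned storage.
#[derive(Clone, Copy)]
#[repr(align(32))]
struct Chunk32([f32; 8]);

// Allocate the mask buffer once, up front, rounded up to whole chunks.
fn alloc_mask_buffer(len: usize) -> Vec<Chunk32> {
    vec![Chunk32([0.0; 8]); (len + 7) / 8]
}

// Hand the kernel a raw f32 pointer into the aligned storage.
fn mask_buffer_ptr(buf: &mut [Chunk32]) -> *mut f32 {
    buf.as_mut_ptr() as *mut f32
}
```

The gemm driver would then keep one of these around (per call or per thread) and pass the pointer down to the masked kernel.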
Sounds like the optimizer does better in a simpler situation; maybe the extra parameters in the non-masked kernel could be removed?
Yes, good point. I don't think the extra unused parameters are a problem. The kernel is inlined directly into the masked_kernel function, so everything unused should just be removed.
Right now I don't know if the simplified microkernel without α, β, strides is a design that's even better in general, or if it's going to be better to stick with BLIS' design.
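Spelling out the difference (signatures only, roughly; not the exact code in either project):

```rust
// BLIS-style microkernel: alpha, beta and generic strides for C are parameters,
// so the kernel itself computes C := alpha * A*B + beta * C for any layout of C.
unsafe fn kernel_blis(
    k: usize, alpha: f32, a: *const f32, b: *const f32,
    beta: f32, c: *mut f32, rsc: isize, csc: isize,
) {
    let _ = (k, alpha, a, b, beta, c, rsc, csc); // body elided
}

// Simplified microkernel: alpha = 1, beta = 0 and a contiguous destination are
// baked in; the copy-out step would then have to apply any scaling and handle
// strides and edge tiles.
unsafe fn kernel_simple(k: usize, a: *const f32, b: *const f32, c: *mut f32) {
    let _ = (k, a, b, c); // body elided
}
```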
The inline(never) probably just happens to split off the function at a good point. I really don't know. The output to a contiguous temporary array is probably an important factor.
(However, we cheat and use an array on the stack for now, with the correct size. This can be fixed when we get more than 4x4 kernels. I did check that the compiled code is the same when using a vector there instead of an array.)
The other ingredient is inline(never). This seems to be important.
Summary of improvements! With this, we halve the distance to OpenBLAS's f32 benchmark.