Improve big column-major sgemm NN perf: replace barrier() with mem_fence() in the inner loop.
Improve small sgemm NN perf: for small sgemm (M*N < 900*900) where M or N is not a multiple of 32, use the kernel with micro tile size 2 by 2 instead of the kernel with micro tile size 6 by 6.
Note: kernels with other micro tile sizes might perform better than these two cases. A finer-tuned heuristic for switching between kernels would also be good to have.
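
For reference, the small-sgemm selection rule described above could be sketched roughly as follows. This is an illustration only, not the code in this change: the function name `pick_micro_tile` and the constants `SMALL_DIM` and `TILE_ALIGN` are hypothetical, and it assumes "M or N is not a multiple of 32" means that either dimension being unaligned is enough to pick the 2 by 2 kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical constants for illustration; not taken from the library source. */
enum { SMALL_DIM = 900, TILE_ALIGN = 32 };

/*
 * Returns the micro tile edge length (2 or 6) for an sgemm NN call
 * on an M x N output. Assumes M * N does not overflow size_t.
 */
static unsigned pick_micro_tile(size_t M, size_t N)
{
    /* "Small" means the output has fewer elements than a 900 x 900 matrix. */
    bool is_small = M * N < (size_t)SMALL_DIM * SMALL_DIM;

    /* Assumption: one unaligned dimension is enough to count as unaligned. */
    bool is_unaligned = (M % TILE_ALIGN != 0) || (N % TILE_ALIGN != 0);

    return (is_small && is_unaligned) ? 2u : 6u;
}
```

Under these assumptions a 100 x 100 problem selects the 2 by 2 kernel, while 64 x 64 (aligned) and 1000 x 1000 (not small) stay on the 6 by 6 kernel.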