clMathLibraries / clBLAS

a software library containing BLAS functions written in OpenCL
Apache License 2.0

fix the performance drop of SGEMM column major NT or row major TN when lda and ldb are big multiples of 1024 such as 4096, 5120, 6144, 7168, 8192 #133

Closed TimmyLiu closed 9 years ago

TimmyLiu commented 9 years ago

This PR is an attempt to fix the performance drop of SGEMM column-major NT or row-major TN when lda and ldb are large multiples of 1024, such as 4096, 5120, 6144, 7168, 8192 and so forth.

The performance drop at those sizes is likely caused by cache thrashing, with multiple CUs trying to read from the same global memory channel. The easiest way to fix this problem is to pad the leading dimensions lda and ldb of the input data (for example lda = 4097, ldb = 4097). However, an OpenCL-level library has no control over how users allocate host memory, so it is worthwhile to try to fix this problem from within the library.
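For reference, the caller-side padding workaround looks roughly like the sketch below. This is illustrative only (the names `n` and `paddedLd` are not from the PR); the idea is simply to allocate A and B with a leading dimension that is not a multiple of 1024:

```c
/* Caller-side padding workaround (illustrative sketch, not part of clBLAS):
 * allocate with a leading dimension such as 4097 instead of 4096 so that
 * concurrent CUs are less likely to hit the same memory channel. */
#include <stdlib.h>

int main(void)
{
    const size_t n        = 4096;   /* logical matrix dimension       */
    const size_t paddedLd = n + 1;  /* padded leading dimension, 4097 */

    /* column-major storage: leading dimension * number of columns */
    float *A = (float *)malloc(paddedLd * n * sizeof(float));
    float *B = (float *)malloc(paddedLd * n * sizeof(float));

    /* ... fill A and B, create cl_mem buffers of the same padded size,
     * and pass lda = ldb = paddedLd to clblasSgemm ... */

    free(A);
    free(B);
    return 0;
}
```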

At sizes 4096 and 5120 we choose kernels with a macro tile size of 128 instead of 96. This effectively reduces the number of work groups and thus the chance of different CUs reading from the same channel at the same time.

At size 6144 we still use kernels with a macro tile size of 96 (which use fewer registers than the 128 approach). The change here is that instead of calling one GEMM kernel with M=N=K=6144, we make 4 identical GEMM calls with M=N=6144 and K=1536.

This is based on some experimental results on the kernel with macro tile size of 96, shown below.

[Figure: sgemm4096_sweepk — SGEMM performance vs. K for the macro tile size 96 kernel]

It can be seen that in our experiment the performance starts to drop when K > 3584. Note that the 4-kernel approach requires updates to offsetA, offsetB and beta, as sketched below.
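A minimal host-side sketch of this K-split for the column-major NT case is given below. It is not code from the PR: the function name and variables are illustrative, and K is assumed to be evenly divisible by the number of splits. Each call consumes a K/4-wide slice of A and B, advancing offA and offB by whole columns, and only the first call applies the original beta so that later calls accumulate into C:

```c
/* Illustrative sketch: split one SGEMM (e.g. M = N = K = 6144) into
 * 4 calls with K = 1536 each.  Column-major, A non-transposed,
 * B transposed (NT).  Error handling kept minimal. */
#include <clBLAS.h>

clblasStatus sgemm_nt_ksplit(cl_command_queue queue,
                             size_t M, size_t N, size_t K, cl_float alpha,
                             cl_mem A, size_t offA, size_t lda,   /* A is M x K */
                             cl_mem B, size_t offB, size_t ldb,   /* B is N x K */
                             cl_float beta,
                             cl_mem C, size_t offC, size_t ldc)
{
    const size_t splits = 4;
    const size_t kChunk = K / splits;          /* assume K divides evenly */
    clblasStatus err = clblasSuccess;

    for (size_t s = 0; s < splits && err == clblasSuccess; ++s) {
        size_t   k0       = s * kChunk;
        /* in column-major NT, the k index walks along the columns of
         * both A and B, so each chunk starts k0 columns further in */
        size_t   offAk    = offA + k0 * lda;
        size_t   offBk    = offB + k0 * ldb;
        cl_float betaCall = (s == 0) ? beta : 1.0f;  /* accumulate after 1st call */

        err = clblasSgemm(clblasColumnMajor, clblasNoTrans, clblasTrans,
                          M, N, kChunk, alpha,
                          A, offAk, lda,
                          B, offBk, ldb,
                          betaCall, C, offC, ldc,
                          1, &queue, 0, NULL, NULL);
    }
    return err;
}
```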

It seems that for even bigger sizes the 4-kernel approach is no longer sufficient. For sizes 7168 and 8192 we attempt to split the C matrix as well. For example, instead of calling one GEMM kernel with M=N=K=8192, we make 16 identical GEMM calls with M=N=4096 and K=2048. Note that updates to offsetA, offsetB, offsetC and beta are required.
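The corresponding sketch when C is also split is shown below (again illustrative, not code from the PR; column-major NT, with block sizes assumed to divide the dimensions evenly). It loops over the row and column blocks of C and over the K chunks, adjusting offA, offB, offC and beta per call; for 8192 with 4096 x 4096 blocks of C and K chunks of 2048 this yields the 16 calls mentioned above:

```c
/* Illustrative sketch: split M, N and K, e.g. M = N = K = 8192 into
 * 2 x 2 blocks of C (4096 x 4096) times 4 K-chunks of 2048 = 16 calls.
 * Column-major NT; minimal error handling. */
#include <clBLAS.h>

clblasStatus sgemm_nt_blocked(cl_command_queue queue,
                              size_t M, size_t N, size_t K,
                              size_t mBlk, size_t nBlk, size_t kBlk,
                              cl_float alpha,
                              cl_mem A, size_t offA, size_t lda,  /* A is M x K */
                              cl_mem B, size_t offB, size_t ldb,  /* B is N x K */
                              cl_float beta,
                              cl_mem C, size_t offC, size_t ldc)
{
    clblasStatus err = clblasSuccess;

    for (size_t i0 = 0; i0 < M; i0 += mBlk)          /* rows of C    */
    for (size_t j0 = 0; j0 < N; j0 += nBlk)          /* columns of C */
    for (size_t k0 = 0; k0 < K; k0 += kBlk) {        /* inner dim    */
        /* column-major offsets of the sub-matrices for this call */
        size_t   offAblk  = offA + k0 * lda + i0;    /* A(i0, k0)            */
        size_t   offBblk  = offB + k0 * ldb + j0;    /* B(j0, k0), B is N x K */
        size_t   offCblk  = offC + j0 * ldc + i0;    /* C(i0, j0)            */
        cl_float betaCall = (k0 == 0) ? beta : 1.0f; /* accumulate over K chunks */

        err = clblasSgemm(clblasColumnMajor, clblasNoTrans, clblasTrans,
                          mBlk, nBlk, kBlk, alpha,
                          A, offAblk, lda,
                          B, offBblk, ldb,
                          betaCall, C, offCblk, ldc,
                          1, &queue, 0, NULL, NULL);
        if (err != clblasSuccess)
            return err;
    }
    return err;
}
```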

The graph below compares the performance of this PR against current clBLAS performance.

[Figure: clblas_sgemmnt_channel — SGEMM NT performance, this PR vs. current clBLAS]

Note that for even bigger matrix sizes (> 8192) it is likely that we need to split the matrices further. Analysis of non-square cases (lda != ldb, lda % 1024 == 0, ldb % 1024 == 0) might also be interesting.