clMathLibraries / clBLAS

a software library containing BLAS functions written in OpenCL
Apache License 2.0

Performance improvement for sgemm column major TN (transposeA = T, transposeB = N) case #54

Closed TimmyLiu closed 9 years ago

TimmyLiu commented 9 years ago

Even with the tuning tool, the current sgemm TN kernel still performs poorly compared to sgemm NN, sgemm NT and sgemm TT. This pull request proposes a wrapper from sgemm TN to sgemm NN that transposes A in a separate kernel, so that sgemm TN can benefit from the performance of sgemm NN.
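To illustrate the idea, here is a minimal CPU-side sketch in plain C, not the OpenCL kernels from this pull request. It assumes column-major storage, with A stored as a K x M matrix so that op(A) = A^T is M x K. All function names (transpose_out_of_place, sgemm_nn_ref, sgemm_tn_ref, sgemm_tn_via_nn) are hypothetical and are not part of the clBLAS API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Out-of-place transpose of a column-major rows x cols matrix `src`
 * into the column-major cols x rows matrix `dst`. */
void transpose_out_of_place(size_t rows, size_t cols,
                            const float *src, size_t ld_src,
                            float *dst, size_t ld_dst)
{
    assert(ld_src >= rows && ld_dst >= cols);
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            dst[c + r * ld_dst] = src[r + c * ld_src];
}

/* Reference NN product: C = alpha * A * B + beta * C, A is m x k. */
void sgemm_nn_ref(size_t m, size_t n, size_t k, float alpha,
                  const float *a, size_t lda, const float *b, size_t ldb,
                  float beta, float *c, size_t ldc)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += a[i + p * lda] * b[p + j * ldb];
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
}

/* Reference TN product: C = alpha * A^T * B + beta * C, A is stored k x m. */
void sgemm_tn_ref(size_t m, size_t n, size_t k, float alpha,
                  const float *a, size_t lda, const float *b, size_t ldb,
                  float beta, float *c, size_t ldc)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += a[p + i * lda] * b[p + j * ldb];
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
}

/* TN expressed as "transpose A into a scratch buffer, then run NN".
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int sgemm_tn_via_nn(size_t m, size_t n, size_t k, float alpha,
                    const float *a, size_t lda, const float *b, size_t ldb,
                    float beta, float *c, size_t ldc)
{
    size_t count = m * k;
    float *at = malloc((count ? count : 1) * sizeof *at); /* extra m x k copy */
    if (at == NULL)
        return -1;
    transpose_out_of_place(k, m, a, lda, at, m); /* at = A^T, ld = m */
    sgemm_nn_ref(m, n, k, alpha, at, m, b, ldb, beta, c, ldc);
    free(at);
    return 0;
}
```

In this sketch the scratch buffer plays the role of the extra OpenCL buffer, and the two steps inside sgemm_tn_via_nn correspond to the separate transposition kernel followed by the NN kernel.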

Note that because the transposition is implemented out of place, the wrapper creates an extra OpenCL buffer. This might be an issue for very large matrices.
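As a rough guide to that cost, the transposed copy of A holds M x K single-precision values, so it needs about M * K * 4 bytes on top of the original buffers: 64 MiB for M = K = 4096, and 4 GiB for M = K = 32768. A small sketch of the arithmetic follows; the helper name transposed_copy_bytes is hypothetical, and the estimate ignores any padding the actual implementation may add.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for an out-of-place transposed copy of a k x m float matrix.
 * Returns 0 when m or k is 0, or when the size does not fit in size_t. */
size_t transposed_copy_bytes(size_t m, size_t k)
{
    if (m != 0 && k > SIZE_MAX / sizeof(float) / m)
        return 0; /* m * k * sizeof(float) would overflow */
    return m * k * sizeof(float);
}
```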

To enable this wrapper, set the environment variable CLBLAS_FAST_SGEMM_TN=1. The code was only tested on "Spectre", "Tahiti" and "Hawaii" devices. Thus, at the moment, if the environment variable is not set or the device is anything other than "Spectre", "Tahiti" or "Hawaii", the "old" kernel without transposition is called.
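The selection rule described above can be sketched in C as follows. This is an illustration only: the function names are hypothetical, and the exact comparison against "1" and the exact device-name match are assumptions rather than the library's actual checks. In a real OpenCL program the device name would typically be obtained with clGetDeviceInfo and CL_DEVICE_NAME.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Decide whether the transposing wrapper should be used, given the value
 * of CLBLAS_FAST_SGEMM_TN (NULL if unset) and the device name.
 * Assumption: only the exact value "1" enables the wrapper. */
int fast_sgemm_tn_allowed(const char *env_value, const char *device_name)
{
    static const char *const tested[] = { "Spectre", "Tahiti", "Hawaii" };

    if (env_value == NULL || strcmp(env_value, "1") != 0)
        return 0; /* variable not set: fall back to the old TN kernel */
    if (device_name == NULL)
        return 0;
    for (size_t i = 0; i < sizeof tested / sizeof tested[0]; ++i)
        if (strcmp(device_name, tested[i]) == 0)
            return 1; /* tested device: use transpose + NN */
    return 0;         /* untested device: fall back to the old TN kernel */
}

/* Convenience wrapper that reads the environment variable itself. */
int use_fast_sgemm_tn(const char *device_name)
{
    return fast_sgemm_tn_allowed(getenv("CLBLAS_FAST_SGEMM_TN"), device_name);
}
```

Keeping the environment lookup separate from the decision makes the rule easy to check without modifying the process environment.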