CNugteren / CLBlast

Tuned OpenCL BLAS
Apache License 2.0

GEMM performance degradation for small M and large N & K #520

Open abcdrm opened 10 months ago

abcdrm commented 10 months ago

Hi, I am running CLBlast Hgemm and Hgemv on a Qualcomm Adreno 730 GPU with matrix shape M = 8, N = 3200, K = 3200. Hgemm takes 8.05401 ms to finish (average over 1000 runs), which is much slower than running Hgemv 8 times in a for loop (0.500931 ms per call, about 4 ms in total). Here is the code calling GEMM and GEMV:

CLBlastHgemm(CLBlastLayout::CLBlastLayoutRowMajor,
             CLBlastTranspose::CLBlastTransposeNo, CLBlastTranspose::CLBlastTransposeNo,
             M, N, K, alpha,
             A_mat(), 0, K,
             B_mat(), 0, N,
             beta,
             C_mat(), 0, N,
             &command_queue(), nullptr);

CLBlastHgemv(CLBlastLayout::CLBlastLayoutRowMajor, CLBlastTranspose::CLBlastTransposeYes,
             K, N, alpha,
             B_mat(), b_offset * i, N,
             A_mat(), 0, 1,
             beta,
             C_mat(), 0, 1,
             &command_queue(), nullptr);
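For completeness, here is a minimal sketch of the row-major "GEMM as M GEMV calls" decomposition I am comparing against. The offsets below assume contiguous row-major buffers (row i of C equals B^T times row i of A), so they may differ slightly from the exact offsets in my real loop; alpha, beta and the A_mat()/B_mat()/C_mat()/command_queue() accessors are the same ones used in the calls above.

```cpp
#include <clblast_c.h>

// Sketch: compute the row-major GEMM C(MxN) = alpha * A(MxK) * B(KxN) + beta * C
// as M GEMV calls, one per row of C. With row-major storage, row i of C is
// B^T * (row i of A). Offsets assume contiguous buffers and are illustrative.
const size_t M = 8, N = 3200, K = 3200;
for (size_t i = 0; i < M; ++i) {
  CLBlastHgemv(CLBlastLayout::CLBlastLayoutRowMajor,
               CLBlastTranspose::CLBlastTransposeYes,
               K, N,                      // dimensions of B before the transpose
               alpha,
               B_mat(), 0, N,             // matrix operand: B, leading dimension N
               A_mat(), i * K, 1,         // x vector: row i of A (length K)
               beta,
               C_mat(), i * N, 1,         // y vector: row i of C (length N)
               &command_queue(), nullptr);
}
```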

CNugteren commented 10 months ago

Thank you for reporting this. Yes, for shapes like this (very small M combined with large N and K) it can indeed happen that running GEMV multiple times is faster than running GEMM once. I'm not sure there is much we can do here, since the optimization space is vast.

You could try running the tuner specifically for these matrix sizes, but most likely it still won't be faster than your current GEMV-based solution.
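If you do want to try it, the rough workflow would be something along these lines (the tuner flags and the parameter names/values are only illustrative; take the exact kernel name and values from what the tuner prints for your device, and note that a GEMM with such a small M may be served by the direct kernel, XgemmDirect, rather than Xgemm):

```cpp
#include <clblast.h>
#include <string>
#include <unordered_map>

// Step 1: build CLBlast with -DTUNERS=ON and run the GEMM tuner for your
// exact sizes, e.g. something like:
//   ./clblast_tuner_xgemm -precision 16 -m 8 -n 3200 -k 3200
//
// Step 2: feed the best configuration back at runtime with OverrideParameters.
// The parameter values below are placeholders only; use the complete set the
// tuner reports (an incomplete or invalid set will make the call return an error).
void apply_tuned_gemm_parameters(cl_device_id device) {
  const std::unordered_map<std::string, size_t> parameters = {
      {"KWG", 16},  {"KWI", 2},   {"MDIMA", 4}, {"MDIMC", 4}, {"MWG", 16},
      {"NDIMB", 4}, {"NDIMC", 4}, {"NWG", 16},  {"SA", 1},    {"SB", 1},
      {"STRM", 1},  {"STRN", 1},  {"VWM", 1},   {"VWN", 1}};
  const auto status = clblast::OverrideParameters(device, "Xgemm",
                                                  clblast::Precision::kHalf,
                                                  parameters);
  // status != clblast::StatusCode::kSuccess means the parameters were rejected.
}
```

Alternatively, the tuner's JSON output can be merged into the built-in database with the scripts under scripts/database/ and CLBlast recompiled, so no runtime call is needed.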