CNugteren / CLBlast

Tuned OpenCL BLAS
Apache License 2.0

GEMM performance degradation for small M and large N & K #520

Open abcdrm opened 10 months ago

abcdrm commented 10 months ago

Hi, I am running CLBlast Hgemm and Hgemv on a Qualcomm Adreno 730 GPU with matrix shape M = 8, N = 3200, K = 3200. Hgemm takes 8.05401 ms to finish (average over 1000 runs), which is much slower than running Hgemv 8 times in a for loop (0.500931 ms per call, about 4 ms in total). Here is the code calling GEMM and GEMV:

CLBlastHgemm(CLBlastLayout::CLBlastLayoutRowMajor,
             CLBlastTranspose::CLBlastTransposeNo, CLBlastTranspose::CLBlastTransposeNo,
             M, N, K, alpha,
             A_mat(), 0, K,
             B_mat(), 0, N,
             beta,
             C_mat(), 0, N,
             &command_queue(), nullptr);

CLBlastHgemv(CLBlastLayout::CLBlastLayoutRowMajor, CLBlastTranspose::CLBlastTransposeYes,
             K, N, alpha,
             B_mat(), b_offset * i, N,
             A_mat(), 0, 1,
             beta,
             C_mat(), 0, 1,
             &command_queue(), nullptr);
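For completeness, here is a minimal sketch of the row-major "GEMM as M GEMV calls" decomposition I am comparing against. The offsets below assume contiguous row-major buffers (row i of C equals B^T times row i of A), so they may differ slightly from the exact offsets in my real loop; alpha, beta and the A_mat()/B_mat()/C_mat()/command_queue() accessors are the same ones used in the calls above.

```cpp
#include <clblast_c.h>

// Sketch: compute the row-major GEMM C(MxN) = alpha * A(MxK) * B(KxN) + beta * C
// as M GEMV calls, one per row of C. With row-major storage, row i of C is
// B^T * (row i of A). Offsets assume contiguous buffers and are illustrative.
const size_t M = 8, N = 3200, K = 3200;
for (size_t i = 0; i < M; ++i) {
  CLBlastHgemv(CLBlastLayout::CLBlastLayoutRowMajor,
               CLBlastTranspose::CLBlastTransposeYes,
               K, N,                      // dimensions of B before the transpose
               alpha,
               B_mat(), 0, N,             // matrix operand: B, leading dimension N
               A_mat(), i * K, 1,         // x vector: row i of A (length K)
               beta,
               C_mat(), i * N, 1,         // y vector: row i of C (length N)
               &command_queue(), nullptr);
}
```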

CNugteren commented 10 months ago

Thank you for reporting this. Yes, for shapes like this (very small M combined with large N and K) it can indeed happen that running GEMV multiple times is faster than running GEMM once. I'm not sure there is much we can do here, since the optimization space is vast.

You could try running the tuner specifically for these matrix sizes, but most likely it still won't be faster than your current GEMV-based solution.
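If you do want to try it, the rough workflow would be something along these lines (the tuner flags and the parameter names/values are only illustrative; take the exact kernel name and values from what the tuner prints for your device, and note that a GEMM with such a small M may be served by the direct kernel, XgemmDirect, rather than Xgemm):

```cpp
#include <clblast.h>
#include <string>
#include <unordered_map>

// Step 1: build CLBlast with -DTUNERS=ON and run the GEMM tuner for your
// exact sizes, e.g. something like:
//   ./clblast_tuner_xgemm -precision 16 -m 8 -n 3200 -k 3200
//
// Step 2: feed the best configuration back at runtime with OverrideParameters.
// The parameter values below are placeholders only; use the complete set the
// tuner reports (an incomplete or invalid set will make the call return an error).
void apply_tuned_gemm_parameters(cl_device_id device) {
  const std::unordered_map<std::string, size_t> parameters = {
      {"KWG", 16},  {"KWI", 2},   {"MDIMA", 4}, {"MDIMC", 4}, {"MWG", 16},
      {"NDIMB", 4}, {"NDIMC", 4}, {"NWG", 16},  {"SA", 1},    {"SB", 1},
      {"STRM", 1},  {"STRN", 1},  {"VWM", 1},   {"VWN", 1}};
  const auto status = clblast::OverrideParameters(device, "Xgemm",
                                                  clblast::Precision::kHalf,
                                                  parameters);
  // status != clblast::StatusCode::kSuccess means the parameters were rejected.
}
```

Alternatively, the tuner's JSON output can be merged into the built-in database with the scripts under scripts/database/ and CLBlast recompiled, so no runtime call is needed.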