Open abcdrm opened 10 months ago
Thank you for reporting this. Yes, it can happen that for unusual shapes, running GEMV multiple times is faster than running GEMM once. I'm not sure there is much we can do here, since the optimization space is vast.
You could try running the tuner specifically for these matrix sizes, but most likely it won't be faster than your current GEMV solution.
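For reference: assuming your CLBlast build has the tuners enabled (-DTUNERS=ON at CMake time), the GEMM tuner can be pointed at these exact sizes with something like ./clblast_tuner_xgemm -precision 16 -m 8 -n 3200 -k 3200, although the exact flag names can differ between versions.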
Hi, I ran CLBlast Hgemm and Hgemv on a Qualcomm Adreno 730 GPU with matrix shape M = 8, N = 3200, K = 3200. Hgemm took 8.05401 ms to finish (average over 1000 runs), which is much slower than running Hgemv 8 times in a for loop (0.500931 ms per call, about 4 ms in total). Here is the code calling gemm and gemv:
// Single Hgemm call computing C (M x N) = A (M x K) * B (K x N), all buffers row-major:
CLBlastHgemm(CLBlastLayout::CLBlastLayoutRowMajor, CLBlastTranspose::CLBlastTransposeNo, CLBlastTranspose::CLBlastTransposeNo, M, N, K, alpha, A_mat(), 0, K, B_mat(), 0, N, beta, C_mat(), 0, N, &command_queue(), nullptr);
// Hgemv as called M times inside a for loop over i:
CLBlastHgemv(CLBlastLayout::CLBlastLayoutRowMajor, CLBlastTranspose::CLBlastTransposeYes, K, N, alpha, B_mat(), b_offset * i, N, A_mat(), 0, 1, beta, C_mat(), 0, 1, &command_queue(), nullptr);
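For context, here is a minimal sketch of the GEMV-per-row approach; the helper name and the i * K / i * N offsets are illustrative assumptions for contiguous row-major buffers, not the exact offsets used above:

#include <clblast_c.h>

// Sketch: compute C (M x N) = alpha * A (M x K) * B (K x N) + beta * C, all row-major
// half-precision cl_mem buffers, using one Hgemv per row of A instead of a single Hgemm.
static void GemvPerRow(size_t M, size_t N, size_t K, cl_half alpha, cl_half beta,
                       cl_mem A, cl_mem B, cl_mem C, cl_command_queue queue) {
  for (size_t i = 0; i < M; ++i) {
    // Row i of C: y = B^T * a_i, where a_i is row i of A (length K) and y has length N.
    CLBlastHgemv(CLBlastLayoutRowMajor, CLBlastTransposeYes,
                 K, N,            // dimensions of B before the transpose
                 alpha,
                 B, 0, N,         // B: K x N with leading dimension N
                 A, i * K, 1,     // x: row i of A, unit stride
                 beta,
                 C, i * N, 1,     // y: row i of C, unit stride
                 &queue, nullptr);
  }
}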