Closed TonyYHsieh closed 1 year ago
This patch does not fix the performance issue in the following tests. Here are the rocblas-bench
results:
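function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,beta,ldb,ldc,rocblas-Gflops,us
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,N,N,1024,4096,1024,1,1024,0,1024,1024, 349.064, 24608.5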
function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,beta,ldb,ldc,rocblas-Gflops,us
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,N,N,1024,4096,30522,1,1024,0,30522,1024, 370.242, 691540
function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,beta,ldb,ldc,rocblas-Gflops,us
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,N,N,1024,4096,4096,1,1024,0,4096,1024, 355.227, 96726.2
function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,beta,ldb,ldc,rocblas-Gflops,us
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,N,N,4096,4096,1024,1,4096,0,1024,4096, 374.59, 91726.3
function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,beta,ldb,ldc,rocblas-Gflops,us
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,N,T,1024,1024,4096,1,1024,0,1024,1024, 241.365, 35589
function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,beta,ldb,ldc,rocblas-Gflops,us
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,N,T,1024,30522,4096,1,1024,0,30522,1024, 383.288, 668003
function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,beta,ldb,ldc,rocblas-Gflops,us
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,N,T,1024,4096,4096,1,1024,0,4096,1024, 365.729, 93948.7
function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,beta,ldb,ldc,rocblas-Gflops,us
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,N,T,4096,1024,4096,1,4096,0,1024,4096, 368.739, 93181.7
function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,beta,ldb,ldc,rocblas-Gflops,us
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,T,N,1024,4096,1024,1,1024,1,1024,1024, 291.11, 29507.5
function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,beta,ldb,ldc,rocblas-Gflops,us
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,T,N,1024,4096,4096,1,4096,1,4096,1024, 324.671, 105829
function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,beta,ldb,ldc,rocblas-Gflops,us
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,T,N,30522,4096,1024,1,1024,1,1024,30522, 369.845, 692282
function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,beta,ldb,ldc,rocblas-Gflops,us
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,T,N,4096,4096,1024,1,1024,1,1024,4096, 374.495, 91749.6
function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,batch_count,rocblas-Gflops,us
gemm_strided_batched,f32_r,f32_r,f32_r,f32_r,f32_r,N,N,512,512,64,1,512,32768,0,64,32768,512,262144,128, 376.861, 11396.7
function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,batch_count,rocblas-Gflops,us
gemm_strided_batched,f32_r,f32_r,f32_r,f32_r,f32_r,N,N,64,512,512,1,64,32768,0,512,262144,64,32768,128, 298.106, 14407.5
function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,batch_count,rocblas-Gflops,us
gemm_strided_batched,f32_r,f32_r,f32_r,f32_r,f32_r,N,T,512,64,512,1,512,262144,0,64,32768,512,32768,128, 338.64, 12683
function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,batch_count,rocblas-Gflops,us
gemm_strided_batched,f32_r,f32_r,f32_r,f32_r,f32_r,N,T,64,512,512,1,64,32768,0,512,262144,64,32768,128, 313.918, 13681.8
function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,batch_count,rocblas-Gflops,us
gemm_strided_batched,f32_r,f32_r,f32_r,f32_r,f32_r,T,N,512,512,64,1,64,32768,0,64,32768,512,262144,128, 366.656, 11713.9
function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,batch_count,rocblas-Gflops,us
gemm_strided_batched,f32_r,f32_r,f32_r,f32_r,f32_r,T,N,64,512,512,1,512,32768,0,512,262144,64,32768,128, 348.782, 12314.2
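As a sanity check on the figures above, the rocblas-Gflops column appears to be consistent with 2*M*N*K*batch_count divided by the elapsed time in the us column. The sketch below recomputes it for the first gemm row and the first gemm_strided_batched row. The helper name gflops is hypothetical, and the flop-count convention is an assumption inferred from the numbers, not taken from rocblas-bench itself.

```shell
# Hypothetical helper: recompute GFLOP/s from the problem size and the
# measured time, assuming 2*M*N*K floating-point operations per GEMM.
# Arguments: M N K batch_count time_in_microseconds
gflops() {
  awk -v m="$1" -v n="$2" -v k="$3" -v b="$4" -v us="$5" \
    'BEGIN { printf "%.3f\n", 2 * m * n * k * b / (us * 1000) }'
}

# First gemm row: M=1024, N=4096, K=1024, us=24608.5
gflops 1024 4096 1024 1 24608.5

# First gemm_strided_batched row: M=512, N=512, K=64, batch_count=128, us=11396.7
gflops 512 512 64 128 11396.7
```

If the assumption holds, both calls should reproduce the values reported in the rocblas-Gflops column for those rows, which makes it easy to compare runs by time alone.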
To reproduce the problem, run the attached profile-rocblas-dedup.sh.gz (after decompressing it),
then run awk '/function/{c=2};c&&c--' against its output.
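For anyone unfamiliar with the awk one-liner: it keeps each line containing "function" (the rocblas-bench header) plus the single line that follows it, and drops everything else. A minimal sketch of its behavior is below. The function names filter_bench and sample_output and the two "log line" entries are made up for illustration; only the header and result line are copied from the results above.

```shell
# The filter from the comment above, wrapped in a function.
filter_bench() {
  # A line matching "function" arms a counter (c=2). The second pattern,
  # c&&c--, is true while c is nonzero and decrements it after each match,
  # so the header line and the one line after it are printed.
  awk '/function/{c=2};c&&c--'
}

# Hypothetical stand-in for the profiling script's output.
sample_output() {
  printf '%s\n' \
    'unrelated log line (invented)' \
    'function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,beta,ldb,ldc,rocblas-Gflops,us' \
    'gemm,f32_r,f32_r,f32_r,f32_r,f32_r,N,N,1024,4096,1024,1,1024,0,1024,1024, 349.064, 24608.5' \
    'another unrelated log line (invented)'
}

sample_output | filter_bench
```

Only the header and the gemm result line survive the filter, which is why the results come out as header/row pairs.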
Hi @xinyazhang,
We ran rocBLAS on Navi31 with the patch. Here is the result. Performance is far better than in your results. Are you really using Navi31? By the way, Navi32 and Navi33 do not get this improvement.
Closing this and moving it to an internal PR.