ROCm / rocBLAS

Next generation BLAS implementation for ROCm platform
https://rocm.docs.amd.com/projects/rocBLAS/en/latest/

add navi31 F32 logic yaml #1353

Closed TonyYHsieh closed 1 year ago

TonyYHsieh commented 1 year ago

resolves #___

Summary of proposed changes:

xinyazhang commented 1 year ago

This patch does not fix the performance issue in the following tests. Here are the rocblas-bench results:

function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,beta,ldb,ldc,rocblas-Gflops,us
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,N,N,1024,4096,1024,1,1024,0,1024,1024, 349.064, 24608.5
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,N,N,1024,4096,30522,1,1024,0,30522,1024, 370.242, 691540
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,N,N,1024,4096,4096,1,1024,0,4096,1024, 355.227, 96726.2
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,N,N,4096,4096,1024,1,4096,0,1024,4096, 374.59, 91726.3
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,N,T,1024,1024,4096,1,1024,0,1024,1024, 241.365, 35589
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,N,T,1024,30522,4096,1,1024,0,30522,1024, 383.288, 668003
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,N,T,1024,4096,4096,1,1024,0,4096,1024, 365.729, 93948.7
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,N,T,4096,1024,4096,1,4096,0,1024,4096, 368.739, 93181.7
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,T,N,1024,4096,1024,1,1024,1,1024,1024, 291.11, 29507.5
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,T,N,1024,4096,4096,1,4096,1,4096,1024, 324.671, 105829
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,T,N,30522,4096,1024,1,1024,1,1024,30522, 369.845, 692282
gemm,f32_r,f32_r,f32_r,f32_r,f32_r,T,N,4096,4096,1024,1,1024,1,1024,4096, 374.495, 91749.6

function,a_type,b_type,c_type,d_type,compute_type,transA,transB,M,N,K,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,batch_count,rocblas-Gflops,us
gemm_strided_batched,f32_r,f32_r,f32_r,f32_r,f32_r,N,N,512,512,64,1,512,32768,0,64,32768,512,262144,128, 376.861, 11396.7
gemm_strided_batched,f32_r,f32_r,f32_r,f32_r,f32_r,N,N,64,512,512,1,64,32768,0,512,262144,64,32768,128, 298.106, 14407.5
gemm_strided_batched,f32_r,f32_r,f32_r,f32_r,f32_r,N,T,512,64,512,1,512,262144,0,64,32768,512,32768,128, 338.64, 12683
gemm_strided_batched,f32_r,f32_r,f32_r,f32_r,f32_r,N,T,64,512,512,1,64,32768,0,512,262144,64,32768,128, 313.918, 13681.8
gemm_strided_batched,f32_r,f32_r,f32_r,f32_r,f32_r,T,N,512,512,64,1,64,32768,0,64,32768,512,262144,128, 366.656, 11713.9
gemm_strided_batched,f32_r,f32_r,f32_r,f32_r,f32_r,T,N,64,512,512,1,512,32768,0,512,262144,64,32768,128, 348.782, 12314.2
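
As a sanity check on the numbers above, the rocblas-Gflops column can be recomputed from M, N, K, batch_count and the time in microseconds. The sketch below assumes the usual real-valued GEMM operation count of 2*M*N*K per matrix product; the helper name gemm_gflops is hypothetical and is not part of rocblas-bench.

```python
def gemm_gflops(m, n, k, time_us, batch_count=1):
    """Recompute GFLOP/s for a real GEMM from its sizes and time in microseconds.

    Assumes 2*m*n*k floating-point operations per matrix product
    (one multiply and one add per inner-product term).
    """
    flops = 2.0 * m * n * k * batch_count
    # flops per microsecond equals MFLOP/s; divide by 1e3 to get GFLOP/s.
    return flops / time_us / 1e3


# Two rows taken from the results in this comment.
# Reported values are 349.064 and 376.861 respectively.
print(round(gemm_gflops(1024, 4096, 1024, 24608.5), 3))
print(round(gemm_gflops(512, 512, 64, 11396.7, batch_count=128), 3))
```

Both rows agree with the reported column to within the rounding of the printed time, so the reported Gflops and us columns are consistent with each other.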

To reproduce the problem, run the attached script profile-rocblas-dedup.sh.gz, then run awk '/function/{c=2};c&&c--' against its output.
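
For readers unfamiliar with the awk idiom: it sets a counter to 2 on every line containing "function" and prints lines while the counter is non-zero, so each CSV header line is kept together with the single result line that follows it. A rough Python equivalent is sketched below; the function name header_and_next is hypothetical, and a plain substring test stands in for the awk regular expression.

```python
def header_and_next(lines, marker="function", count=2):
    """Keep each line containing `marker` plus the lines right after it.

    Mirrors awk '/function/{c=2};c&&c--': `count` lines are kept in total,
    starting at the matching line itself.
    """
    remaining = 0
    kept = []
    for line in lines:
        if marker in line:
            remaining = count  # restart the window at every header line
        if remaining > 0:
            kept.append(line)
            remaining -= 1
    return kept


if __name__ == "__main__":
    import sys

    # Usage: python filter_bench.py < rocblas-bench-output.txt
    for kept_line in header_and_next(sys.stdin.read().splitlines()):
        print(kept_line)
```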

TonyYHsieh commented 1 year ago

Hi @xinyazhang,

We ran rocBLAS on Navi31 with this patch, and the result is shown in the image below. The performance is much better than in your results. Are you really using Navi31? By the way, Navi32 and Navi33 do not get this improvement.

[image]

TonyYHsieh commented 1 year ago

Closing this PR and moving it to an internal PR.