TiledTensor / TiledCUDA

TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
MIT License

fix(cell): Reduce bank conflicts when accessing shared memory tiles with float data type #145

Status: Closed (haruhi55 closed this 1 month ago)

haruhi55 commented 1 month ago
  1. This pull request reduces bank conflicts in the register-to-shared and shared-to-global storers. In the current implementation, storing a single BaseTile to shared memory causes 8 bank conflicts.
  2. A thorough fix that makes the store path entirely free of bank conflicts requires more careful consideration. I would like to merge this fix first and then benchmark the end-to-end performance for GEMM.
  3. The loaders are bank-conflict free.

The data distribution for a single BaseTile according to the swizzle function used in this fix is as follows:

[Figure: data_distribution]
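
For readers unfamiliar with swizzling, the general idea can be illustrated with a small host-side sketch. This is not the swizzle function used in this fix, which is not reproduced here. It assumes a 32x32 row-major float tile, 32 four-byte shared memory banks, and a hypothetical XOR swizzle; the names `linear_offset`, `swizzled_offset`, and `max_way_conflict` are made up for the illustration.

```cpp
#include <algorithm>
#include <array>

// Assumption: 32 banks, each 4 bytes wide, so one float occupies one bank slot.
constexpr int kBanks = 32;
constexpr int kWarpSize = 32;
// Hypothetical tile shape for the illustration (not the BaseTile shape).
constexpr int kCols = 32;

// Word offset of element (row, col) in a plain row-major float tile.
constexpr int linear_offset(int row, int col) { return row * kCols + col; }

// Hypothetical XOR swizzle: within each row, the column is permuted by the
// row index, so the same logical column lands in a different bank per row.
constexpr int swizzled_offset(int row, int col) {
    return row * kCols + (col ^ (row % kCols));
}

// Simulates one warp in which thread t accesses element (t, col), i.e. a
// column access. Returns the largest number of threads that map to the same
// bank: 1 means conflict-free, 32 means fully serialized.
template <typename OffsetFn>
int max_way_conflict(OffsetFn offset, int col) {
    std::array<int, kBanks> hits{};
    for (int t = 0; t < kWarpSize; ++t) {
        ++hits[offset(t, col) % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, `max_way_conflict(linear_offset, 0)` evaluates to 32 because every row of the column maps to the same bank, while `max_way_conflict(swizzled_offset, 0)` evaluates to 1 because XOR with the row index is a bijection on the 32 columns. The layout in this fix differs, which is consistent with the residual conflicts reported below.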

haruhi55 commented 1 month ago

Tested on an A100 with the following test case:

test_row_major_store<float, tl::RowMajor<2, 2>, 64, 64, kSwizzled>();

The bank conflicts are reduced from:

------------------------------------------------------------- ----------- ------------
Metric Name                                                   Metric Unit Metric Value
------------------------------------------------------------- ----------- ------------
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum                                 768
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_atom.sum                           0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum                           384
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ldgsts.sum                         0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum                           384
------------------------------------------------------------- ----------- ------------

to:

------------------------------------------------------------- ----------- ------------
Metric Name                                                   Metric Unit Metric Value
------------------------------------------------------------- ----------- ------------
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum                                 256
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_atom.sum                           0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum                           128
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ldgsts.sum                         0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum                           128
------------------------------------------------------------- ----------- ------------

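Bank conflicts are not eliminated, however: the total drops from 768 to 256 (loads and stores each drop from 384 to 128), a 3x reduction.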