NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines
Other
4.97k stars 850 forks

[QST] how bank conflict in shared memory is fixed in depthwise conv #1256

Closed yupatrick22 closed 5 months ago

yupatrick22 commented 7 months ago

What is your question? Bank conflicts play an extremely important role in smem performance. How are they handled in the depthwise conv implementation? @Ethan-Yan27

Ethan-Yan27 commented 7 months ago

The loaded elements are simply stored consecutively in smem. The bottleneck of depthwise conv is mainly in DRAM and L2, so we did not apply padding or swizzling techniques.
If you are interested in how the smem-related operations are implemented, please refer to https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/conv/threadblock/depthwise_mma_core_with_lane_access_size.h#L847

https://github.com/NVIDIA/cutlass/blob/f4a021660162510572f90ea715b018cff9c0f12f/include/cutlass/transform/threadblock/regular_tile_access_iterator_pitch_linear_direct_conv.h
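For background on why consecutive storage can already be conflict-free, here is a minimal Python sketch (not CUTLASS code) modeling the standard NVIDIA shared-memory layout of 32 banks of 4-byte words. It shows that a warp reading consecutive words hits every bank once, that a strided column access of a row-major tile serializes on one bank, and that padding the row stride restores conflict-free access:

```python
# Model NVIDIA shared-memory banking: 32 banks, 4-byte words,
# bank(word) = word_index % 32.
NUM_BANKS = 32

def max_conflict(word_indices):
    """Worst-case number of threads in a warp hitting the same bank."""
    counts = {}
    for w in word_indices:
        b = w % NUM_BANKS
        counts[b] = counts.get(b, 0) + 1
    return max(counts.values())

# Case 1: 32 threads read consecutive 4-byte elements -> one thread per bank.
consecutive = [t for t in range(32)]
print(max_conflict(consecutive))   # 1 (conflict-free)

# Case 2: column access of a 32x32 float tile stored row-major:
# thread t reads element [t][0] at word index t * 32 -> all hit bank 0.
column = [t * 32 for t in range(32)]
print(max_conflict(column))        # 32 (32-way conflict)

# Case 3: pad each row by one word (stride 33) -> banks spread out again.
padded = [t * 33 for t in range(32)]
print(max_conflict(padded))        # 1 (conflict-free)
```

As the reply above notes, when the access pattern is already consecutive and the kernel is DRAM/L2-bound, the padded-stride trick buys nothing, which is why it was not applied here.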

yupatrick22 commented 7 months ago
(attached screenshot: "Untitled")

From the source code, it looks like there are two different data-reuse strategies, one for kOptimized and one for kFixedStrideDilation. For kOptimized, as the code above shows, each thread first calculates an offset and then loads fragment A (of size tileP*tileQ) from smem, repeating this RS times.

For kFixedStrideDilation, by contrast, the input tile (i.e., all the activations needed to compute fragment C) is first loaded into the register file, and then static loads (which the compiler can resolve at compile time) are performed from that input tile into fragment A.

Why is kOptimized designed this way?

What would happen if kOptimized used the strategy of kFixedStrideDilation? Would thread-local memory be used?

@Ethan-Yan27

yupatrick22 commented 7 months ago

Since sample 46's epilogue does only alpha scaling, I think the kernel will output data to tensor_d, but will it also write to tensor_c? @Ethan-Yan27

Ethan-Yan27 commented 7 months ago

What would happen if kOptimized used the strategy of kFixedStrideDilation? Would thread-local memory be used?

Right. If we applied a similar strategy, the kernel would probably hit a register-spilling issue.

In general, for kFixedStrideDilation, the kernel keeps some inputs persistent in registers to squeeze out more performance, so large filter sizes, strides, or dilations are not recommended.
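A rough Python sketch of the arithmetic behind that recommendation (illustrative only, not CUTLASS's actual resource model): the input footprint a thread must keep resident grows as `(out - 1) * stride + (filter - 1) * dilation + 1` per spatial dimension, so register demand grows quickly with filter size, stride, and dilation, and CUDA caps a thread at 255 32-bit registers before spilling begins:

```python
def input_tile_elems(out_p, out_q, r, s, stride, dilation):
    """Input activations a thread must hold to produce an out_p x out_q
    output tile (standard direct-convolution footprint formula)."""
    in_p = (out_p - 1) * stride + (r - 1) * dilation + 1
    in_q = (out_q - 1) * stride + (s - 1) * dilation + 1
    return in_p * in_q

# Hypothetical 4x4 output tile per thread, fp32 (one register per element):
print(input_tile_elems(4, 4, 3, 3, 1, 1))   # 36  -- fits comfortably
print(input_tile_elems(4, 4, 7, 7, 2, 2))   # 361 -- past any 255-register
                                            #        budget, so it spills
```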

Since sample 46's epilogue does only alpha scaling, I think the kernel will output data to tensor_d, but will it also write to tensor_c?

No, it would not write to tensor_c. Because the epilogue scale operation is OnlyAlphaScaling, tensor_c is unused.
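A minimal Python sketch of the distinction (schematic, not the CUTLASS epilogue code): the general linear-combination epilogue computes D = alpha * accum + beta * C, and alpha-only scaling is the beta == 0 special case, in which the C source is never read:

```python
def epilogue_linear_combination(accum, alpha, beta=0.0, source_c=None):
    """Schematic linear-combination epilogue: D = alpha * accum + beta * C.
    With alpha-only scaling (beta == 0), tensor_c is never touched."""
    if beta == 0.0:
        return [alpha * a for a in accum]               # tensor_c unused
    return [alpha * a + beta * c for a, c in zip(accum, source_c)]

print(epilogue_linear_combination([1.0, 2.0], alpha=0.5))  # [0.5, 1.0]
```

Note that either way the epilogue only reads from C; the output always goes to tensor_d.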

mnicely commented 6 months ago

@yupatrick22 has your issue been resolved?

github-actions[bot] commented 5 months ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.