yupatrick22 closed this issue 5 months ago
The loaded elements are simply stored consecutively in smem. The bottleneck of depthwise conv is mainly in DRAM and L2, so we did not apply padding or swizzling techniques.
If you are interested in how the smem-related operations are implemented, please refer to https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/conv/threadblock/depthwise_mma_core_with_lane_access_size.h#L847
From the source code, it looks like there are two different data reuse strategies, one for kOptimized and one for kFixedStrideDilation. For kOptimized, as the linked code shows, each thread first calculates the offset and then loads fragment A (of size tileP*tileQ) from smem, and it repeats this RS times.
For kFixedStrideDilation, by contrast, the input tile (i.e. all the activations needed to compute fragment C) is first loaded into the register file; fragment A is then filled by static loads from that input tile, with indices the compiler can resolve at compile time.
Why is kOptimized designed like this?
What would happen if kOptimized used the kFixedStrideDilation strategy? Would thread-local memory be used?
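To make the contrast between the two strategies concrete, here is a minimal host-side C++ sketch. It is not the CUTLASS implementation: the names `kTileP`, `kTileQ`, `FragmentA`, `gather_dynamic`, and `StaticTile` are hypothetical, a `std::vector` stands in for smem, and a `std::array` stands in for the register file.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-thread output tile; these are not CUTLASS names.
constexpr int kTileP = 2;
constexpr int kTileQ = 2;
using FragmentA = std::array<float, kTileP * kTileQ>;

// kOptimized-like: stride and dilation are runtime values, so for every filter
// tap (r, s) the offsets are computed and fragment A is reloaded from smem.
FragmentA gather_dynamic(const std::vector<float>& smem, int smem_w,
                         int r, int s, int stride, int dilation) {
  FragmentA frag{};
  for (int p = 0; p < kTileP; ++p) {
    for (int q = 0; q < kTileQ; ++q) {
      int h = p * stride + r * dilation;
      int w = q * stride + s * dilation;
      frag[p * kTileQ + q] = smem[static_cast<std::size_t>(h * smem_w + w)];
    }
  }
  return frag;
}

// kFixedStrideDilation-like: stride, dilation, and filter size are compile-time
// constants, so the whole input tile can be staged once and indexed statically.
template <int Stride, int Dilation, int R, int S>
struct StaticTile {
  static constexpr int kH = (kTileP - 1) * Stride + (R - 1) * Dilation + 1;
  static constexpr int kW = (kTileQ - 1) * Stride + (S - 1) * Dilation + 1;
  std::array<float, kH * kW> regs{};  // stands in for the register file

  // One pass over smem fills the whole input tile.
  void load(const std::vector<float>& smem, int smem_w) {
    assert(smem_w >= kW);
    for (int h = 0; h < kH; ++h)
      for (int w = 0; w < kW; ++w)
        regs[h * kW + w] = smem[static_cast<std::size_t>(h * smem_w + w)];
  }

  // Every index below is a compile-time constant once the loops are unrolled.
  template <int r, int s>
  FragmentA gather() const {
    FragmentA frag{};
    for (int p = 0; p < kTileP; ++p)
      for (int q = 0; q < kTileQ; ++q)
        frag[p * kTileQ + q] =
            regs[(p * Stride + r * Dilation) * kW + (q * Stride + s * Dilation)];
    return frag;
  }
};
```

For the same tap, `gather_dynamic(smem, w, r, s, stride, dilation)` and `StaticTile<...>::gather<r, s>()` return the same fragment. The difference is where the data lives between taps: the first re-reads smem RS times, the second keeps `kH * kW` values resident per thread.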
@Ethan-Yan27
Since sample code 46 uses the alpha-scaling-only epilogue, I think the kernel will write its output to tensor_d, but will it also write to tensor_c? @Ethan-Yan27
> What would happen if kOptimized used the kFixedStrideDilation strategy? Would thread-local memory be used?
Right. If we applied a similar strategy, the kernel would probably hit register spilling.
In general, for kFixedStrideDilation, the kernel persists some inputs in registers to squeeze out more performance, so large filter/stride/dilation values are not recommended.
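A rough way to see why large filter/stride/dilation values hurt is to count how many input elements a thread would have to keep resident. The sketch below is only an estimate under assumed tile sizes: `input_extent` and `input_tile_elems` are hypothetical helpers, and the 4x4 output tile is not taken from CUTLASS.

```cpp
// Input extent needed along one dimension to produce `tile` outputs
// with a `filter`-tap window at the given stride and dilation.
constexpr int input_extent(int tile, int filter, int stride, int dilation) {
  return (tile - 1) * stride + (filter - 1) * dilation + 1;
}

// Elements per channel that must stay resident for a tile_p x tile_q output tile.
constexpr int input_tile_elems(int tile_p, int tile_q, int r, int s,
                               int stride, int dilation) {
  return input_extent(tile_p, r, stride, dilation) *
         input_extent(tile_q, s, stride, dilation);
}

// 3x3 filter, stride 1, dilation 1: a 6x6 input tile.
static_assert(input_tile_elems(4, 4, 3, 3, 1, 1) == 36, "6 * 6");
// 5x5 filter, stride 2, dilation 2: a 15x15 input tile.
static_assert(input_tile_elems(4, 4, 5, 5, 2, 2) == 225, "15 * 15");
```

With the same 4x4 output tile, the second configuration needs more than six times as many resident values, which is the kind of growth that pushes a kernel toward register spilling.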
> Since sample code 46 uses the alpha-scaling-only epilogue, I think the kernel will write its output to tensor_d, but will it also write to tensor_c?
No, it would not write to tensor_c. Because the epilogue scale type is OnlyAlphaScaling, tensor_c is unused.
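The behaviour described in this answer can be sketched as a scalar reference epilogue. This is an illustration, not CUTLASS code: `ScaleKind` and `reference_epilogue` are hypothetical names, and the `Default` branch assumes the usual linear combination `D = alpha * accum + beta * C`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the epilogue scale type discussed above.
enum class ScaleKind { Default, OnlyAlphaScaling };

// C is taken by const reference: it is a source operand at most, never an output.
void reference_epilogue(ScaleKind kind, float alpha, float beta,
                        const std::vector<float>& accum,
                        const std::vector<float>& c,
                        std::vector<float>& d) {
  for (std::size_t i = 0; i < accum.size(); ++i) {
    float out = alpha * accum[i];
    if (kind == ScaleKind::Default) {
      out += beta * c[i];  // C is read only in the full linear combination
    }
    d[i] = out;  // D is the only tensor written
  }
}
```

With `ScaleKind::OnlyAlphaScaling` the loop never touches `c`, which matches the statement that tensor_c is unused.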
@yupatrick22 has your issue been resolved?
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
What is your question? Bank conflicts play an extremely important role in smem performance. How are they handled in depthwise conv? @Ethan-Yan27