Closed lxq2t closed 2 months ago
@Ethan-Yan27
@lxq2t.
https://github.com/NVIDIA/cutlass/blob/ffa34e70756b0bc744e1dfcc115b5a991a68f132/include/cutlass/conv/kernel/direct_convolution.h#L158
Please update this line locally as shown below for a quick fix of the smem size issue:
```cpp
smem_size_ = (max(iterator_A.activation_size, int(sizeof(typename Epilogue::SharedStorage))) * kStages + iterator_B.filter_size);
```
Initially, this kernel targeted fp16, so it is not surprising that you are hitting issues with int8 input.
To fully support int8 input, you need to make sure a few things work properly. Code contributions are welcome.

- the activation and filter iterators handle int8 data loading correctly;
- the warp-level MMA operator settings meet expectations: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/conv/threadblock/depthwise_mma_core_with_lane_access_size.h#L876-L895
- the per-thread SIMT computation order is correct: https://github.com/NVIDIA/cutlass/blob/ffa34e70756b0bc744e1dfcc115b5a991a68f132/include/cutlass/conv/thread/depthwise_mma.h
Thanks.
Also, here is a comment that explains the basic idea of the current depthwise implementation. Hope it helps: https://github.com/NVIDIA/cutlass/issues/1133#issuecomment-1756668121
@Ethan-Yan27 thank you, after applying the proposed fix the problem is resolved.
We have not encountered any other correctness issues with depthwise convolution using int8 input and an int32 accumulator.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
Describe the bug
In the example for working with depthwise convolution, the half type is used as both the data type and the accumulator; for our task we are trying to reuse the kernel with the int8 type.
When trying to run the example with the half type replaced by signed char and an int accumulator, an illegal memory access occurs at this line:
https://github.com/NVIDIA/cutlass/blob/44c704eae85da352d277d6f092f41412772f70e4/include/cutlass/epilogue/warp/tile_iterator_simt.h#L521
Is it necessary to update any additional parameters when using int8_t with an int32_t accumulator, other than the input/accumulator/epilogue types?
Steps/Code to reproduce bug
Modified code from "46_depthwise_simt_conv2dfprop" example:
Observed output
cuda-gdb output for code compiled with "-G":
Expected behavior
Successful launch of the example, with runtime and FLOPS output comparable to the initial "46_depthwise_simt_conv2dfprop" example.
Environment details (please complete the following information):
Reproduced at:
- CUDA 11.4, driver 470.103.01
- CUDA 11.8, driver 550.54.14
- CUDA 11.4, driver 470.103.01
Additional context
Convolution problem size options:
- activation: [1,96,160,160]
- filter: [96,1,3,3]
- stride: [1,1]
- padding: [1,1]
- dilation: [1,1]
A possible cause is an insufficient shared memory size calculated in the kernel parameters; if we change the smem size to a larger block (for example 32 KB), the kernel runs successfully.
https://github.com/NVIDIA/cutlass/blob/ffa34e70756b0bc744e1dfcc115b5a991a68f132/include/cutlass/conv/device/direct_convolution.h#L228