NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[BUG] illegal memory access for depthwise convolution with int8 type #1402

Closed. lxq2t closed this issue 2 months ago.

lxq2t commented 6 months ago

Describe the bug

The depthwise convolution example uses the half type for both the data and the accumulator; for our task we are trying to reuse the kernel with the int8 type.

When we run the example with the half type replaced by signed char and an int accumulator, an illegal memory access occurs at this line:

https://github.com/NVIDIA/cutlass/blob/44c704eae85da352d277d6f092f41412772f70e4/include/cutlass/epilogue/warp/tile_iterator_simt.h#L521

Is it necessary to update any additional parameters when using int8_t with an int32_t accumulator, beyond the input/accumulator/epilogue types?

Steps/Code to reproduce bug

Modified code from the "46_depthwise_simt_conv2dfprop" example:

// The code section below describes datatype for input, output tensors and computation between
// elements
using ElementAccumulator = int32_t;      // < modified
using ElementComputeEpilogue = float;  // Data type of epilogue computation (alpha, beta)
using ElementInputA = int8_t;           // < modified
using ElementInputB = int8_t;           // < modified
using ElementOutput = int8_t;           // < modified

using LayoutInputA = cutlass::layout::TensorNHWC;
using LayoutInputB = cutlass::layout::TensorNHWC;
using LayoutOutput = cutlass::layout::TensorNHWC;

// This code section describes whether you want to use tensor cores or regular SIMT cores on GPU SM
using MMAOp = cutlass::arch::OpClassSimt;

// This code section describes CUDA SM architecture number
using SmArch = cutlass::arch::Sm60;

// This code section describes the groups a thread block will compute
constexpr int groups_per_cta = 32; //< modified

// This code section describes the output tile <N, O, P, Q> a thread block will compute
using ThreadBlockOutputShape = cutlass::conv::TensorNHWCShape<1, 8, 8, groups_per_cta>;

// This code section describes the filter shape <R, S>
using FilterShape = cutlass::MatrixShape<3, 3>;

// Threadblock tile shape
using ThreadblockShape =
    cutlass::gemm::GemmShape<ThreadBlockOutputShape::kNHW, groups_per_cta, FilterShape::kCount>;
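// With the modified shapes above this evaluates to GemmShape<64, 32, 9>
// (M = kNHW = 1 * 8 * 8, N = groups_per_cta, K = FilterShape::kCount = 3 * 3).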

// This code section describes the tile size a warp will compute
// WarpShape::kM = P * Q the warps would process
// WarpShape::kN = groups_per_cta that the warps would process
// WarpShape::kK = filter_size that the warps would process
using WarpShape = cutlass::gemm::GemmShape<16, groups_per_cta, FilterShape::kCount>;

// This code section describes the size of MMA op
using InstructionShape = cutlass::gemm::GemmShape<1, 1, 1>;

// This code section describes how threadblocks are scheduled on GPU
using SwizzleThreadBlock =
    cutlass::conv::threadblock::DepthwiseDirect2dConvIdentityThreadblockSwizzle<
        1,
        ThreadBlockOutputShape::kN,
        ThreadBlockOutputShape::kH,
        ThreadBlockOutputShape::kW>;

// Number of pipelines you want to use
constexpr int NumStages = 4;

// This code section describes the selected iterator algorithm: kFixedStrideDilation
static cutlass::conv::IteratorAlgorithm const IteratorAlgorithm =
    cutlass::conv::IteratorAlgorithm::kFixedStrideDilation;
using StrideShape = cutlass::MatrixShape<1, 1>;
using DilationShape = cutlass::MatrixShape<1, 1>;

constexpr int kEpilogueElementsPerAccess = 128 / cutlass::sizeof_bits<ElementOutput>::value;
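// For int8 output this evaluates to 128 / 8 = 16 elements per vectorized access
// (versus 8 elements for the original half-precision output).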

// This code section describes the epilogue part of the kernel, we use default value
using EpilogueOp = cutlass::epilogue::thread::LinearCombination<
    ElementOutput,               // Data type of output matrix.
    kEpilogueElementsPerAccess,  // The number of elements per vectorized
                                 // memory access. This becomes the vector width of
                                 // math instructions in the epilogue too.
    ElementAccumulator,          // Data type of accumulator
    ElementComputeEpilogue,      // Data type for alpha/beta in linear combination
    cutlass::epilogue::thread::ScaleType::OnlyAlphaScaling>;  // Epilogue scaling operation.

using DepthwiseDirect2dConv = typename cutlass::conv::kernel::DefaultDepthwiseDirect2dConvFprop<
    ElementInputA,
    LayoutInputA,
    ElementInputB,
    LayoutInputB,
    ElementOutput,
    LayoutOutput,
    ElementAccumulator,
    MMAOp,
    SmArch,
    ThreadblockShape,
    ThreadBlockOutputShape,
    FilterShape,
    WarpShape,
    InstructionShape,
    EpilogueOp,
    SwizzleThreadBlock,
    NumStages,
    cutlass::arch::OpMultiplyAdd,
    IteratorAlgorithm,
    cutlass::conv::StrideSupport::kFixed,
    StrideShape,
    DilationShape>::Kernel;

using Direct2dConv = cutlass::conv::device::DirectConvolution<DepthwiseDirect2dConv>;
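The rest of the host code is unchanged from the example. For reference, a minimal sketch of the launch flow (tensor names, the Conv2dProblemSize constructor, the CUTLASS_CHECK helper, and the Arguments field order follow my reading of example 46 and may differ in detail; HostTensor allocation and initialization are omitted):

// Host-side launch flow, sketched after example 46_depthwise_simt_conv2dfprop.
// Tensor names and the Arguments field order are assumptions based on that
// example; HostTensor allocation and fill code are omitted.
cutlass::Tensor4DCoord input_size(1, 160, 160, 96);   // activation N, H, W, C
cutlass::Tensor4DCoord filter_size(96, 3, 3, 1);      // filter K, R, S, C-per-group (depthwise)
cutlass::Tensor4DCoord padding(1, 1, 1, 1);
cutlass::MatrixCoord conv_stride(1, 1);
cutlass::MatrixCoord dilation(1, 1);

cutlass::conv::Conv2dProblemSize problem_size(
    input_size,
    filter_size,
    padding,
    conv_stride,
    dilation,
    cutlass::conv::Mode::kCrossCorrelation,
    1,                                     // split_k_slices
    96);                                   // groups

Direct2dConv::Arguments arguments{
    problem_size,
    tensor_a.device_ref(),                 // int8 activation
    tensor_b.device_ref(),                 // int8 filter
    tensor_c.device_ref(),                 // int8 source
    tensor_d.device_ref(),                 // int8 output
    {ElementComputeEpilogue(1), ElementComputeEpilogue(0)},  // alpha, beta
    tensor_reordered_b.device_ref()};      // filter tensor reordered for the direct conv

Direct2dConv conv_op;
size_t workspace_size = conv_op.get_workspace_size(arguments);
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);

CUTLASS_CHECK(conv_op.can_implement(arguments));
CUTLASS_CHECK(conv_op.initialize(arguments, workspace.get()));
CUTLASS_CHECK(conv_op());                  // launch; the illegal access occurs here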

Observed output

cudaEventSynchronize() failed: an illegal memory access was encountered
Layer,N,H,W,C,K,R,S,G,stride_h,stride_w,dilation_h,dilation_w,splitK,Runtime,GFLOPs
conv_1,1,160,160,96,96,3,3,96,1,1,1,1,1,0,0

cuda-gdb output for code compiled with "-G":

Thread 1 "dwConvRepro" received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.
[Switching focus to CUDA kernel 0, grid 2, block (0,0,0), thread (96,0,0), device 0, sm 0, warp 0, lane 0]
0x00007fffc5355070 in cutlass::epilogue::warp::TileIteratorSimtDirect2dConv<cutlass::gemm::GemmShape<16, 64, 9>, cutlass::conv::TensorNHWCShape<1, 2, 4, 4>, cutlass::conv::TensorNHWCShape<1, 8, 8, 64>, cutlass::conv::thread::DepthwiseDirectConvElementwiseInnerProduct<cutlass::gemm::GemmShape<8, 4, 1>, signed char, signed char, int, cutlass::arch::OpMultiplyAdd, bool>, int, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaSimtPolicy<cutlass::MatrixShape<2, 16>, cutlass::layout::RowMajorInterleaved<1>, cutlass::gemm::GemmShape<4, 4, 1> > >::store_with_pointer_offset (this=0x7fffe3fffc70, frag=..., pointer_offset=0)
    at /mnt/storage/env/cutlass/epilogue/warp/tile_iterator_simt.h:521
521               storer_pointer_[offset + pointer_offset / int(AccessType::kElements)] =

Expected behavior

Successful launch of the example, with runtime and GFLOPs output comparable to the original "46_depthwise_simt_conv2dfprop" example.

Additional context

Convolution problem size: activation [1, 96, 160, 160], filter [96, 1, 3, 3], stride [1, 1], padding [1, 1], dilation [1, 1].

A possible cause is that the shared memory size computed in the kernel parameters is insufficient; if we change the smem size to a larger block (for example, 32 KB), the kernel runs successfully.

https://github.com/NVIDIA/cutlass/blob/ffa34e70756b0bc744e1dfcc115b5a991a68f132/include/cutlass/conv/device/direct_convolution.h#L228

hwu36 commented 6 months ago

@Ethan-Yan27

Ethan-Yan27 commented 6 months ago

@lxq2t.
https://github.com/NVIDIA/cutlass/blob/ffa34e70756b0bc744e1dfcc115b5a991a68f132/include/cutlass/conv/kernel/direct_convolution.h#L158 Please update this line in your local copy as shown below for a quick fix of the smem size issue.

smem_size_ = (max(iterator_A.activation_size, int(sizeof(typename Epilogue::SharedStorage))) * kStages + iterator_B.filter_size);

This kernel initially targeted fp16, so it is not surprising that you are hitting issues with int8 input.
To fully support int8 input, you need to make sure a few things work properly. Code contributions are welcome.

Thanks.
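For intuition, here is a rough back-of-envelope for the modified configuration. The byte-level meaning of iterator_A.activation_size and of the epilogue SharedStorage below is my assumption, not read from the CUTLASS sources, so treat the numbers as estimates only:

// Rough shared-memory estimate for the int8/int32 repro configuration.
constexpr int kStagesLocal  = 4;   // NumStages in the repro
constexpr int kGroupsPerCta = 32;  // groups_per_cta in the repro

// One stage of the staged activation tile: an 8x8 output tile with a 3x3
// filter at stride 1 reads a 10x10 input patch per group, stored as int8.
constexpr int kActivationStageBytes = 10 * 10 * kGroupsPerCta * 1;                   // 3200 B

// Epilogue shared storage: the 1x8x8xG threadblock output tile held as int32.
constexpr int kEpilogueBytes = 1 * 8 * 8 * kGroupsPerCta * 4;                        // 8192 B

constexpr int kFilterBytes = 3 * 3 * kGroupsPerCta * 1;                              //  288 B

// Original sizing (activation only) vs. the epilogue-aware fix above:
constexpr int kSmemOriginal = kActivationStageBytes * kStagesLocal + kFilterBytes;   // ~12.8 KB
constexpr int kSmemFixed =
    (kEpilogueBytes > kActivationStageBytes ? kEpilogueBytes : kActivationStageBytes)
        * kStagesLocal + kFilterBytes;                                               // ~32.3 KB

Under these assumptions the original formula budgets about 13 KB while the epilogue-aware formula needs roughly 33 KB, which lines up with the ~32 KB that the reporter found sufficient.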

Ethan-Yan27 commented 6 months ago

Also, here is a comment that explains the basic idea of the current depthwise implementation. Hope it helps: https://github.com/NVIDIA/cutlass/issues/1133#issuecomment-1756668121

lxq2t commented 6 months ago

@Ethan-Yan27 thank you, after applying the proposed fix the problem is resolved.

We have not encountered any other correctness issues with depthwise convolution using int8 input and an int32 accumulator.
