NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

cutlass calculate matrix size problem #75

Closed wanghr323 closed 4 years ago

wanghr323 commented 4 years ago

I wrote a function using CUTLASS to test the performance of a CUTLASS GEMM (int8 inputs, int32 output), but I have found a problem: the M, N, and K parameters cannot be chosen arbitrarily. N and K must be multiples of 16; choosing anything else causes an error. Is there something wrong with how I wrote this function?

int Int8Operator::cutlass_gemm32I_tensorop(const CBLAS_TRANSPOSE TransA,
    const CBLAS_TRANSPOSE TransB, const int M, const int N, const int K,
    const void *alpha, const void *A, const void *B, const void *beta,
    void *C, cublasGemmAlgo_t algo /* not used */)
{
    using A_Major = cutlass::layout::ColumnMajor;
    using B_Major = cutlass::layout::ColumnMajor;
    using ElementOutput = int32_t;
    using ElementAccumulator = int32_t;

    int lda = (TransA == CblasNoTrans) ? K : M;
    int ldb = (TransB == CblasNoTrans) ? N : K;
    int ldc = N;

    using Gemm = cutlass::gemm::device::Gemm<
        int8_t,
        A_Major,
        int8_t,
        B_Major,
        ElementOutput,
        cutlass::layout::RowMajor,
        ElementAccumulator,
        cutlass::arch::OpClassWmmaTensorOp,
        cutlass::arch::Sm75,
        cutlass::gemm::GemmShape<128, 128, 32>,
        cutlass::gemm::GemmShape<64, 64, 32>,
        cutlass::gemm::GemmShape<16, 16, 16>,
        cutlass::epilogue::thread::LinearCombination<
            ElementOutput,
            128 / cutlass::sizeof_bits<ElementOutput>::value,
            ElementAccumulator,
            ElementAccumulator
        >,
        cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle,
        2
    >;

    Gemm gemm_op;
    int alpha_ = *(static_cast<const int *>(alpha));
    int beta_ = *(static_cast<const int *>(beta));

    cutlass::Status status = gemm_op({
        {M, N, K},
        {static_cast<const int8_t *>(A), lda},
        {static_cast<const int8_t *>(B), ldb},
        {static_cast<int *>(C), ldc},
        {static_cast<int *>(C), ldc},
        {alpha_, beta_}
    });

    if (status != cutlass::Status::kSuccess) {
        return cudaErrorUnknown;
    }
    return cudaSuccess;
}
kerrmudgeon commented 4 years ago

This is expected behavior. As a performance optimization, most kernels targeting Tensor Cores require 128-bit aligned memory accesses. This enables vectorized memory accesses, fewer address computations, and greater efficiency. The same requirement is documented for cuBLAS and described in greater detail here: https://docs.nvidia.com/cuda/cublas/index.html#tensorop-restrictions.
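
For 8-bit operands this works out to 16 elements per 128-bit access, which is where the multiple-of-16 restriction on the leading dimensions comes from. A rough sketch of the arithmetic (the constant names here are illustrative, not CUTLASS definitions):

    #include "cutlass/numeric_types.h"

    // 128-bit vectorized accesses over 8-bit elements
    constexpr int kAccessBits = 128;
    constexpr int kAlignmentInt8 =
        kAccessBits / cutlass::sizeof_bits<int8_t>::value;   // = 16 elements

    // Leading dimensions of int8 operands therefore need to be multiples of
    // 16 elements; for 16-bit element types the same rule gives 8 elements.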

Here are some additional suggestions for higher performance on the Turing architecture:

  1. Target Tensor Cores natively with cutlass::arch::OpClassTensorOp rather than WMMA. This utilizes features new in CUDA 10.2.

using Gemm = cutlass::gemm::device::Gemm<
    int8_t,
    cutlass::layout::RowMajor,
    int8_t,
    cutlass::layout::ColumnMajor,
    ElementOutput,
    cutlass::layout::RowMajor,
    ElementAccumulator,
    cutlass::arch::OpClassTensorOp,
    cutlass::arch::Sm75,
    cutlass::gemm::GemmShape<128, 128, 32>,
    cutlass::gemm::GemmShape<64, 64, 32>,
    cutlass::gemm::GemmShape<8, 8, 16>
>;

  2. Adapt your application to structure the A matrix as row-major.

wanghr323 commented 4 years ago

OK, that is to say, at least two of M, N, and K should be multiples of 16. Thank you, I will close the issue.

wanghr323 commented 4 years ago

Thank you for your reply, Kerr. I now have a requirement in my job: computing C (int32) = A (int8) × B (int8), where I want A, B, and C to be row-major matrices; A is M × K, B is K × N, and C is M × N. I can guarantee that K is a multiple of 16, and M can be converted to a multiple of 16 (being able to choose it arbitrarily would be best, but it is fine if not), but N must be an arbitrary number. How do I achieve this with CUTLASS? I tested all the combinations in CUTLASS: if A, B, and C are row-major, then N and K must be multiples of 16. If I instead compute B^T × A^T (selecting column-major for A, B, and C and swapping the operands), then M and N trade places; in that case N can be chosen freely but M must be a multiple of 16, which still does not solve my problem. Can this problem be solved with CUTLASS? It is fine even without Tensor Core ops, or even without WMMA.

kerrmudgeon commented 4 years ago

Here are three possible recourses:

1.) Padding.

Size the matrices such that their dimensions are divisible by 16 and initialize the extra elements to zero.
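
For example (a minimal sketch assuming CUDA runtime allocation; the variable names are illustrative): round the dimensions up to the next multiple of 16, zero-fill the allocation, copy the real data into the top-left tile, and launch the GEMM with the padded sizes. Zero padding along K leaves the result unchanged, and the extra rows and columns of C can simply be ignored.

    #include <cuda_runtime.h>

    int const kAlign = 16;
    int padded_N = (N + kAlign - 1) / kAlign * kAlign;   // round N up to a multiple of 16
    int padded_K = (K + kAlign - 1) / kAlign * kAlign;   // round K up to a multiple of 16

    int8_t *B_padded = nullptr;
    cudaMalloc(&B_padded, sizeof(int8_t) * padded_K * padded_N);
    cudaMemset(B_padded, 0, sizeof(int8_t) * padded_K * padded_N);  // extra elements are zero
    // ... copy the original K x N contents of B into B_padded, then run the
    // GEMM with problem size {M, padded_N, padded_K}.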

2.) Reduce the alignment requirement at the expense of performance.

The device-level GEMM API accepts an admittedly long list of template arguments including the alignment constraints.

https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/device/gemm.h#L201

using Gemm = cutlass::gemm::device::Gemm<
      int8_t,                           // ElementA
      cutlass::layout::RowMajor,        // LayoutA
      int8_t,                           // ElementB
      cutlass::layout::ColumnMajor,     // LayoutB
      ElementOutput,                    // ElementC
      cutlass::layout::RowMajor,        // LayoutC
      ElementAccumulator,
      cutlass::arch::OpClassTensorOp,
      cutlass::arch::Sm75,
      cutlass::gemm::GemmShape<128, 128, 64>,
      cutlass::gemm::GemmShape<64, 64, 64>,
      cutlass::gemm::GemmShape<8, 8, 16>,
      cutlass::epilogue::thread::LinearCombination<
        ElementOutput,
        1,     // alignment of C in units of elements
        ElementAccumulator,
        ElementAccumulator
      >,
      cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle,
      2,     // pipeline stages
      1,     // alignment of A in units of elements
      1      // alignment of B in units of elements
  >;
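
With the alignment template arguments reduced to 1, N no longer needs to be a multiple of 16, at the cost of some memory throughput. A rough invocation sketch, assuming M, N, K, the device pointers A, B, C, and the integer scalars alpha_ and beta_ are set up as in the function from the original post:

    Gemm gemm_op;

    cutlass::Status status = gemm_op({
        {M, N, K},                               // arbitrary problem size
        {static_cast<const int8_t *>(A), K},     // A, row-major (M x K):    lda = K
        {static_cast<const int8_t *>(B), K},     // B, column-major (K x N): ldb = K
        {static_cast<int32_t *>(C), N},          // C source, row-major:     ldc = N
        {static_cast<int32_t *>(C), N},          // D destination:           ldd = N
        {alpha_, beta_}                          // int32 epilogue scalars
    });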

3.) Use the integer-valued SIMT kernels.

You may consider using a kernel targeting the integer dot product "dp4a" instruction, first available in the Pascal architecture.

Here is the definition syntax, visible in unit tests for these kernels. https://github.com/NVIDIA/cutlass/blob/master/test/unit/gemm/device/simt_int8_igemm_sm61.cu

  // Output data type - may be int8_t or int32_t
  using ElementOutput = int8_t;

  // Accumulator data type
  using ElementAccumulator = int32_t;

  // Scalar data type
  using ElementCompute = float;

  // Instruction shape - describes a 1x1x4 integer dot product computed by
  // the "dp4a" instruction.
  using InstructionShape = cutlass::gemm::GemmShape<1, 1, 4>;

  // Threadblock and warp tile shapes (example sizes; see the unit test
  // linked above for the values used there)
  using ThreadBlockShape = cutlass::gemm::GemmShape<128, 128, 32>;
  using WarpShape = cutlass::gemm::GemmShape<32, 64, 32>;

  using Gemm = cutlass::gemm::device::Gemm<
    int8_t,
    cutlass::layout::ColumnMajor,
    int8_t,
    cutlass::layout::ColumnMajor,
    ElementOutput,
    cutlass::layout::RowMajor,
    int32_t,
    cutlass::arch::OpClassSimt,
    cutlass::arch::Sm61,
    ThreadBlockShape,
    WarpShape,
    InstructionShape
  >;

There is no restriction on M, N, or K, but the matrices themselves must be 32b aligned. That is, pointers and leading dimensions must be divisible by 4 bytes.
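
A minimal helper to check that requirement might look like this (illustrative only, not part of CUTLASS):

    #include <cstdint>

    // True if an operand satisfies the 32b rule: base pointer and leading
    // dimension (converted to bytes) are both divisible by 4.
    bool is_32b_aligned(const void *ptr, int leading_dim, int element_bytes) {
      return reinterpret_cast<std::uintptr_t>(ptr) % 4 == 0 &&
             (leading_dim * element_bytes) % 4 == 0;
    }

    // e.g. for int8_t operands: is_32b_aligned(A, lda, 1), and likewise for B and C.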

wanghr323 commented 4 years ago

Thank you for your help. I will close the issue.