NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

How to use CUTLASS GEMM for small matrix #293

Closed ju9379 closed 2 years ago

ju9379 commented 3 years ago

Hi,

I would like to execute a CUTLASS GEMM (A*B + C) that uses the Tensor Cores on my Volta architecture. It is a 64x6x123 (MxNxK) GEMM.

I am looking at the GEMM with the following properties:

  Opcode Class : TensorOp
  DataType : fp16 * fp16 + fp16 = fp16
  Compute Capability : 75

There is no instruction shape of 64x8x128 for this Opcode Class (see README > Functionality), but I could decompose it into multiple 8x8x128 GEMMs.

The thing is, I have small matrices to process. Usually, CUTLASS decomposes a big GEMM into smaller GEMMs at three levels (threadblock, warp and instruction). Is this decomposition supported for small matrices? What is the smallest GEMM we can do? Is it possible to execute a CUTLASS GEMM directly at the instruction level? Are there simple examples you could provide that work for small matrices?

Also: do you have a simple example in which the user passes their own matrices to the GEMM function instead of using "host_tensor.h"? Is it possible to call a CUTLASS GEMM from a CUDA kernel?

Thanks in advance, Julie


hwu36 commented 3 years ago

To begin with, you can use threadblock tile size 64x64x32 and warp tile size 32x32x32 to run your problem size. I noticed your problem size is not a multiple of 8 in the N and K dimensions. Usually you get the best performance when they are multiples of 8. You can pad your matrices to achieve this. If you cannot do padding, you can set the alignment to 1 instead. You can take a look at this example of how to run a Volta gemm: https://github.com/NVIDIA/cutlass/tree/master/examples/07_volta_tensorop_gemm.
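As an illustration only (a sketch, not code from the thread), a device-level GEMM type with those tile sizes could be declared roughly like this; the fp16/column-major element and layout choices and the alignment-1 epilogue count are assumptions for this example:

    // Sketch: fp16 Volta tensor-op GEMM with threadblock 64x64x32, warp 32x32x32,
    // instruction 8x8x4, and alignment 1 for A/B so N and K need not be multiples of 8.
    using Gemm = cutlass::gemm::device::Gemm<
        cutlass::half_t, cutlass::layout::ColumnMajor,   // A
        cutlass::half_t, cutlass::layout::ColumnMajor,   // B
        cutlass::half_t, cutlass::layout::ColumnMajor,   // C and D
        cutlass::half_t,                                 // accumulator
        cutlass::arch::OpClassTensorOp,                  // use tensor cores
        cutlass::arch::Sm70,                             // Volta
        cutlass::gemm::GemmShape<64, 64, 32>,            // threadblock tile
        cutlass::gemm::GemmShape<32, 32, 32>,            // warp tile
        cutlass::gemm::GemmShape<8, 8, 4>,               // instruction (mma) tile
        cutlass::epilogue::thread::LinearCombination<
            cutlass::half_t, 1, cutlass::half_t, cutlass::half_t>,  // alignment-1 output
        cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
        2,       // pipeline stages
        1, 1>;   // alignment of A and B set to 1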

Do you have a simple example in which the user passes their own matrices to the GEMM function instead of using "host_tensor.h"? Is it possible to call a CUTLASS GEMM from a CUDA kernel?

To run a gemm, you just need a device-side pointer and a stride (which is essentially a Layout); pass them to a TensorRef. Use this constructor: https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/tensor_ref.h#L206-L213.

Example code is like this:

    // Build a layout from the matrix's leading dimension (stride), then wrap
    // the raw device pointer and the layout in a TensorRef.
    cutlass::layout::RowMajor layout(stride);

    ElementOutput *ptr = static_cast<ElementOutput *>(pointer);
    cutlass::TensorRef<ElementOutput, cutlass::layout::RowMajor> tensorref(ptr, layout);

ju9379 commented 3 years ago

Thanks for your answer @hwu36,

First, I am sorry, I inverted the dimensions of matrices A and B. The correct dimensions are: matrix A is 64x6 and matrix B is 6x123. I already plan to pad the N and K dimensions and do a 64x8x128 GEMM.

I looked at the example, but the dimensions are too large for my case, and so much padding would be a waste. It would be great to have at least N = 8. My question is: could we have an input problem size smaller than the threadblock tile size? If not, is it possible to execute a CUTLASS GEMM directly at the instruction level/size?

My goal is to implement a 1D-convolution that uses Tensor Cores (A represents the filters, B represents the shifted signal input).

PS: I noticed a contradiction in the volta_tensorop_gemm.cu example. In the comments, it says:

"we create template variables of tile sizes for thread-block, warp and mma-op to 128x128x32,
64x64x4, 8x8x4 (MxNxK) respectively"

In the code :

using ShapeMMAThreadBlock = cutlass::gemm::GemmShape<128, 128, 32>;  // <- threadblock tile M = 128, N = 128, K = 32
using ShapeMMAWarp = cutlass::gemm::GemmShape<64, 64, 32>;  // <- warp tile M = 64, N = 64, K = 32
using ShapeMMAOp = cutlass::gemm::GemmShape<8, 8, 4>;  // <- MMA Op tile M = 8, N = 8, K = 4

In README > Functionality, the instruction shape 8x8x4 is not paired with 64x64x32 but with 64x64x4.

Also, is there a reason why I did not find a correlation table between threadblock tile shapes and warp tile shapes?

hwu36 commented 3 years ago

could we have an input problem size smaller than the threadblock tile size?

Yes. As I said earlier, you can use threadblock tile size 64x64x32, warp tile size 32x32x32.

The comment in the Volta gemm example is outdated; the warp tile K is 32, not 4.

Also, is there a reason why I did not find a correlation table between threadblock tile shapes and warp tile shapes?

A threadblock tile can work with several different warp tiles. Generally, we'd like to use 4 or 8 warps per threadblock. Common warp tiles are 64x64, 64x32, 32x64 and 32x32.
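As a quick illustration (not from the thread), the warp count per threadblock is just the product of the M and N tile ratios:

    // Warps per threadblock = (ThreadblockM / WarpM) * (ThreadblockN / WarpN)
    constexpr int kWarps_128x128_with_64x64 = (128 / 64) * (128 / 64);  // 4 warps
    constexpr int kWarps_128x128_with_64x32 = (128 / 64) * (128 / 32);  // 8 warps
    constexpr int kWarps_64x64_with_32x32   = ( 64 / 32) * ( 64 / 32);  // 4 warps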

ju9379 commented 3 years ago

Thanks for your answer. I started from the 07_volta_tensorop_gemm.cu example and modified it. Following your indications, I created 4 TensorRefs, one for each matrix A, B, C and D (D = alpha * A * B + beta * C), by passing their respective pointer and stride (I understood that the stride is the number of rows for a column-major matrix and the number of columns for a row-major matrix):

    cutlass::layout::ColumnMajor layout_a(stride_a);
    ElementInputA *ptr_a = static_cast<ElementInputA *>(pointer_a);
    cutlass::TensorRef<ElementInputA, cutlass::layout::ColumnMajor> tensor_a(ptr_a, layout_a);

Then I pass the 4 TensorRefs directly to the GEMM kernel arguments:

typename Gemm::Arguments arguments{problem_size,  // <- problem size of matrix multiplication
                                     tensor_a,  // <- reference to matrix A on device
                                     tensor_b,  // <- reference to matrix B on device
                                     tensor_c,  // <- reference to matrix C on device
                                     tensor_d,  // <- reference to matrix D on device
                                     {alpha, beta},          // <- tuple of alpha and beta
                                     split_k_slices};        // <- k-dimension split factor

Then I launch the CUTLASS kernel as shown. But I would like to verify the outputs: how do I access the pointer to the output matrix from tensor_d? Do I need to copy tensor_d from device to host?

Thanks in advance,

hwu36 commented 3 years ago

You can get the device-side pointer with tensor_d.data(). You can modify the data on the device through this pointer, or copy it back to the host and work with it on the host side.
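A minimal sketch of the copy-back path, reusing the tensor_d and problem_size names from the snippets above (needs <vector>):

    // Copy the result matrix D back to the host to inspect it.
    std::vector<ElementOutput> host_d(problem_size.m() * problem_size.n());
    cudaMemcpy(host_d.data(), tensor_d.data(),
               host_d.size() * sizeof(ElementOutput),
               cudaMemcpyDeviceToHost);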

ju9379 commented 3 years ago

Hi @hwu36, thanks, the gemm runs fine. I used threadblock tile size 64x64x32, warp tile size 32x32x32 and instruction shape 8x8x4, as you recommended. But I still have some concerns with this configuration.

  1. I don't understand how the GEMM decomposition works when the input problem size is smaller than the threadblock tile size. Is padding automatically added to fill the threadblock tile and respect the CUTLASS GEMM decomposition hierarchy (link), or is the decomposition different and limited to the warp or instruction level?

  2. Considering the first case, the efficiency would be awful with so many zeros added. As I said earlier, I really would like a K gemm tile dimension of at least 8; I do not need big square tiles but flat ones. Is it possible to use only the instruction level? I could reconstruct the output myself afterwards. Do you think I need to use lower-level implementations, or to drive the tensor cores directly?

Thanks in advance,

hwu36 commented 3 years ago

The flow of a normal gemm is: load tiles of A and B from global memory into shared memory, load fragments from shared memory into registers, issue the tensor core instructions, and write the accumulators back to global memory in the epilogue.

When a global address is out of bounds, the load does not happen, but 0 is written to shared memory. Since your N is 8, many threads in the threadblock are not loading any data, which is a waste for your memory-bound kernel. Moreover, you are only using 1 SM; the other 79 are idle.

However, Volta tensor cores are not as easy to use as Turing/Ampere; cutlass requires the warp tile to be at least 32x32 (https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/warp/mma_tensor_op_sm70.h#L125). Maybe you can try setting the threadblock tile size equal to the warp tile size, i.e. 32x32. It may or may not work.

You can also try split-K. If you set the number of split-K slices to 4, you can use 4x the SMs.
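In terms of the arguments shown earlier, that just means raising the slice count (a sketch; it assumes a Gemm type built with split-K support, for example cutlass::gemm::device::GemmSplitKParallel as used in example 06):

    // Split the K dimension across 4 threadblocks so 4x as many SMs get work.
    int split_k_slices = 4;
    typename Gemm::Arguments arguments{problem_size,
                                       tensor_a, tensor_b, tensor_c, tensor_d,
                                       {alpha, beta},
                                       split_k_slices};
    // Split-K needs workspace for the partial products; re-query its size.
    size_t workspace_size = Gemm::get_workspace_size(arguments);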

If you really want to get your hands dirty, you can directly use the 8x8x4 tensor core PTX instructions in a different way than CUTLASS does.

ocwins commented 3 years ago

If you really want to get your hands dirty, you can directly use the 8x8x4 tensor core PTX instructions in a different way than CUTLASS does.

Hi @hwu36,

I have a question about these ptx instructions.

Do two m16n8k16 instructions have similar throughput to one m16n8k32?

If I have a 16x8x16 GEMM problem, does m16n8k16 take only half of the time m16n8k32 takes to compute it?

hwu36 commented 3 years ago

Are you saying your problem size is just 16x8x16? If so, just load the data directly into registers and then call the 16x8x16 tensor core PTX instruction. You don't really need cutlass.

In normal cases, just always use the biggest tensor core instructions.
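For reference, a raw tensor core call is only a few lines of inline PTX. This sketch (not from the thread) assumes fp16 operands and accumulators, an sm_80 target for the m16n8k16 shape, and per-thread fragments already filled in the layout the PTX ISA defines (e.g. via ldmatrix):

    // One warp-wide m16n8k16 fp16 MMA; a, b, c, d are per-thread register fragments.
    __device__ void mma_m16n8k16_f16(unsigned d[2], unsigned const a[4],
                                     unsigned const b[2], unsigned const c[2]) {
      asm volatile(
          "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
          "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
          : "=r"(d[0]), "=r"(d[1])
          : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
            "r"(b[0]), "r"(b[1]),
            "r"(c[0]), "r"(c[1]));
    }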

Andrea-usrgit commented 3 years ago

Hi,

I have the same problem exposed previously by ju9379: I would like to use a CUTLASS GEMM for small matrices (16x16x16). Can I use it with half-precision variables by including cuda_fp16? Is it necessary to use cutlass to do matrix multiplication with tensor cores?

ju9379 commented 3 years ago

Hi @hwu36, Thanks, your answer is clear. I am considering getting my hands dirty as it seems less of a waste.

I have a last question to satisfy my curiosity. Given the threadblock tile dimensions, CUTLASS is suited to processing GEMMs on big matrices. I wonder which applications require processing GEMMs with a big K dimension (MxNxK)?

@Andrea-usrgit, I am sure hwu36 will give an answer from his greater experience. I think you could use WMMA directly; it supports 16x16x16, 32x8x16 and 8x32x16 GEMMs (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma-type-sizes). In my case it is not appropriate because the K dimension is 16 and not 8.

@hwu36 thanks again for the help you provided on this chat!
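For reference, a minimal sketch of the WMMA path mentioned above, assuming a single 16x16x16 tile with row-major fp16 inputs and fp32 accumulation:

    #include <mma.h>
    using namespace nvcuda;

    // One warp computes C = A * B for a single 16x16x16 tile.
    __global__ void wmma_16x16x16(half const *a, half const *b, float *c) {
      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

      wmma::fill_fragment(c_frag, 0.0f);
      wmma::load_matrix_sync(a_frag, a, 16);  // leading dimension 16
      wmma::load_matrix_sync(b_frag, b, 16);
      wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
      wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
    }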

hwu36 commented 3 years ago

@ju9379

I wonder which applications require the processing of GEMM with a big K dimension ? (MxNxK)

Most applications require a not-tiny K. For example, BERT requires 1k - 4k elements in the K dimension.

@Andrea-usrgit

Can i use this with a variable type of half by including cuda_fp16?

Yes, use cutlass::half_t rather than cuda_fp16 directly. The cutlass type also works in host-side code.

Is it necessary to use cutlass to make matrix multiplication with tensor cores?

You don't have to use all the cutlass threadblock and warp level code to do a 16x16x16. You can just use the PTX mma wrappers directly under the directory include/cutlass/arch. Don't use WMMA, because I think it is slower than using the normal cutlass gemm kernels even for this tiny problem size.

ju9379 commented 3 years ago

Hi @hwu36, My cutlass gemm has 12% achieved occupancy.
Reminder of my gemm parameters:

  threadblock tile M = 64, N = 64, K = 32
  warp tile M = 32, N = 32, K = 32
  MMA Op tile M = 8, N = 8, K = 4
  problem size = 64 x 128 x 8

I thought I could run multiple kernels of this gemm concurrently on different streams. I understood that gemm_op() (the function that runs the cutlass gemm) is a functor that takes a stream, so I use gemm_op(stream).

cudaStream_t streamA; cudaStreamCreateWithFlags(&streamA, cudaStreamNonBlocking);
cudaStream_t streamB; cudaStreamCreateWithFlags(&streamB, cudaStreamNonBlocking);
[...]
status = gemm_op.initialize(arguments, workspace.get(), streamA);
status = gemm_op.initialize(arguments, workspace.get(), streamB);
[...]
[cudaMemcpyAsyncs on stream A]
[cudaMemcpyAsyncs on stream B]
[...]
gemm_op(streamA);
gemm_op(streamB);

NB : I used the same gemm class object for both.

The kernels don't run in parallel.

[screenshot of the kernel timeline]

Could you give me some indications on how to run cutlass gemms concurrently, or provide an example?

Thx again,

donglinz commented 3 years ago

@hwu36 what's the difference between mma and wmma? I mean, between cutlass::arch::OpClassTensorOp and OpClassWmmaTensorOp? Does cutlass::arch::OpClassTensorOp use mma instructions under the hood whereas OpClassWmmaTensorOp uses wmma instructions? Why is wmma slower than the normal cutlass gemm kernels? Can you please elaborate a little bit more?

hwu36 commented 3 years ago

@ju9379 , your kernel is too tiny; its runtime is much smaller than the kernel launch overhead. Maybe you can try using CUDA graphs to reduce some of that overhead. Or can you use a batched gemm to get a bigger problem size?
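A minimal sketch of the CUDA graph route (CUDA 11-era API): capture the launches once, then replay them with much lower per-launch overhead:

    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;

    // Record the GEMM launch(es) into a graph instead of executing them.
    cudaStreamBeginCapture(streamA, cudaStreamCaptureModeGlobal);
    gemm_op(streamA);
    cudaStreamEndCapture(streamA, &graph);

    cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(graph_exec, streamA);  // replays the captured work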

@donglinz , cutlass::arch::OpClassTensorOp uses mma PTX and OpClassWmmaTensorOp uses the wmma API. wmma is easy to use but does not get all the optimizations we have when using mma PTX. No pain, no gain; it is just a tradeoff.

ocwins commented 3 years ago

Are you saying your problem size is just 16x8x16? If so, just load the data directly into registers and then call the 16x8x16 tensor core PTX instruction. You don't really need cutlass.

In normal cases, just always use the biggest tensor core instructions.

@hwu36, no, my problem size is bigger, but not big enough to use all SMs. So I think I can split it into 8x8x16 or 16x8x16 instead of 16x8x32 (and execute only one tensor op in each warp); then I could distribute the computation to more SMs.

But I don't know whether smaller tensor core instructions are faster and whether overall performance would improve this way.

hwu36 commented 3 years ago

If your tile is small, you just need a few threads and registers in your threadblock. The life of one threadblock is also very short. The device can launch multiple threadblocks on the same SM, so you will still have lots of SMs idle.

ju9379 commented 3 years ago

Hi @hwu36, thanks. I could use the batched_gemm.cu example, but I still need to use the tensor cores... Would it be correct to change all tensors to cutlass::half_t instead of float and modify the code like this?

cudaError_t cutlass_strided_batched_sgemm(
  int m, 
  int n,
  int k,
  float alpha,
  cutlass::half_t const *A,
  int lda,
  long long int batch_stride_A,
  cutlass::half_t const *B,
  int ldb,
  long long int batch_stride_B,
  cutlass::half_t *C,
  int ldc,
  long long int batch_stride_C,
  float beta,
  int batch_count) {

  // Element type aliases used by the epilogue below (they are not defined
  // elsewhere in this snippet).
  using ElementOutput = cutlass::half_t;
  using ElementAccumulator = cutlass::half_t;
  using ElementComputeEpilogue = cutlass::half_t;

  using Gemm = cutlass::gemm::device::GemmBatched<
    cutlass::half_t, // ElementInputA
    cutlass::layout::ColumnMajor, // LayoutInputA
    cutlass::half_t, // ElementInputB
    cutlass::layout::ColumnMajor, // LayoutInputB
    cutlass::half_t, // ElementOutput 
    cutlass::layout::ColumnMajor, // LayoutOutput,
    cutlass::half_t, // ElementAccumulator
    cutlass::arch::OpClassTensorOp, // MMAOp
    cutlass::arch::Sm70, // SmArch
    cutlass::gemm::GemmShape<64, 64, 32>, // ShapeMMAThreadBlock
    cutlass::gemm::GemmShape<32, 32, 32>, // ShapeMMAWarp
    cutlass::gemm::GemmShape<8, 8, 4>,  // ShapeMMAOp
    cutlass::epilogue::thread::LinearCombination<ElementOutput, 128 / cutlass::sizeof_bits<ElementOutput>::value, ElementAccumulator, ElementComputeEpilogue>, // EpilogueOp
    cutlass::gemm::threadblock::GemmBatchedIdentityThreadblockSwizzle, // SwizzleThreadBlock
    2 // NumStages 
  >;

  Gemm gemm_op;

  cutlass::Status status = gemm_op({
    {m, n, k},
    {A, lda}, 
    batch_stride_A,
    {B, ldb}, 
    batch_stride_B,
    {C, ldc}, 
    batch_stride_C,
    {C, ldc}, 
    batch_stride_C,
    {alpha, beta},
    batch_count
  });

  if (status != cutlass::Status::kSuccess) {
    return cudaErrorUnknown;
  }

  return cudaSuccess;
}
hwu36 commented 3 years ago

@ju9379 , you can use this.

Or you can use gemm_universal. Set the mode to GemmUniversalMode::kBatched. Example 18 uses gemm_universal.

ocwins commented 3 years ago

@hwu36

If your tile is small, you just need a few threads and registers in your threadblock. The life of one threadblock is also very short. The device can launch multiple threadblocks on the same SM, so you will still have lots of SMs idle.

Got it.

How many cycles does a tensor core instruction take (especially int8 16x8x32)? I'm trying to find a suitable layout/mma combination for my problems.

hwu36 commented 3 years ago

Small GEMMs are memory bound, so the math operations are not on the critical path. You just need to make sure you launch as many threads as needed to maximize memory bandwidth.