[QST] Limiting GEMM and specifying stream

nnaron commented 1 year ago

I have the following code for GEMM on A100 and I need to add the following feature to the code

1. adding a stream instead of the default stream Is this correct?

cutlass::Status status = gemm_operator(stream);

2. Changing the resource utilization to make free a part of the GPU. I mean limiting the GEMM kernel to have some free register, shared memory, CUDA core, or SM.

Dose cutlass::gemm::GemmShape help?

#include <cutlass/gemm/device/gemm.h>

/* 
 * Compute C = A*B using cutlass
*/
// Defines cutlass::gemm::device::Gemm, the generic Gemm computation template class.
cudaError_t cutlass_dgemm_nn(
  int M,
  int N,
  int K,
  double *ptr_alpha,
  double *A,
  int lda,
  double *B,
  int ldb,
  double *ptr_beta,
  double *C,
  int ldc) {

  double alpha = *ptr_alpha;
  double beta = *ptr_beta;

  // Define type definition for double-precision CUTLASS GEMM with column-major
  // input matrices and 128x128x8 threadblock tile size (chosen by default).
  //
  // To keep the interface manageable, several helpers are defined for plausible compositions
  // including the following example for double-precision GEMM. Typical values are used as
  // default template arguments. See `cutlass/gemm/device/default_gemm_configuration.h` for more details.
  //
  // To view the full gemm device API interface, see `cutlass/gemm/device/gemm.h`

  using ElementOutput = double;
  using ElementAccumulator = double;
  using ColumnMajor = cutlass::layout::ColumnMajor;
  using RowMajor = cutlass::layout::RowMajor;
  /*
  using CutlassGemm = cutlass::gemm::device::Gemm<
    double,        // Data-type of A matrix
    ColumnMajor,  // Layout of A matrix
    double,        // Data-type of B matrix
    ColumnMajor,  // Layout of B matrix
    double,        // Data-type of C matrix
    ColumnMajor>; // Layout of C matrix

  */
  using CutlassGemm = cutlass::gemm::device::Gemm< 
    double,        // Data-type of A matrix
    ColumnMajor,  // Layout of A matrix
    double,        // Data-type of B matrix
    ColumnMajor,  // Layout of B matrix
    ElementOutput,        // Data-type of C matrix
    ColumnMajor,    // Layout of C matrix    , LayoutC = layout::ColumnMajor;                       
    ElementAccumulator,                                     // ElementAccumulator
    cutlass::arch::OpClassTensorOp,            // tag indicating Tensor Cores
    cutlass::arch::Sm80,                        // tag indicating target GPU compute architecture
    cutlass::gemm::GemmShape<64, 64, 16>, // Shape to optimize
    cutlass::gemm::GemmShape<32, 32, 16>, // Shape to optimize
    cutlass::gemm::GemmShape<8, 8, 4> // Shape to optimize
    /*
    cutlass::epilogue::thread::LinearCombination<
      ElementOutput,
      1,
      ElementAccumulator,
      ElementAccumulator
    >,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
    4*/
  >; 

  // Define a CUTLASS GEMM type
  CutlassGemm gemm_operator;
  // Construct the CUTLASS GEMM arguments object.
  //
  // One of CUTLASS's design patterns is to define gemm argument objects that are constructible
  // in host code and passed to kernels by value. These may include pointers, strides, scalars,
  // and other arguments needed by Gemm and its components.
  //
  // The benefits of this pattern are (1.) a structured, composable strategy for passing host-constructible
  // arguments to kernels and (2.) minimized initialization overhead on kernel entry.
  //
  CutlassGemm::Arguments args({M , N, K},  // Gemm Problem dimensions
                              {A, lda},    // Tensor-ref for source matrix A
                              {B, ldb},    // Tensor-ref for source matrix B
                              {C, ldc},    // Tensor-ref for source matrix C
                              {C, ldc},    // Tensor-ref for destination matrix D (may be different memory than source C matrix)
                              {alpha, beta}); // Scalars used in the Epilogue
  //
  // Launch the CUTLASS GEMM kernel.
  //
  cutlass::Status status = gemm_operator(args);
  //
  // Return a cudaError_t if the CUTLASS GEMM operator returned an error code.
  //
  if (status != cutlass::Status::kSuccess) {
    return cudaErrorUnknown;
  }
  // Return success, if no errors were encountered.
  return cudaSuccess;
}

nnaron commented 1 year ago

Is following code correct to add support for non default stream ?

#include <cutlass/gemm/device/gemm.h>

/* 
 * Compute C = A*B using cutlass
*/
// Defines cutlass::gemm::device::Gemm, the generic Gemm computation template class.
cudaError_t cutlass_dgemm_nn(
cudaStream_t *stream_cut,
  int M,
  int N,
  int K,
  double *ptr_alpha,
  double *A,
  int lda,
  double *B,
  int ldb,
  double *ptr_beta,
  double *C,
  int ldc) {

  double alpha = *ptr_alpha;
  double beta = *ptr_beta;

  // Define type definition for double-precision CUTLASS GEMM with column-major
  // input matrices and 128x128x8 threadblock tile size (chosen by default).
  //
  // To keep the interface manageable, several helpers are defined for plausible compositions
  // including the following example for double-precision GEMM. Typical values are used as
  // default template arguments. See `cutlass/gemm/device/default_gemm_configuration.h` for more details.
  //
  // To view the full gemm device API interface, see `cutlass/gemm/device/gemm.h`

  using ElementOutput = double;
  using ElementAccumulator = double;
  using ColumnMajor = cutlass::layout::ColumnMajor;
  using RowMajor = cutlass::layout::RowMajor;
  /*
  using CutlassGemm = cutlass::gemm::device::Gemm<
    double,        // Data-type of A matrix
    ColumnMajor,  // Layout of A matrix
    double,        // Data-type of B matrix
    ColumnMajor,  // Layout of B matrix
    double,        // Data-type of C matrix
    ColumnMajor>; // Layout of C matrix

  */
  using CutlassGemm = cutlass::gemm::device::Gemm< 
    double,        // Data-type of A matrix
    ColumnMajor,  // Layout of A matrix
    double,        // Data-type of B matrix
    ColumnMajor,  // Layout of B matrix
    ElementOutput,        // Data-type of C matrix
    ColumnMajor,    // Layout of C matrix    , LayoutC = layout::ColumnMajor;                       
    ElementAccumulator,                                     // ElementAccumulator
    cutlass::arch::OpClassTensorOp,            // tag indicating Tensor Cores
    cutlass::arch::Sm80,                        // tag indicating target GPU compute architecture
    cutlass::gemm::GemmShape<64, 64, 16>, // Shape to optimize
    cutlass::gemm::GemmShape<32, 32, 16>, // Shape to optimize
    cutlass::gemm::GemmShape<8, 8, 4> // Shape to optimize
    /*
    cutlass::epilogue::thread::LinearCombination<
      ElementOutput,
      1,
      ElementAccumulator,
      ElementAccumulator
    >,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
    4*/
  >; 

  // Define a CUTLASS GEMM type
  CutlassGemm gemm_operator;
  cutlass::Status status = gemm_operator(stream_cut);
  // Construct the CUTLASS GEMM arguments object.
  //
  // One of CUTLASS's design patterns is to define gemm argument objects that are constructible
  // in host code and passed to kernels by value. These may include pointers, strides, scalars,
  // and other arguments needed by Gemm and its components.
  //
  // The benefits of this pattern are (1.) a structured, composable strategy for passing host-constructible
  // arguments to kernels and (2.) minimized initialization overhead on kernel entry.
  //
  CutlassGemm::Arguments args({M , N, K},  // Gemm Problem dimensions
                              {A, lda},    // Tensor-ref for source matrix A
                              {B, ldb},    // Tensor-ref for source matrix B
                              {C, ldc},    // Tensor-ref for source matrix C
                              {C, ldc},    // Tensor-ref for destination matrix D (may be different memory than source C matrix)
                              {alpha, beta}); // Scalars used in the Epilogue
  //
  // Launch the CUTLASS GEMM kernel.
  //

  cutlass::Status status = gemm_operator(args);
  //
  // Return a cudaError_t if the CUTLASS GEMM operator returned an error code.
  //
  if (status != cutlass::Status::kSuccess) {
    return cudaErrorUnknown;
  }
  // Return success, if no errors were encountered.
  return cudaSuccess;
}

hwu36 commented 1 year ago

you can pass your own stream like what you did.

Bigger tile size uses more registers and shared memory per threadblock, but less thread blocks.

nnaron commented 1 year ago

Bigger tile size uses more registers and shared memory per threadblock, but less thread blocks.

So does it means that just occupancy will increase and again all registers and shared memory will be used?

hwu36 commented 1 year ago

bigger tile size -> more registers and shared memory usage -> smaller occupancy which means less thread blocks can run on the same SM

mnicely commented 1 year ago

@nnaron is your issue resolved?

nnaron commented 1 year ago

@mnicely I do not have access to the GPU for testing. Let me check it before closing.

nnaron commented 1 year ago

Hi

For adding the stream I added like following line:

cutlass::Status status = gemm_operator(&stream_cut);

  // Define a CUTLASS GEMM type
  CutlassGemm gemm_operator;
  cutlass::Status status = gemm_operator(stream_cut);
  // Construct the CUTLASS GEMM arguments object.
  //
  // One of CUTLASS's design patterns is to define gemm argument objects that are constructible
  // in host code and passed to kernels by value. These may include pointers, strides, scalars,
  // and other arguments needed by Gemm and its components.
  //
  // The benefits of this pattern are (1.) a structured, composable strategy for passing host-constructible
  // arguments to kernels and (2.) minimized initialization overhead on kernel entry.
  //
  CutlassGemm::Arguments args({M , N, K},  // Gemm Problem dimensions
                              {A, lda},    // Tensor-ref for source matrix A
                              {B, ldb},    // Tensor-ref for source matrix B
                              {C, ldc},    // Tensor-ref for source matrix C
                              {C, ldc},    // Tensor-ref for destination matrix D (may be different memory than source C matrix)
                              {alpha, beta}); // Scalars used in the Epilogue

And I am seeing this error:


error: no instance of overloaded function "cutlass::gemm::device::Gemm<ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, cutlass::layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, SplitKSerial, Operator_, GatherA, GatherB, ScatterD, PermuteDLayout>::operator() [with ElementA_=double, LayoutA_=cutlass::layout::ColumnMajor, ElementB_=double, LayoutB_=cutlass::layout::ColumnMajor, ElementC_=double, ElementAccumulator_=double, OperatorClass_=cutlass::arch::OpClassTensorOp, ArchTag_=cutlass::arch::Sm80, ThreadblockShape_=cutlass::gemm::GemmShape<64, 64, 16>, WarpShape_=cutlass::gemm::GemmShape<32, 32, 16>, InstructionShape_=cutlass::gemm::GemmShape<8, 8, 4>, EpilogueOutputOp_=cutlass::epilogue::thread::LinearCombination<double, 2, double, double, cutlass::epilogue::thread::ScaleType::Default, cutlass::FloatRoundStyle::round_to_nearest, double>, ThreadblockSwizzle_=cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>, Stages=3, AlignmentA=1, AlignmentB=1, SplitKSerial=false, Operator_=cutlass::arch::OpMultiplyAdd, GatherA=false, GatherB=false, ScatterD=false, PermuteDLayout=cutlass::layout::NoPermute]" matches the argument list
            argument types are: (cudaStream_t *)
            object type is: CutlassGemm

hwu36 commented 1 year ago

see the parameter list here https://github.com/NVIDIA/cutlass/blob/7c04f954151f606e60608061e891785fba229ae2/include/cutlass/gemm/device/gemm.h#L507-L510

nnaron commented 1 year ago

bigger tile size -> more registers and shared memory usage -> smaller occupancy which means less thread blocks can run on the same SM

Thanks. Following configuration is considered as a big tile size?


        using ThreadblockShape = cutlass::gemm::GemmShape<16, 32, 8>;
        using WarpShape = cutlass::gemm::GemmShape<16, 32, 8>;
        static int const kEpilogueElementsPerAccess = 1;
        using InstructionShape = cutlass::gemm::GemmShape<1, 1, 1>;

For my purpose which kind of shape has more impact (keep some SM free or being able to run GEMM with other small kernel)? ThreadblockShape, WarpShape or InstructionShape.

hwu36 commented 1 year ago

your config is pretty small.

then you will use more threadblock with less occupancy.

running multiple very small kernels in many different streams in parallel is tricky. it is possible that kernels finish before the others start and no kernel work in parallel.

nnaron commented 1 year ago

I have one stream for GEMM and second stream for another kernel different than GEMM.

With playing with above configuration just the grid size and number of threads are changing, and second stream is not running anything.

What abot this config?

        using ThreadblockShape = cutlass::gemm::GemmShape<64, 64, 8>;
        using WarpShape = cutlass::gemm::GemmShape<16, 16, 8>;
        static int const kEpilogueElementsPerAccess = 1;
        using InstructionShape = cutlass::gemm::GemmShape<1, 1, 1>;

Now the occupancy is 50% (first config was 23.4%), but not advantage for running second kernel.

hwu36 commented 1 year ago

stream scheduling is in hardware/driver. when the scheduler seems it is not possible or not necessary or not beneficial to run two streams in parallel, the streams are not going to run in parallel.

using streams is not always faster especially when kernels are too big or too small.

tile sizes can only control the number of threadblocks and occupancy. the problem size will also impact the runtime of the kernel. if the runtime is too short, nothing will run in parallel.

nnaron commented 1 year ago

Thanks for support. Could you please give me a hint to understand how configuration is translated to grid and block size that I am seeing in nsys trace?

Also why the occupancy for last two config is equal?

        using ThreadblockShape = cutlass::gemm::GemmShape<16, 32, 8>;
        using WarpShape = cutlass::gemm::GemmShape<16, 32, 8>;
        static int const kEpilogueElementsPerAccess = 1;
        using InstructionShape = cutlass::gemm::GemmShape<1, 1, 1>;

        using ThreadblockShape = cutlass::gemm::GemmShape<64, 64, 8>;
        using WarpShape = cutlass::gemm::GemmShape<16, 16, 8>;
        static int const kEpilogueElementsPerAccess = 1;
        using InstructionShape = cutlass::gemm::GemmShape<1, 1, 1>;

        using ThreadblockShape = cutlass::gemm::GemmShape<128, 32, 8>;
        using WarpShape = cutlass::gemm::GemmShape<64, 16, 8>;
        static int const kEpilogueElementsPerAccess = 1;
        using InstructionShape = cutlass::gemm::GemmShape<1, 1, 1>;

hwu36 commented 1 year ago

the nsys of the last two show the same number of threads, same shared memory usage, same register usage. so the occupancy are the same.

hwu36 commented 1 year ago

block size: <threadblock_x / warp_x, threadblock_y / warp_y, threadblock_z / warp_z> grid size: <problem_x / threadblock_x, problem_y / threadblock_y, problem_z / threadblock_z>

nnaron commented 1 year ago

So for folowing config, the block size should be <1,1,1>?

16/16=1 32/32=1 8/8=1

        using ThreadblockShape = cutlass::gemm::GemmShape<16, 32, 8>;
        using WarpShape = cutlass::gemm::GemmShape<16, 32, 8>;

I am seeing <32,1,1> in the trace.

the nsys of the last two show the same number of threads, same shared memory usage, same register usage. so the occupancy are the same.

But the config is different, why block size is equal?

hwu36 commented 1 year ago

you could dump from https://github.com/NVIDIA/cutlass/blob/df02482f1d429281bf7dda498259939cf0e79d06/include/cutlass/gemm/device/gemm.h#L477-L478 to debug

nnaron commented 1 year ago

@hwu36 Thank you for your patience and answering my questions.

When you are talking about the tile, which tile do you mean? Thread Block Tile, Warp Tile, or Thread Tile?
I did not find GemmKernel::kThreadCountsource code. In which part of the repository it is?
What is the idea behind keeping the second and third dimension of block size equal to 1?
If CUTLASS by itself is changing the grid and block size, why do we are writing configuration?
The configuration that we are writing is referring to which part of this path?

In all example from test/unit/gemm/device/simt_dgemm_tt_sm50.cu The following part of the configuration is the same and 1 is repeating. What does it mean exactly?
```
static int const kEpilogueElementsPerAccess = 1;
using InstructionShape = cutlass::gemm::GemmShape<1, 1, 1>;
```

hwu36 commented 1 year ago

When you are talking about the tile, which tile do you mean? Thread Block Tile, Warp Tile, or Thread Tile?

I mean all of them.

I did not find GemmKernel::kThreadCountsource code. In which part of the repository it is?

https://github.com/NVIDIA/cutlass/blob/fcfbd23e26328df8c8b720a161ada95a3eb725e8/include/cutlass/gemm/kernel/gemm.h#L69

grep is useful in reading the code.

What is the idea behind keeping the second and third dimension of block size equal to 1?

It means we just need one warp or threadblock to handle N or K dimension data.

If CUTLASS by itself is changing the grid and block size, why do we are writing configuration?

grid size is decided by problem size and threadblock tile size. block size is decided by threadblock tile size and warp tile size. Tile size impacts the performance.

The configuration that we are writing is referring to which part of this path?

Those inside the green box

The following part of the configuration is the same and 1 is repeating. What does it mean exactly?

using InstructionShape = cutlass::gemm::GemmShape<1, 1, 1>; means every thread uses every instruction to handle 1x1x1 data each time

kEpilogueElementsPerAccess = 1 means every thread outputs one element in every store during the epilogue.

Our first two GTC talks several years ago discussed these concepts in detail.

nnaron commented 1 year ago

As I understand memory resources (Shared Memory and Registers) are limiting the GEMM. So if occupancy is 12.5% then all shared memory and registers are under utilization by 12.5% of warps. During that time what is happening for the rest 87.5% of warps? Are they completely idle?

mnicely commented 1 year ago

Yes, they will be idle. To better understand what is going on at that detail. You can use Nsight Compute. https://developer.nvidia.com/nsight-compute

There are many resources such as videos and blogs to help you better understand resource utilization

hwu36 commented 1 year ago

HW can schedule multiple threadblocks on the same SM if the scheduler thinks it is good.

nnaron commented 1 year ago

Thanks Matthew, but I do not have access Nsight Compute.

I have a question about writing configuration. Is it based on the shared memory or Max thread block size (1024). I found several in examples related to A100, but the last two of them successfully compiled without properly executing the GEMM operation.

// Threadblock tile shape // Warp tile shape // TensorCore instruction shape

C1:🟢

                cutlass::gemm::GemmShape<128, 128, 16>, // Shape to optimize
                cutlass::gemm::GemmShape<32, 64, 16>, // Shape to optimize
                cutlass::gemm::GemmShape<8, 8, 4> // Shape to optimize

C2: 🟢

                cutlass::gemm::GemmShape<128, 256, 128>, // Shape to optimize
                cutlass::gemm::GemmShape<64, 64, 128>, // Shape to optimize
                cutlass::gemm::GemmShape<16, 8, 4> // Shape to optimize

C3: 🔴

                cutlass::gemm::GemmShape<64, 64, 16>, // Shape to optimize
                cutlass::gemm::GemmShape<32, 32, 16>, // Shape to optimize
                cutlass::gemm::GemmShape<8, 8, 4> // Shape to optimize

C4: 🔴

               cutlass::gemm::GemmShape<128, 256, 64>, // Shape to optimize
                cutlass::gemm::GemmShape<64, 64, 64>, // Shape to optimize
                cutlass::gemm::GemmShape<16, 8, 16> // Shape to optimize

Also if I mention following line the FP64 GEMM will be run by CUDA Core for sure?

// cutlass::arch::OpClassSimt, // tag indicating Cuda Cores

Last question. Is it possible to change the size of the initial GEMM block size? I mean the first stage at following pic.

237757013-27708761-bb8b-458a-a3b7-0b745984fc5c

hwu36 commented 1 year ago

https://github.com/NVIDIA/cutlass/blob/main/tools/library/scripts/generator.py#L3274, use these plausible tile sizes for ampere fp64 tensor core gemms

I recommend use OpClassTensorOp explicitly

nnaron commented 1 year ago

sorry for silly question, but I am not understanding following lines from your link (What are the Warp tile shape and TensorCore instruction shape)?

What is the min_cc and max_cc?
Is [128, 128, 16] the Threadblock tile shape? then what is the Warp tile shape and instruction shape size?
3 as the second parameter is referencing to what?

  min_cc = 80
  max_cc = 1024

  alignment_constraints = [1,]

  tile_descriptions = [
    TileDescription([128, 128, 16], 3, [4, 2, 1], math_inst, min_cc, max_cc),
  ]

nnaron commented 1 year ago

In which part we are configuring the Split K (reduction across threadblocks) and Sliced K (reduction across warps)

hwu36 commented 1 year ago

128x128x16 is threadblock size 3 is stage number 128/4, 128/2, 16/1 is warp size 8x8x4 is instruction size

https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/gemm_universal.h#L199 controls splitk slices

if you want to use splicek, set warp k be smaller than threadblock k in tile size.

nnaron commented 1 year ago

splicek is misspelling or related to splitk slicesk both?

https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/gemm_universal.h#L199 controls splitk

Do we have any example in configuration or I have to compile CUTLASS again?

hwu36 commented 1 year ago

splicek is misspelling or related to splitk slicesk both?

not misspeliing. if your threadblock k is 64 and warp k is 32. you slice threadblock k into 2 halves.

Do we have any example in configuration or I have to compile CUTLASS again?

Just set the variable to the splitk number you want.

nnaron commented 1 year ago

Do we have any example to learn how to use GemmUniversal? I have no idea about replacing the previus GEMM with this one.

mnicely commented 1 year ago

https://github.com/search?q=repo%3ANVIDIA%2Fcutlass+gemmuniversal+path%3A%2F%5Eexamples%5C%2F%2F&type=code

nnaron commented 1 year ago

Thank you. In following link

https://github.com/NVIDIA/cutlass/blob/6fbc0d33800008d3180d3fefed4e1a653e5f72a0/examples/39_gemm_permute/gemm_permute.cu#LL779C4-L779C4

where is the defenition of batch_count?

nnaron commented 1 year ago

Here is an instantiation of GemmUniversal based on the: https://github.com/NVIDIA/cutlass/blob/6fbc0d33800008d3180d3fefed4e1a653e5f72a0/examples/47_ampere_gemm_universal_streamk/ampere_gemm_universal_streamk.cu#L158


using precision = double;
        constexpr int AlignmentA  = 128 / cutlass::sizeof_bits<precision>::value;
        constexpr int AlignmentB  = 128 / cutlass::sizeof_bits<precision>::value;
        constexpr int AlignmentC  = 128 / cutlass::sizeof_bits<precision>::value;

    using ArchTag             = cutlass::arch::Sm80;                      // Tag indicating the minimum SM that supports the intended feature
    using OperatorClass       = cutlass::arch::OpClassTensorOp;           // Operator class tag
    using ThreadblockShape    = cutlass::gemm::GemmShape<128, 128, 32>;   // Threadblock-level tile size (concept: GemmShape)
    using WarpShape           = cutlass::gemm::GemmShape<64, 64, 32>;     // Warp-level tile size (concept: GemmShape)
    using InstructionShape    = cutlass::gemm::GemmShape<16, 8, 16>;      // Instruction-level tile size (concept: GemmShape)
    constexpr int NumStages   = 4;

    using EpilogueOp = cutlass::epilogue::thread::LinearCombination<
            precision,               // Element type for C and D matrix operands
            AlignmentC,             // Memory access granularity of C and D matrix in units of elements
            precision,     // Element type from internal accumaccumulation
            precision>;    // Data type used to compute linear combination

    precision, cutlass::layout::ColumnMajor,
            precision,
            OperatorClass,
            ArchTag,
            ThreadblockShape,
            WarpShape,
            InstructionShape,
            EpilogueOp,
            cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
            NumStages,
            AlignmentA,
            AlignmentB>;

But I am seeing following error:

include/cutlass/transform/threadblock/predicated_tile_access_iterator.h(356): error: static assertion failed with "Vectors implied by the thread map must be divisible by the access type."
    static_assert(!(ThreadMap::kElementsPerAccess % AccessType::kElements),
    ^
          detected during:
            instantiation of class "cutlass::transform::threadblock::PredicatedTileAccessIterator<Shape_, Element_, cutlass::layout::PitchLinear, AdvanceRank, ThreadMap_, AccessType_, Gather> [with Shape_=cutlass::PitchLinearShape<32, 128>, Element_=double, AdvanceRank=0, ThreadMap_=cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<32, 128>, 128, cutlass::PitchLinearShape<16, 2>, 1>, AccessType_=cutlass::Array<double, 2, true>, Gather=false]" at line 882
            instantiation of class "cutlass::transform::threadblock::PredicatedTileAccessIterator<Shape_, Element_, cutlass::layout::RowMajor, AdvanceRank, ThreadMap_, AccessType_, Gather> [with Shape_=cutlass::MatrixShape<128, 32>, Element_=double, AdvanceRank=1, ThreadMap_=cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<32, 128>, 128, cutlass::PitchLinearShape<16, 2>, 1>, AccessType_=cutlass::Array<double, 2, true>, Gather=false]" at line 155 of /home/sw/cutlass/include/cutlass/gemm/kernel/gemm.h
            instantiation of class "cutlass::gemm::kernel::Gemm<Mma_, Epilogue_, ThreadblockSwizzle_, SplitKSerial> [with Mma_=cutlass::gemm::threadblock::MmaMultistage<cutlass::gemm::GemmShape<128, 128, 32>, cutlass::transform::threadblock::PredicatedTileAccessIterator<cutlass::MatrixShape<128, 32>, double, cutlass::layout::RowMajor, 1, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<32, 128>, 128, cutlass::PitchLinearShape<16, 2>, 1>, cutlass::Array<double, 2, true>, false>, cutlass::transform::threadblock::RegularTileAccessIterator<cutlass::MatrixShape<128, 32>, double, cutlass::layout::RowMajorTensorOpMultiplicand64bCrosswise, 1, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<32, 128>, 128, cutlass::PitchLinearShape<16, 2>, 1>, 8>, cutlass::arch::CacheOperation::Always, cutlass::transform::threadblock::PredicatedTileAccessIterator<cutlass::MatrixShape<32, 128>, double, cutlass::layout::RowMajor, 0, cutlass::transform::PitchLinearWarpStripedThreadMap<cutlass::PitchLinearShape<128, 32>, 128, cutlass::PitchLinearShape<16, 2>, 1>, cutlass::Array<double, 2, true>, false>, cutlass::transform::threadblock::RegularTileAccessIterator<cutlass::MatrixShape<32, 128>, double, cutlass::layout::RowMajorTensorOpMultiplicandCongruous64b, 0, cutlass::transform::PitchLinearWarpStripedThreadMap<cutlass::PitchLinearShape<128, 32>, 128, cutlass::PitchLinearShape<16, 2>, 1>, 8>, cutlass::arch::CacheOperation::Always, double, cutlass::layout::RowMajor, cutlass::gemm::threadblock::MmaPolicy<cutlass::gemm::warp::MmaTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, double, cutlass::layout::RowMajorTensorOpMultiplicand64bCrosswise, double, cutlass::layout::RowMajorTensorOpMultiplicandCongruous64b, double, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 16>, 32, double, cutlass::layout::RowMajor, double, cutlass::layout::ColumnMajor, double, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1>>, 1, false, bool>, cutlass::MatrixShape<0, 0>, cutlass::MatrixShape<0, 0>, 1>, 4, cutlass::gemm::SharedMemoryClearOption::kNone, bool>, Epilogue_=cutlass::epilogue::threadblock::Epilogue<cutlass::gemm::GemmShape<128, 128, 32>, cutlass::gemm::warp::MmaTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, double, cutlass::layout::RowMajorTensorOpMultiplicand64bCrosswise, double, cutlass::layout::RowMajorTensorOpMultiplicandCongruous64b, double, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 16>, 32, double, cutlass::layout::RowMajor, double, cutlass::layout::ColumnMajor, double, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1>>, 1, false, bool>, 1, cutlass::epilogue::threadblock::PredicatedTileIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<128, 8, 2, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 8, 1, 1, 8>, 128, 2, 64>, double, false, cutlass::layout::NoPermute, true>, cutlass::epilogue::warp::FragmentIteratorTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::gemm::GemmShape<16, 8, 16>, double, cutlass::Array<double, 4, true>, cutlass::layout::RowMajor>, cutlass::epilogue::warp::TileIteratorTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::gemm::GemmShape<16, 8, 16>, double, cutlass::layout::RowMajor>, cutlass::epilogue::threadblock::SharedLoadIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<128, 8, 2, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 8, 1, 1, 8>, 128, 2, 64>::CompactedThreadMap, double, 16>, cutlass::epilogue::thread::LinearCombination<double, 2, double, double, cutlass::epilogue::thread::ScaleType::Default, cutlass::FloatRoundStyle::round_to_nearest, double>, cutlass::MatrixShape<0, 4>, 1, 1>, ThreadblockSwizzle_=cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>, SplitKSerial=true]" at line 235 of /home/wp/cutlass/include/cutlass/gemm/kernel/default_gemm_universal.h
            instantiation of class "cutlass::gemm::kernel::DefaultGemmUniversal<ElementA, LayoutA, cutlass::ComplexTransform::kNone, kAlignmentA, ElementB, LayoutB, cutlass::ComplexTransform::kNone, kAlignmentB, ElementC, LayoutC, ElementAccumulator, OperatorClass, ArchTag, ThreadblockShape, WarpShape, InstructionShape, EpilogueOutputOp, ThreadblockSwizzle, Stages, Operator, SharedMemoryClear, GatherA, GatherB, ScatterD, PermuteDLayout, std::enable_if<<expression>, void>::type>::SelectBase<SwizzleT, Enable> [with ElementA=double, LayoutA=cutlass::layout::RowMajor, kAlignmentA=2, ElementB=double, LayoutB=cutlass::layout::RowMajor, kAlignmentB=2, ElementC=double, LayoutC=cutlass::layout::RowMajor, ElementAccumulator=double, OperatorClass=cutlass::arch::OpClassTensorOp, ArchTag=cutlass::arch::Sm80, ThreadblockShape=cutlass::gemm::GemmShape<128, 128, 32>, WarpShape=cutlass::gemm::GemmShape<64, 64, 32>, InstructionShape=cutlass::gemm::GemmShape<16, 8, 16>, EpilogueOutputOp=cutlass::epilogue::thread::LinearCombination<double, 2, double, double, cutlass::epilogue::thread::ScaleType::Default, cutlass::FloatRoundStyle::round_to_nearest, double>, ThreadblockSwizzle=cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>, Stages=4, Operator=cutlass::arch::OpMultiplyAdd, SharedMemoryClear=cutlass::gemm::SharedMemoryClearOption::kNone, GatherA=false, GatherB=false, ScatterD=false, PermuteDLayout=cutlass::layout::NoPermute, SwizzleT=cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>, Enable=void]" at line 68 of /home/wp/cutlass/include/cutlass/gemm/device/gemm_universal_base.h

nnaron commented 1 year ago

Finally I did it but I am not seeing the CUTLASS kernel in the trace of nsys when batch_count > 1. What is the problem?


        using CutlassGemm = cutlass::gemm::device::GemmUniversal<
                ElementA, LayoutA,
                ElementB, LayoutB,
                ElementOutput,   LayoutC,
                ElementAccumulator,
                cutlass::arch::OpClassTensorOp,
                cutlass::arch::Sm80,
                cutlass::gemm::GemmShape<64, 64, 16>,
                cutlass::gemm::GemmShape<32, 32, 16>,
                cutlass::gemm::GemmShape<8, 8, 4>,
                cutlass::epilogue::thread::LinearCombination<
                        ElementOutput,
                128 / cutlass::sizeof_bits<ElementOutput>::value,
                ElementAccumulator,
                ElementAccumulator
                        >,
                cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<8>,
                4
                        >;

        typename CutlassGemm::Arguments args{
                                cutlass::gemm::GemmUniversalMode::kGemm,  // universal mode
                                problem_size,                     // problem_size
                                2,                   // batch count / splitk slices
                                {alpha, beta},
                                A,                   // ptr_A
                                B,                   // ptr_B
                                C,                   // ptr_C
                                C,                   // ptr_D
                                int64_t(),      // batch_stride_A
                                int64_t(),      // batch_stride_B
                                int64_t(),      // batch_stride_C
                                int64_t(),      // batch_stride_D
                                int64_t(lda),              // stride_a
                                int64_t(ldb),              // stride_b
                                int64_t(ldc),              // stride_c
                                int64_t(ldc)             // stride_d
        };

hwu36 commented 1 year ago

maybe, not related.

you need to change

128 / cutlass::sizeof_bits<ElementOutput>::value to 1

fp64 gemm uses alignment 1 everywhere.

nnaron commented 1 year ago

Thanks @hwu36, but no change. Just I am seeing cudaLaunchKernel.

CUTLASS_gemm_batch

What do you think about following part? Actualy I got that config from a FP16_FP16_FP32 example.

                cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<8>,
                4

hwu36 commented 1 year ago

What do you think about following part? that is fine.

insert some printf to track what is going on

nnaron commented 1 year ago

By checking the status, I observed that the status is not Success only when the batch count is greater than 1. How can we gather more information about the failures that occur when the batch count is larger? What exactly happens with larger batch counts that leads to these failures?

      cutlass::Status status = gemm_operator(args, nullptr, stream_cut);
        // Return a cudaError_t if the CUTLASS GEMM operator returned an error code.
        if (status != cutlass::Status::kSuccess) {
               return cudaErrorUnknown;
      }

By adding following code I am not seeing any error from CUDA part, so just it is related to CUTLASS:

                 cudaError_t cudaError = cudaGetLastError();
                 const char* cudaErrorString = cudaGetErrorString(cudaError);
                 std::cout << "CUDA error: " << cudaErrorString << std::endl;

Also by followinng part:

                 const char* cutlassStatusString = cutlassGetStatusString(status);
                 std::cout << "Cutlass error: " << cutlassStatusString << std::endl;

The error is:

Cutlass error: Error Internal

Here is the code that I am using for GEMM.

#include <cutlass/gemm/device/gemm.h>

#include "cutlass/cutlass.h"
#include "cutlass/gemm/gemm.h"
#include "cutlass/gemm/kernel/gemm_grouped.h"
#include "cutlass/gemm/kernel/default_gemm_grouped.h"
#include "cutlass/gemm/device/gemm_grouped.h"
#include "cutlass/gemm/device/gemm_universal.h"

#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/device_memory.h"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/reference/host/gemm_complex.h"
#include "cutlass/util/reference/device/gemm_complex.h"
#include "cutlass/util/reference/host/tensor_compare.h"
#include "cutlass/util/reference/host/tensor_copy.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "cutlass/util/reference/host/tensor_norm.h"

/* 
 * Compute C = A*B using cutlass
 */
// Defines cutlass::gemm::device::Gemm, the generic Gemm computation template class.
cudaError_t cutlass_dgemm_nn_mod(
        cudaStream_t stream_cut,
        int M,
        int N,
        int K,
        double *ptr_alpha,
        double *A,
        int lda,
        double *B,
        int ldb,
        double *ptr_beta,
        double *C,
        int ldc) {

    double alpha = *ptr_alpha;
    double beta = *ptr_beta;

    // Define type definition for double-precision CUTLASS GEMM with column-major
    // input matrices and 128x128x8 threadblock tile size (chosen by default).
    //
    // To keep the interface manageable, several helpers are defined for plausible compositions
    // including the following example for double-precision GEMM. Typical values are used as
    // default template arguments. See `cutlass/gemm/device/default_gemm_configuration.h` for more details.
    //
    // To view the full gemm device API interface, see `cutlass/gemm/device/gemm.h`

    using precision = double;
    static int const kEpilogueElementsPerAccess = 1;
    using EpilogueOutputOp = cutlass::epilogue::thread::LinearCombination<
        precision, kEpilogueElementsPerAccess, precision, precision>;

    using CutlassGemm_1  = cutlass::gemm::device::Gemm<
        precision,                                 // ElementA
        cutlass::layout::ColumnMajor,              // LayoutA
        precision,                                 // ElementB
        cutlass::layout::ColumnMajor,              // LayoutB
        precision,                                 // ElementOutput
        cutlass::layout::ColumnMajor,              // LayoutOutput
        precision,                                 // ElementAccumulator
        cutlass::arch::OpClassTensorOp,            // tag indicating Tensor Cores
        cutlass::arch::Sm80,                       // tag indicating target GPU compute architecture
        cutlass::gemm::GemmShape<64, 64, 16>,      // Shape to optimize
        cutlass::gemm::GemmShape<32, 32, 16>,      // Shape to optimize
        cutlass::gemm::GemmShape<8, 8, 4>          // Shape to optimize
            >;

    using ElementA = double;
    using ElementB = double;
    using ElementOutput = double;
    using ElementAccumulator = double;

    using LayoutA = cutlass::layout::ColumnMajor;
    using LayoutB = cutlass::layout::ColumnMajor;
    using LayoutC = cutlass::layout::ColumnMajor;

    // Mainn Gemm operator 
    using CutlassGemm = cutlass::gemm::device::GemmUniversal<
        ElementA, LayoutA,
        ElementB, LayoutB,
        ElementOutput,   LayoutC,
        ElementAccumulator,
        cutlass::arch::OpClassTensorOp,
        cutlass::arch::Sm80,
        cutlass::gemm::GemmShape<64, 64, 16>,
        cutlass::gemm::GemmShape<32, 32, 16>,
        cutlass::gemm::GemmShape<8, 8, 4>,
        cutlass::epilogue::thread::LinearCombination<
            ElementOutput,
        1,//128 / cutlass::sizeof_bits<ElementOutput>::value,
        ElementAccumulator,
        ElementAccumulator
            >,
        cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<8>,
        4
            >;

    // Define a CUTLASS GEMM type
    CutlassGemm gemm_operator;
    cutlass::gemm::GemmCoord problem_size;

    problem_size.m()=M;
    problem_size.n()=N;
    problem_size.k()=K;

    typename CutlassGemm::Arguments args{
                cutlass::gemm::GemmUniversalMode::kGemm, // universal mode
                problem_size,                            // problem_size
                2,                                        // batch count / splitk slices
                {alpha, beta},
                A,                   // ptr_A
                B,                   // ptr_B
                C,                   // ptr_C
                C,                   // ptr_D
                int64_t(),      // batch_stride_A
                int64_t(),      // batch_stride_B
                int64_t(),      // batch_stride_C
                int64_t(),      // batch_stride_D
                int64_t(lda),              // stride_a
                int64_t(ldb),              // stride_b
                int64_t(ldc),              // stride_c
                int64_t(ldc)               // stride_d
    };

    cutlass::Status status = gemm_operator(args, nullptr, stream_cut/*nullptr*/);
    //
    // Return a cudaError_t if the CUTLASS GEMM operator returned an error code.
    //

    if (status != cutlass::Status::kSuccess) {
        printf("error in cutlass return\n");

        cudaError_t cudaError = cudaGetLastError(); 
        const char* cudaErrorString = cudaGetErrorString(cudaError);
        std::cout << "CUDA error: " << cudaErrorString << std::endl;

        return cudaErrorUnknown;
    }
    // Return success, if no errors were encountered.

    return cudaSuccess;
}

hwu36 commented 1 year ago

have you allocated memory for the workspace?

you can follow serial splitK in https://github.com/NVIDIA/cutlass/blob/main/examples/23_ampere_gemm_operand_reduction_fusion/ampere_gemm_operand_reduction_fusion.cu to see how it works

nnaron commented 1 year ago

Thanks, I did as you mention, and it seems to work.

I was expected to see many small GEMMs running, but now by passing 64 as the batch_count, just I am seeing a GEMM kernel that runs 100x slower.

So what is the advantage of batch version?

hwu36 commented 1 year ago

if you have big k, small m/n. splitk can help you to saturate the gpu. if you over split or over saturate, you won't get benefit.

github-actions[bot] commented 1 year ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

mnicely commented 1 year ago

@nnaron has your issue been resolved?

github-actions[bot] commented 1 year ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

mnicely commented 1 year ago

Closing due to inactivity.

NVIDIA / cutlass

[QST] Limiting GEMM and specifying stream #922