Closed nnaron closed 1 year ago
Is following code correct to add support for non default stream ?
#include <cutlass/gemm/device/gemm.h>
/*
* Compute C = A*B using cutlass
*/
// Defines cutlass::gemm::device::Gemm, the generic Gemm computation template class.
cudaError_t cutlass_dgemm_nn(
cudaStream_t *stream_cut,
int M,
int N,
int K,
double *ptr_alpha,
double *A,
int lda,
double *B,
int ldb,
double *ptr_beta,
double *C,
int ldc) {
double alpha = *ptr_alpha;
double beta = *ptr_beta;
// Define type definition for double-precision CUTLASS GEMM with column-major
// input matrices and 128x128x8 threadblock tile size (chosen by default).
//
// To keep the interface manageable, several helpers are defined for plausible compositions
// including the following example for double-precision GEMM. Typical values are used as
// default template arguments. See `cutlass/gemm/device/default_gemm_configuration.h` for more details.
//
// To view the full gemm device API interface, see `cutlass/gemm/device/gemm.h`
using ElementOutput = double;
using ElementAccumulator = double;
using ColumnMajor = cutlass::layout::ColumnMajor;
using RowMajor = cutlass::layout::RowMajor;
/*
using CutlassGemm = cutlass::gemm::device::Gemm<
double, // Data-type of A matrix
ColumnMajor, // Layout of A matrix
double, // Data-type of B matrix
ColumnMajor, // Layout of B matrix
double, // Data-type of C matrix
ColumnMajor>; // Layout of C matrix
*/
using CutlassGemm = cutlass::gemm::device::Gemm<
double, // Data-type of A matrix
ColumnMajor, // Layout of A matrix
double, // Data-type of B matrix
ColumnMajor, // Layout of B matrix
ElementOutput, // Data-type of C matrix
ColumnMajor, // Layout of C matrix , LayoutC = layout::ColumnMajor;
ElementAccumulator, // ElementAccumulator
cutlass::arch::OpClassTensorOp, // tag indicating Tensor Cores
cutlass::arch::Sm80, // tag indicating target GPU compute architecture
cutlass::gemm::GemmShape<64, 64, 16>, // Shape to optimize
cutlass::gemm::GemmShape<32, 32, 16>, // Shape to optimize
cutlass::gemm::GemmShape<8, 8, 4> // Shape to optimize
/*
cutlass::epilogue::thread::LinearCombination<
ElementOutput,
1,
ElementAccumulator,
ElementAccumulator
>,
cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
4*/
>;
// Define a CUTLASS GEMM type
CutlassGemm gemm_operator;
cutlass::Status status = gemm_operator(stream_cut);
// Construct the CUTLASS GEMM arguments object.
//
// One of CUTLASS's design patterns is to define gemm argument objects that are constructible
// in host code and passed to kernels by value. These may include pointers, strides, scalars,
// and other arguments needed by Gemm and its components.
//
// The benefits of this pattern are (1.) a structured, composable strategy for passing host-constructible
// arguments to kernels and (2.) minimized initialization overhead on kernel entry.
//
CutlassGemm::Arguments args({M , N, K}, // Gemm Problem dimensions
{A, lda}, // Tensor-ref for source matrix A
{B, ldb}, // Tensor-ref for source matrix B
{C, ldc}, // Tensor-ref for source matrix C
{C, ldc}, // Tensor-ref for destination matrix D (may be different memory than source C matrix)
{alpha, beta}); // Scalars used in the Epilogue
//
// Launch the CUTLASS GEMM kernel.
//
cutlass::Status status = gemm_operator(args);
//
// Return a cudaError_t if the CUTLASS GEMM operator returned an error code.
//
if (status != cutlass::Status::kSuccess) {
return cudaErrorUnknown;
}
// Return success, if no errors were encountered.
return cudaSuccess;
}
you can pass your own stream
like what you did.
Bigger tile size uses more registers and shared memory per threadblock, but less thread blocks.
Bigger tile size uses more registers and shared memory per threadblock, but less thread blocks.
So does it means that just occupancy will increase and again all registers and shared memory will be used?
bigger tile size -> more registers and shared memory usage -> smaller occupancy which means less thread blocks can run on the same SM
@nnaron is your issue resolved?
@mnicely I do not have access to the GPU for testing. Let me check it before closing.
Hi
For adding the stream I added like following line:
cutlass::Status status = gemm_operator(&stream_cut);
// Define a CUTLASS GEMM type
CutlassGemm gemm_operator;
cutlass::Status status = gemm_operator(stream_cut);
// Construct the CUTLASS GEMM arguments object.
//
// One of CUTLASS's design patterns is to define gemm argument objects that are constructible
// in host code and passed to kernels by value. These may include pointers, strides, scalars,
// and other arguments needed by Gemm and its components.
//
// The benefits of this pattern are (1.) a structured, composable strategy for passing host-constructible
// arguments to kernels and (2.) minimized initialization overhead on kernel entry.
//
CutlassGemm::Arguments args({M , N, K}, // Gemm Problem dimensions
{A, lda}, // Tensor-ref for source matrix A
{B, ldb}, // Tensor-ref for source matrix B
{C, ldc}, // Tensor-ref for source matrix C
{C, ldc}, // Tensor-ref for destination matrix D (may be different memory than source C matrix)
{alpha, beta}); // Scalars used in the Epilogue
And I am seeing this error:
error: no instance of overloaded function "cutlass::gemm::device::Gemm<ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, cutlass::layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, SplitKSerial, Operator_, GatherA, GatherB, ScatterD, PermuteDLayout>::operator() [with ElementA_=double, LayoutA_=cutlass::layout::ColumnMajor, ElementB_=double, LayoutB_=cutlass::layout::ColumnMajor, ElementC_=double, ElementAccumulator_=double, OperatorClass_=cutlass::arch::OpClassTensorOp, ArchTag_=cutlass::arch::Sm80, ThreadblockShape_=cutlass::gemm::GemmShape<64, 64, 16>, WarpShape_=cutlass::gemm::GemmShape<32, 32, 16>, InstructionShape_=cutlass::gemm::GemmShape<8, 8, 4>, EpilogueOutputOp_=cutlass::epilogue::thread::LinearCombination<double, 2, double, double, cutlass::epilogue::thread::ScaleType::Default, cutlass::FloatRoundStyle::round_to_nearest, double>, ThreadblockSwizzle_=cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>, Stages=3, AlignmentA=1, AlignmentB=1, SplitKSerial=false, Operator_=cutlass::arch::OpMultiplyAdd, GatherA=false, GatherB=false, ScatterD=false, PermuteDLayout=cutlass::layout::NoPermute]" matches the argument list
argument types are: (cudaStream_t *)
object type is: CutlassGemm
bigger tile size -> more registers and shared memory usage -> smaller occupancy which means less thread blocks can run on the same SM
Thanks. Following configuration is considered as a big tile size?
using ThreadblockShape = cutlass::gemm::GemmShape<16, 32, 8>;
using WarpShape = cutlass::gemm::GemmShape<16, 32, 8>;
static int const kEpilogueElementsPerAccess = 1;
using InstructionShape = cutlass::gemm::GemmShape<1, 1, 1>;
For my purpose which kind of shape has more impact (keep some SM free or being able to run GEMM with other small kernel)? ThreadblockShape, WarpShape or InstructionShape.
your config is pretty small.
then you will use more threadblock with less occupancy.
running multiple very small kernels in many different streams in parallel is tricky. it is possible that kernels finish before the others start and no kernel work in parallel.
I have one stream for GEMM and second stream for another kernel different than GEMM.
With playing with above configuration just the grid size and number of threads are changing, and second stream is not running anything.
What abot this config?
using ThreadblockShape = cutlass::gemm::GemmShape<64, 64, 8>;
using WarpShape = cutlass::gemm::GemmShape<16, 16, 8>;
static int const kEpilogueElementsPerAccess = 1;
using InstructionShape = cutlass::gemm::GemmShape<1, 1, 1>;
Now the occupancy is 50% (first config was 23.4%), but not advantage for running second kernel.
stream scheduling is in hardware/driver. when the scheduler seems it is not possible or not necessary or not beneficial to run two streams in parallel, the streams are not going to run in parallel.
using streams is not always faster especially when kernels are too big or too small.
tile sizes can only control the number of threadblocks and occupancy. the problem size will also impact the runtime of the kernel. if the runtime is too short, nothing will run in parallel.
Thanks for support. Could you please give me a hint to understand how configuration is translated to grid and block size that I am seeing in nsys trace?
Also why the occupancy for last two config is equal?
using ThreadblockShape = cutlass::gemm::GemmShape<16, 32, 8>;
using WarpShape = cutlass::gemm::GemmShape<16, 32, 8>;
static int const kEpilogueElementsPerAccess = 1;
using InstructionShape = cutlass::gemm::GemmShape<1, 1, 1>;
using ThreadblockShape = cutlass::gemm::GemmShape<64, 64, 8>;
using WarpShape = cutlass::gemm::GemmShape<16, 16, 8>;
static int const kEpilogueElementsPerAccess = 1;
using InstructionShape = cutlass::gemm::GemmShape<1, 1, 1>;
using ThreadblockShape = cutlass::gemm::GemmShape<128, 32, 8>;
using WarpShape = cutlass::gemm::GemmShape<64, 16, 8>;
static int const kEpilogueElementsPerAccess = 1;
using InstructionShape = cutlass::gemm::GemmShape<1, 1, 1>;
the nsys of the last two show the same number of threads, same shared memory usage, same register usage. so the occupancy are the same.
block size: <threadblock_x / warp_x, threadblock_y / warp_y, threadblock_z / warp_z> grid size: <problem_x / threadblock_x, problem_y / threadblock_y, problem_z / threadblock_z>
So for folowing config, the block size should be <1,1,1>?
16/16=1 32/32=1 8/8=1
using ThreadblockShape = cutlass::gemm::GemmShape<16, 32, 8>;
using WarpShape = cutlass::gemm::GemmShape<16, 32, 8>;
I am seeing <32,1,1> in the trace.
the nsys of the last two show the same number of threads, same shared memory usage, same register usage. so the occupancy are the same.
But the config is different, why block size is equal?
@hwu36 Thank you for your patience and answering my questions.
When you are talking about the tile, which tile do you mean? Thread Block Tile, Warp Tile, or Thread Tile?
I did not find GemmKernel::kThreadCount
source code. In which part of the repository it is?
What is the idea behind keeping the second and third dimension of block size equal to 1?
If CUTLASS by itself is changing the grid and block size, why do we are writing configuration?
The configuration that we are writing is referring to which part of this path?
In all example from test/unit/gemm/device/simt_dgemm_tt_sm50.cu The following part of the configuration is the same and 1 is repeating. What does it mean exactly?
static int const kEpilogueElementsPerAccess = 1;
using InstructionShape = cutlass::gemm::GemmShape<1, 1, 1>;
When you are talking about the tile, which tile do you mean? Thread Block Tile, Warp Tile, or Thread Tile?
I mean all of them.
I did not find GemmKernel::kThreadCountsource code. In which part of the repository it is?
grep is useful in reading the code.
What is the idea behind keeping the second and third dimension of block size equal to 1?
It means we just need one warp or threadblock to handle N or K dimension data.
If CUTLASS by itself is changing the grid and block size, why do we are writing configuration?
grid size is decided by problem size and threadblock tile size. block size is decided by threadblock tile size and warp tile size. Tile size impacts the performance.
The configuration that we are writing is referring to which part of this path?
Those inside the green box
The following part of the configuration is the same and 1 is repeating. What does it mean exactly?
using InstructionShape = cutlass::gemm::GemmShape<1, 1, 1>;
means every thread uses every instruction to handle 1x1x1 data each time
kEpilogueElementsPerAccess = 1
means every thread outputs one element in every store during the epilogue.
Our first two GTC talks several years ago discussed these concepts in detail.
As I understand memory resources (Shared Memory and Registers) are limiting the GEMM. So if occupancy is 12.5% then all shared memory and registers are under utilization by 12.5% of warps. During that time what is happening for the rest 87.5% of warps? Are they completely idle?
Yes, they will be idle. To better understand what is going on at that detail. You can use Nsight Compute. https://developer.nvidia.com/nsight-compute
There are many resources such as videos and blogs to help you better understand resource utilization
HW can schedule multiple threadblocks on the same SM if the scheduler thinks it is good.
Thanks Matthew, but I do not have access Nsight Compute.
I have a question about writing configuration. Is it based on the shared memory or Max thread block size (1024). I found several in examples related to A100, but the last two of them successfully compiled without properly executing the GEMM operation.
// Threadblock tile shape // Warp tile shape // TensorCore instruction shape
C1:🟢
cutlass::gemm::GemmShape<128, 128, 16>, // Shape to optimize
cutlass::gemm::GemmShape<32, 64, 16>, // Shape to optimize
cutlass::gemm::GemmShape<8, 8, 4> // Shape to optimize
C2: 🟢
cutlass::gemm::GemmShape<128, 256, 128>, // Shape to optimize
cutlass::gemm::GemmShape<64, 64, 128>, // Shape to optimize
cutlass::gemm::GemmShape<16, 8, 4> // Shape to optimize
C3: 🔴
cutlass::gemm::GemmShape<64, 64, 16>, // Shape to optimize
cutlass::gemm::GemmShape<32, 32, 16>, // Shape to optimize
cutlass::gemm::GemmShape<8, 8, 4> // Shape to optimize
C4: 🔴
cutlass::gemm::GemmShape<128, 256, 64>, // Shape to optimize
cutlass::gemm::GemmShape<64, 64, 64>, // Shape to optimize
cutlass::gemm::GemmShape<16, 8, 16> // Shape to optimize
Also if I mention following line the FP64 GEMM will be run by CUDA Core for sure?
// cutlass::arch::OpClassSimt, // tag indicating Cuda Cores
Last question. Is it possible to change the size of the initial GEMM block size? I mean the first stage at following pic.
https://github.com/NVIDIA/cutlass/blob/main/tools/library/scripts/generator.py#L3274, use these plausible tile sizes for ampere fp64 tensor core gemms
I recommend use OpClassTensorOp
explicitly
sorry for silly question, but I am not understanding following lines from your link (What are the Warp tile shape and TensorCore instruction shape)?
min_cc = 80
max_cc = 1024
alignment_constraints = [1,]
tile_descriptions = [
TileDescription([128, 128, 16], 3, [4, 2, 1], math_inst, min_cc, max_cc),
]
In which part we are configuring the Split K (reduction across threadblocks) and Sliced K (reduction across warps)
128x128x16 is threadblock size 3 is stage number 128/4, 128/2, 16/1 is warp size 8x8x4 is instruction size
https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/gemm_universal.h#L199 controls splitk slices
if you want to use splicek, set warp k be smaller than threadblock k in tile size.
splicek is misspelling or related to splitk slicesk both?
https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/gemm_universal.h#L199 controls splitk
Do we have any example in configuration or I have to compile CUTLASS again?
splicek is misspelling or related to splitk slicesk both?
not misspeliing. if your threadblock k is 64 and warp k is 32. you slice threadblock k into 2 halves.
Do we have any example in configuration or I have to compile CUTLASS again?
Just set the variable to the splitk number you want.
Do we have any example to learn how to use GemmUniversal
? I have no idea about replacing the previus GEMM with this one.
Thank you. In following link
where is the defenition of batch_count?
Here is an instantiation of GemmUniversal
based on the: https://github.com/NVIDIA/cutlass/blob/6fbc0d33800008d3180d3fefed4e1a653e5f72a0/examples/47_ampere_gemm_universal_streamk/ampere_gemm_universal_streamk.cu#L158
using precision = double;
constexpr int AlignmentA = 128 / cutlass::sizeof_bits<precision>::value;
constexpr int AlignmentB = 128 / cutlass::sizeof_bits<precision>::value;
constexpr int AlignmentC = 128 / cutlass::sizeof_bits<precision>::value;
using ArchTag = cutlass::arch::Sm80; // Tag indicating the minimum SM that supports the intended feature
using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag
using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 32>; // Threadblock-level tile size (concept: GemmShape)
using WarpShape = cutlass::gemm::GemmShape<64, 64, 32>; // Warp-level tile size (concept: GemmShape)
using InstructionShape = cutlass::gemm::GemmShape<16, 8, 16>; // Instruction-level tile size (concept: GemmShape)
constexpr int NumStages = 4;
using EpilogueOp = cutlass::epilogue::thread::LinearCombination<
precision, // Element type for C and D matrix operands
AlignmentC, // Memory access granularity of C and D matrix in units of elements
precision, // Element type from internal accumaccumulation
precision>; // Data type used to compute linear combination
precision, cutlass::layout::ColumnMajor,
precision,
OperatorClass,
ArchTag,
ThreadblockShape,
WarpShape,
InstructionShape,
EpilogueOp,
cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
NumStages,
AlignmentA,
AlignmentB>;
But I am seeing following error:
include/cutlass/transform/threadblock/predicated_tile_access_iterator.h(356): error: static assertion failed with "Vectors implied by the thread map must be divisible by the access type."
static_assert(!(ThreadMap::kElementsPerAccess % AccessType::kElements),
^
detected during:
instantiation of class "cutlass::transform::threadblock::PredicatedTileAccessIterator<Shape_, Element_, cutlass::layout::PitchLinear, AdvanceRank, ThreadMap_, AccessType_, Gather> [with Shape_=cutlass::PitchLinearShape<32, 128>, Element_=double, AdvanceRank=0, ThreadMap_=cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<32, 128>, 128, cutlass::PitchLinearShape<16, 2>, 1>, AccessType_=cutlass::Array<double, 2, true>, Gather=false]" at line 882
instantiation of class "cutlass::transform::threadblock::PredicatedTileAccessIterator<Shape_, Element_, cutlass::layout::RowMajor, AdvanceRank, ThreadMap_, AccessType_, Gather> [with Shape_=cutlass::MatrixShape<128, 32>, Element_=double, AdvanceRank=1, ThreadMap_=cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<32, 128>, 128, cutlass::PitchLinearShape<16, 2>, 1>, AccessType_=cutlass::Array<double, 2, true>, Gather=false]" at line 155 of /home/sw/cutlass/include/cutlass/gemm/kernel/gemm.h
instantiation of class "cutlass::gemm::kernel::Gemm<Mma_, Epilogue_, ThreadblockSwizzle_, SplitKSerial> [with Mma_=cutlass::gemm::threadblock::MmaMultistage<cutlass::gemm::GemmShape<128, 128, 32>, cutlass::transform::threadblock::PredicatedTileAccessIterator<cutlass::MatrixShape<128, 32>, double, cutlass::layout::RowMajor, 1, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<32, 128>, 128, cutlass::PitchLinearShape<16, 2>, 1>, cutlass::Array<double, 2, true>, false>, cutlass::transform::threadblock::RegularTileAccessIterator<cutlass::MatrixShape<128, 32>, double, cutlass::layout::RowMajorTensorOpMultiplicand64bCrosswise, 1, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<32, 128>, 128, cutlass::PitchLinearShape<16, 2>, 1>, 8>, cutlass::arch::CacheOperation::Always, cutlass::transform::threadblock::PredicatedTileAccessIterator<cutlass::MatrixShape<32, 128>, double, cutlass::layout::RowMajor, 0, cutlass::transform::PitchLinearWarpStripedThreadMap<cutlass::PitchLinearShape<128, 32>, 128, cutlass::PitchLinearShape<16, 2>, 1>, cutlass::Array<double, 2, true>, false>, cutlass::transform::threadblock::RegularTileAccessIterator<cutlass::MatrixShape<32, 128>, double, cutlass::layout::RowMajorTensorOpMultiplicandCongruous64b, 0, cutlass::transform::PitchLinearWarpStripedThreadMap<cutlass::PitchLinearShape<128, 32>, 128, cutlass::PitchLinearShape<16, 2>, 1>, 8>, cutlass::arch::CacheOperation::Always, double, cutlass::layout::RowMajor, cutlass::gemm::threadblock::MmaPolicy<cutlass::gemm::warp::MmaTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, double, cutlass::layout::RowMajorTensorOpMultiplicand64bCrosswise, double, cutlass::layout::RowMajorTensorOpMultiplicandCongruous64b, double, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 16>, 32, double, cutlass::layout::RowMajor, double, cutlass::layout::ColumnMajor, double, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1>>, 1, false, bool>, cutlass::MatrixShape<0, 0>, cutlass::MatrixShape<0, 0>, 1>, 4, cutlass::gemm::SharedMemoryClearOption::kNone, bool>, Epilogue_=cutlass::epilogue::threadblock::Epilogue<cutlass::gemm::GemmShape<128, 128, 32>, cutlass::gemm::warp::MmaTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, double, cutlass::layout::RowMajorTensorOpMultiplicand64bCrosswise, double, cutlass::layout::RowMajorTensorOpMultiplicandCongruous64b, double, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 16>, 32, double, cutlass::layout::RowMajor, double, cutlass::layout::ColumnMajor, double, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1>>, 1, false, bool>, 1, cutlass::epilogue::threadblock::PredicatedTileIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<128, 8, 2, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 8, 1, 1, 8>, 128, 2, 64>, double, false, cutlass::layout::NoPermute, true>, cutlass::epilogue::warp::FragmentIteratorTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::gemm::GemmShape<16, 8, 16>, double, cutlass::Array<double, 4, true>, cutlass::layout::RowMajor>, cutlass::epilogue::warp::TileIteratorTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::gemm::GemmShape<16, 8, 16>, double, cutlass::layout::RowMajor>, cutlass::epilogue::threadblock::SharedLoadIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<128, 8, 2, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 8, 1, 1, 8>, 128, 2, 64>::CompactedThreadMap, double, 16>, cutlass::epilogue::thread::LinearCombination<double, 2, double, double, cutlass::epilogue::thread::ScaleType::Default, cutlass::FloatRoundStyle::round_to_nearest, double>, cutlass::MatrixShape<0, 4>, 1, 1>, ThreadblockSwizzle_=cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>, SplitKSerial=true]" at line 235 of /home/wp/cutlass/include/cutlass/gemm/kernel/default_gemm_universal.h
instantiation of class "cutlass::gemm::kernel::DefaultGemmUniversal<ElementA, LayoutA, cutlass::ComplexTransform::kNone, kAlignmentA, ElementB, LayoutB, cutlass::ComplexTransform::kNone, kAlignmentB, ElementC, LayoutC, ElementAccumulator, OperatorClass, ArchTag, ThreadblockShape, WarpShape, InstructionShape, EpilogueOutputOp, ThreadblockSwizzle, Stages, Operator, SharedMemoryClear, GatherA, GatherB, ScatterD, PermuteDLayout, std::enable_if<<expression>, void>::type>::SelectBase<SwizzleT, Enable> [with ElementA=double, LayoutA=cutlass::layout::RowMajor, kAlignmentA=2, ElementB=double, LayoutB=cutlass::layout::RowMajor, kAlignmentB=2, ElementC=double, LayoutC=cutlass::layout::RowMajor, ElementAccumulator=double, OperatorClass=cutlass::arch::OpClassTensorOp, ArchTag=cutlass::arch::Sm80, ThreadblockShape=cutlass::gemm::GemmShape<128, 128, 32>, WarpShape=cutlass::gemm::GemmShape<64, 64, 32>, InstructionShape=cutlass::gemm::GemmShape<16, 8, 16>, EpilogueOutputOp=cutlass::epilogue::thread::LinearCombination<double, 2, double, double, cutlass::epilogue::thread::ScaleType::Default, cutlass::FloatRoundStyle::round_to_nearest, double>, ThreadblockSwizzle=cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>, Stages=4, Operator=cutlass::arch::OpMultiplyAdd, SharedMemoryClear=cutlass::gemm::SharedMemoryClearOption::kNone, GatherA=false, GatherB=false, ScatterD=false, PermuteDLayout=cutlass::layout::NoPermute, SwizzleT=cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>, Enable=void]" at line 68 of /home/wp/cutlass/include/cutlass/gemm/device/gemm_universal_base.h
Finally I did it but I am not seeing the CUTLASS kernel in the trace of nsys when batch_count > 1. What is the problem?
using CutlassGemm = cutlass::gemm::device::GemmUniversal<
ElementA, LayoutA,
ElementB, LayoutB,
ElementOutput, LayoutC,
ElementAccumulator,
cutlass::arch::OpClassTensorOp,
cutlass::arch::Sm80,
cutlass::gemm::GemmShape<64, 64, 16>,
cutlass::gemm::GemmShape<32, 32, 16>,
cutlass::gemm::GemmShape<8, 8, 4>,
cutlass::epilogue::thread::LinearCombination<
ElementOutput,
128 / cutlass::sizeof_bits<ElementOutput>::value,
ElementAccumulator,
ElementAccumulator
>,
cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<8>,
4
>;
typename CutlassGemm::Arguments args{
cutlass::gemm::GemmUniversalMode::kGemm, // universal mode
problem_size, // problem_size
2, // batch count / splitk slices
{alpha, beta},
A, // ptr_A
B, // ptr_B
C, // ptr_C
C, // ptr_D
int64_t(), // batch_stride_A
int64_t(), // batch_stride_B
int64_t(), // batch_stride_C
int64_t(), // batch_stride_D
int64_t(lda), // stride_a
int64_t(ldb), // stride_b
int64_t(ldc), // stride_c
int64_t(ldc) // stride_d
};
maybe, not related.
you need to change
128 / cutlass::sizeof_bits<ElementOutput>::value
to 1
fp64 gemm uses alignment 1 everywhere.
Thanks @hwu36, but no change. Just I am seeing cudaLaunchKernel.
What do you think about following part? Actualy I got that config from a FP16_FP16_FP32 example.
cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<8>,
4
What do you think about following part? that is fine.
insert some printf to track what is going on
By checking the status, I observed that the status is not Success only when the batch count is greater than 1. How can we gather more information about the failures that occur when the batch count is larger? What exactly happens with larger batch counts that leads to these failures?
cutlass::Status status = gemm_operator(args, nullptr, stream_cut);
// Return a cudaError_t if the CUTLASS GEMM operator returned an error code.
if (status != cutlass::Status::kSuccess) {
return cudaErrorUnknown;
}
By adding following code I am not seeing any error from CUDA part, so just it is related to CUTLASS:
cudaError_t cudaError = cudaGetLastError();
const char* cudaErrorString = cudaGetErrorString(cudaError);
std::cout << "CUDA error: " << cudaErrorString << std::endl;
Also by followinng part:
const char* cutlassStatusString = cutlassGetStatusString(status);
std::cout << "Cutlass error: " << cutlassStatusString << std::endl;
The error is:
Cutlass error: Error Internal
Here is the code that I am using for GEMM.
#include <cutlass/gemm/device/gemm.h>
#include "cutlass/cutlass.h"
#include "cutlass/gemm/gemm.h"
#include "cutlass/gemm/kernel/gemm_grouped.h"
#include "cutlass/gemm/kernel/default_gemm_grouped.h"
#include "cutlass/gemm/device/gemm_grouped.h"
#include "cutlass/gemm/device/gemm_universal.h"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/device_memory.h"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/reference/host/gemm_complex.h"
#include "cutlass/util/reference/device/gemm_complex.h"
#include "cutlass/util/reference/host/tensor_compare.h"
#include "cutlass/util/reference/host/tensor_copy.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "cutlass/util/reference/host/tensor_norm.h"
/*
* Compute C = A*B using cutlass
*/
// Defines cutlass::gemm::device::Gemm, the generic Gemm computation template class.
cudaError_t cutlass_dgemm_nn_mod(
cudaStream_t stream_cut,
int M,
int N,
int K,
double *ptr_alpha,
double *A,
int lda,
double *B,
int ldb,
double *ptr_beta,
double *C,
int ldc) {
double alpha = *ptr_alpha;
double beta = *ptr_beta;
// Define type definition for double-precision CUTLASS GEMM with column-major
// input matrices and 128x128x8 threadblock tile size (chosen by default).
//
// To keep the interface manageable, several helpers are defined for plausible compositions
// including the following example for double-precision GEMM. Typical values are used as
// default template arguments. See `cutlass/gemm/device/default_gemm_configuration.h` for more details.
//
// To view the full gemm device API interface, see `cutlass/gemm/device/gemm.h`
using precision = double;
static int const kEpilogueElementsPerAccess = 1;
using EpilogueOutputOp = cutlass::epilogue::thread::LinearCombination<
precision, kEpilogueElementsPerAccess, precision, precision>;
using CutlassGemm_1 = cutlass::gemm::device::Gemm<
precision, // ElementA
cutlass::layout::ColumnMajor, // LayoutA
precision, // ElementB
cutlass::layout::ColumnMajor, // LayoutB
precision, // ElementOutput
cutlass::layout::ColumnMajor, // LayoutOutput
precision, // ElementAccumulator
cutlass::arch::OpClassTensorOp, // tag indicating Tensor Cores
cutlass::arch::Sm80, // tag indicating target GPU compute architecture
cutlass::gemm::GemmShape<64, 64, 16>, // Shape to optimize
cutlass::gemm::GemmShape<32, 32, 16>, // Shape to optimize
cutlass::gemm::GemmShape<8, 8, 4> // Shape to optimize
>;
using ElementA = double;
using ElementB = double;
using ElementOutput = double;
using ElementAccumulator = double;
using LayoutA = cutlass::layout::ColumnMajor;
using LayoutB = cutlass::layout::ColumnMajor;
using LayoutC = cutlass::layout::ColumnMajor;
// Mainn Gemm operator
using CutlassGemm = cutlass::gemm::device::GemmUniversal<
ElementA, LayoutA,
ElementB, LayoutB,
ElementOutput, LayoutC,
ElementAccumulator,
cutlass::arch::OpClassTensorOp,
cutlass::arch::Sm80,
cutlass::gemm::GemmShape<64, 64, 16>,
cutlass::gemm::GemmShape<32, 32, 16>,
cutlass::gemm::GemmShape<8, 8, 4>,
cutlass::epilogue::thread::LinearCombination<
ElementOutput,
1,//128 / cutlass::sizeof_bits<ElementOutput>::value,
ElementAccumulator,
ElementAccumulator
>,
cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<8>,
4
>;
// Define a CUTLASS GEMM type
CutlassGemm gemm_operator;
cutlass::gemm::GemmCoord problem_size;
problem_size.m()=M;
problem_size.n()=N;
problem_size.k()=K;
typename CutlassGemm::Arguments args{
cutlass::gemm::GemmUniversalMode::kGemm, // universal mode
problem_size, // problem_size
2, // batch count / splitk slices
{alpha, beta},
A, // ptr_A
B, // ptr_B
C, // ptr_C
C, // ptr_D
int64_t(), // batch_stride_A
int64_t(), // batch_stride_B
int64_t(), // batch_stride_C
int64_t(), // batch_stride_D
int64_t(lda), // stride_a
int64_t(ldb), // stride_b
int64_t(ldc), // stride_c
int64_t(ldc) // stride_d
};
cutlass::Status status = gemm_operator(args, nullptr, stream_cut/*nullptr*/);
//
// Return a cudaError_t if the CUTLASS GEMM operator returned an error code.
//
if (status != cutlass::Status::kSuccess) {
printf("error in cutlass return\n");
cudaError_t cudaError = cudaGetLastError();
const char* cudaErrorString = cudaGetErrorString(cudaError);
std::cout << "CUDA error: " << cudaErrorString << std::endl;
return cudaErrorUnknown;
}
// Return success, if no errors were encountered.
return cudaSuccess;
}
have you allocated memory for the workspace?
you can follow serial splitK in https://github.com/NVIDIA/cutlass/blob/main/examples/23_ampere_gemm_operand_reduction_fusion/ampere_gemm_operand_reduction_fusion.cu to see how it works
Thanks, I did as you mention, and it seems to work.
I was expected to see many small GEMMs running, but now by passing 64 as the batch_count, just I am seeing a GEMM kernel that runs 100x slower.
So what is the advantage of batch version?
if you have big k, small m/n. splitk can help you to saturate the gpu. if you over split or over saturate, you won't get benefit.
This issue has been labeled inactive-30d
due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d
if there is no activity in the next 60 days.
@nnaron has your issue been resolved?
This issue has been labeled inactive-30d
due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d
if there is no activity in the next 60 days.
Closing due to inactivity.
I have the following code for GEMM on A100 and I need to add the following feature to the code
1. adding a stream instead of the default stream Is this correct?
cutlass::Status status = gemm_operator(stream);
2. Changing the resource utilization to make free a part of the GPU. I mean limiting the GEMM kernel to have some free register, shared memory, CUDA core, or SM.
Dose
cutlass::gemm::GemmShape
help?