fp64 tensor cores require Ampere (sm80) at least
Thank you. I need a simple GEMM for Volta with the possibility to specify the stream. Which CUTLASS example would be a good one to adapt?
Fp16
Sorry, a DGEMM (FP64); just using CUDA cores is fine.
Then just use the fp64 SIMT kernel, like the ones in test/unit/gemm/device/simt_dgemm_tt_sm50.cu
I have updated CUTLASS to the latest version and now I am not able to compile the first code again.
nvcc -c cutlass_gemm.o cutlass_gemm.cu -I ~/cutlass_0423/include/ -I ~/cutlass_0423/tools/util/include -L ~/cutlass_0423/build/tools/library -lcutlass -arch=sm_70 -gencode=arch=compute_70,code=sm_70
/home/hsr/cutlass_0423/include/cute/util/type_traits.hpp(43): error: namespace "std" has no member "conjunction"
/home/hsr/cutlass_0423/include/cute/util/type_traits.hpp(44): error: namespace "std" has no member "conjunction_v"
/home/hsr/cutlass_0423/include/cute/util/type_traits.hpp(46): error: namespace "std" has no member "disjunction"
/home/hsr/cutlass_0423/include/cute/util/type_traits.hpp(47): error: namespace "std" has no member "disjunction_v"
/home/hsr/cutlass_0423/include/cute/util/type_traits.hpp(49): error: namespace "std" has no member "negation"
/home/hsr/cutlass_0423/include/cute/util/type_traits.hpp(50): error: namespace "std" has no member "negation_v"
/home/hsr/cutlass_0423/include/cute/numeric/integral_constant.hpp(78): error: "auto" is not allowed here
/home/hsr/cutlass_0423/include/cute/numeric/integral_constant.hpp(80): error: "auto" is not allowed here
/home/hsr/cutlass_0423/include/cute/numeric/integral_constant.hpp(80): warning: constant "n" is not used in or cannot be deduced from the template argument list of class template "cute::is_constant<<error-constant>, cute::constant<T, >>"
About simt_dgemm_tt_sm50.cu:
How can I modify it to have an API like the cuBLAS GEMM? Something like the following:
cutlass_dgemm_nn(int M, int N, int K, double *ptr_alpha, double *A, int lda, double *B, int ldb, double *ptr_beta, double *C, int ldc)
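For reference, a rough sketch of such a wrapper (untested; the DGemm alias, the default tile shapes, and the assumption that A/B/C are device pointers while alpha/beta are host pointers are all illustrative choices, not part of the thread):

#include "cutlass/gemm/device/gemm.h"

// Column-major, non-transposed ("nn"), double-precision GEMM on CUDA cores
// (SIMT), so it can run on Volta. Tile shapes are left to the CUTLASS defaults.
using ColumnMajor = cutlass::layout::ColumnMajor;
using DGemm = cutlass::gemm::device::Gemm<
    double, ColumnMajor,           // A
    double, ColumnMajor,           // B
    double, ColumnMajor,           // C
    double,                        // accumulator
    cutlass::arch::OpClassSimt,    // CUDA cores, not Tensor Cores
    cutlass::arch::Sm50>;          // arch tag used by the SIMT kernels

cutlass::Status cutlass_dgemm_nn(int M, int N, int K,
                                 double const *ptr_alpha,
                                 double const *A, int lda,
                                 double const *B, int ldb,
                                 double const *ptr_beta,
                                 double *C, int ldc,
                                 cudaStream_t stream = nullptr) {
  DGemm gemm_op;
  DGemm::Arguments args({M, N, K},
                        {A, lda}, {B, ldb},
                        {C, ldc}, {C, ldc},        // C is both source and destination
                        {*ptr_alpha, *ptr_beta});  // epilogue scalars: alpha, beta
  // The third argument of operator() is the CUDA stream to launch on.
  return gemm_op(args, /*workspace=*/nullptr, stream);
}

A call would then mirror cublasDgemm, e.g. cutlass_dgemm_nn(M, N, K, &alpha, dA, M, dB, K, &beta, dC, M, stream);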
If that is your full NVCC command line, can you add --std=c++17 to it? CUTLASS 3.0 requires C++17 as a minimum.
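That is, the same command with the flag added, for example:

nvcc -c cutlass_gemm.o cutlass_gemm.cu --std=c++17 -I ~/cutlass_0423/include/ -I ~/cutlass_0423/tools/util/include -L ~/cutlass_0423/build/tools/library -lcutlass -arch=sm_70 -gencode=arch=compute_70,code=sm_70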
Thank you. Now the code compiles, but how is it possible to convert that code to the Volta architecture?
I thought that maybe by commenting out cutlass::arch::OpClassTensorOp and changing cutlass::arch::Sm80 to cutlass::arch::Sm70 it would be possible, but it does not compile anymore.
using CutlassGemm = cutlass::gemm::device::Gemm<
    double,                                // Data-type of A matrix
    ColumnMajor,                           // Layout of A matrix
    double,                                // Data-type of B matrix
    ColumnMajor,                           // Layout of B matrix
    ElementOutput,                         // Data-type of C matrix
    ColumnMajor,                           // Layout of C matrix
    ElementAccumulator,                    // Data-type of the accumulator
    // cutlass::arch::OpClassTensorOp,     // tag indicating Tensor Cores
    cutlass::arch::Sm70,                   // tag indicating target GPU compute architecture
    cutlass::gemm::GemmShape<64, 64, 16>,  // Threadblock tile shape
    cutlass::gemm::GemmShape<32, 32, 16>,  // Warp tile shape
    cutlass::gemm::GemmShape<8, 8, 4>>;    // Instruction shape
Please see the instantiations in test/unit/gemm/device/simt_dgemm_tn_sm50.cu for how to instantiate a Volta SIMT DGEMM. TL;DR: set the arch tag to Sm50 instead of Sm70, and the op class to OpClassSimt.
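Applied to the CutlassGemm typedef above, a rough sketch (untested; the tile shapes are borrowed from the SM50 SIMT test quoted below rather than taken from the thread) would be:

using CutlassGemm = cutlass::gemm::device::Gemm<
    double, ColumnMajor,                   // A
    double, ColumnMajor,                   // B
    ElementOutput, ColumnMajor,            // C
    ElementAccumulator,                    // accumulator
    cutlass::arch::OpClassSimt,            // CUDA cores (SIMT), not Tensor Cores
    cutlass::arch::Sm50,                   // arch tag used by the SIMT double kernels
    cutlass::gemm::GemmShape<128, 32, 8>,  // Threadblock tile
    cutlass::gemm::GemmShape<32, 16, 8>,   // Warp tile
    cutlass::gemm::GemmShape<1, 1, 1>>;    // Instruction shape (scalar FMA)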
Sorry for asking, but I have difficulty understanding that code.
test/unit/gemm/device/simt_dgemm_tn_sm50.cu
For example, the following part is a macro. Is that correct?
////////////////////////////////////////////////////////////////////////////////
// Elements / Thread: 4 x 4
// Threads / Warp: 8 x 4
// Warps / Block: 4 x 2
// Threadblock: 128 x 32 x 8
CUTLASS_TEST_L2(SM50_device_dgemm_tt, 128x32x8_32x16x1_4x4_8x4_4x2, {
  using precision = double;
  using ThreadblockShape = cutlass::gemm::GemmShape<128, 32, 8>;
  using WarpShape = cutlass::gemm::GemmShape<32, 16, 8>;
  static int const kEpilogueElementsPerAccess = 1;
  using InstructionShape = cutlass::gemm::GemmShape<1, 1, 1>;
  using EpilogueOutputOp = cutlass::epilogue::thread::LinearCombination<
      precision, kEpilogueElementsPerAccess, precision, precision>;
  using Gemm = cutlass::gemm::device::Gemm<
      precision, cutlass::layout::RowMajor,
      precision, cutlass::layout::RowMajor,
      precision, cutlass::layout::RowMajor,
      precision,
      cutlass::arch::OpClassSimt,
      cutlass::arch::Sm50,
      ThreadblockShape, WarpShape, InstructionShape,
      EpilogueOutputOp,
      cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
      2 // Stages
  >;
  EXPECT_TRUE(test::gemm::device::TestAllGemm<Gemm>());
} )
And is this the line that executes it? So where is the input data?
EXPECT_TRUE(test::gemm::device::TestAllGemm<Gemm>());
grep TestAllGemm from that directory and you can find the testbed.
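For what it's worth, the testbed allocates and randomly fills the A/B/C tensors itself and verifies the result against a host reference GEMM. A rough sketch of supplying your own data instead, using the CUTLASS utility headers (untested; the Gemm typedef repeats the one from the test above but leaves the tile shapes at their defaults, and run_my_dgemm is a hypothetical name):

#include "cutlass/gemm/device/gemm.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/reference/host/tensor_fill.h"

// Row-major SIMT double GEMM, as in the test above but with default tile shapes.
using Gemm = cutlass::gemm::device::Gemm<
    double, cutlass::layout::RowMajor,
    double, cutlass::layout::RowMajor,
    double, cutlass::layout::RowMajor,
    double,
    cutlass::arch::OpClassSimt,
    cutlass::arch::Sm50>;

void run_my_dgemm(int M, int N, int K) {
  // Allocate host + device storage for each operand.
  cutlass::HostTensor<double, cutlass::layout::RowMajor> A({M, K});
  cutlass::HostTensor<double, cutlass::layout::RowMajor> B({K, N});
  cutlass::HostTensor<double, cutlass::layout::RowMajor> C({M, N});

  // Fill the inputs on the host (random here; this is where your own data
  // would go) and copy them to the device.
  cutlass::reference::host::TensorFillRandomUniform(A.host_view(), /*seed=*/2023, 4.0, -4.0);
  cutlass::reference::host::TensorFillRandomUniform(B.host_view(), /*seed=*/2024, 4.0, -4.0);
  A.sync_device();
  B.sync_device();
  C.sync_device();

  // Run C = 1.0 * A * B + 0.0 * C.
  Gemm gemm_op;
  Gemm::Arguments args({M, N, K},
                       A.device_ref(), B.device_ref(),
                       C.device_ref(), C.device_ref(),
                       {1.0, 0.0});
  gemm_op(args);

  C.sync_host();  // copy the result back to the host if needed
}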
Hello!
I am seeing the following error (repeated over many lines) when using a function that is built on top of CUTLASS:
cutlass/include/cutlass/arch/mma_sm80.h:516: void cutlass::arch::Mma<cutlass::gemm::GemmShape<8, 8, 4>, 32, double, cutlass::layout::RowMajor, double, cutlass::layout::ColumnMajor, double, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>::operator()(cutlass::Array<double, 2, true> &, const cutlass::Array<double, 1, true> &, const cutlass::Array<double, 1, true> &, const cutlass::Array<double, 2, true> &) const: block: [1,1,0], thread: [96,0,0] Assertion
0 && __PRETTY_FUNCTION__ failed.
The compilation command used to produce the object file for the V100s was:
nvcc -c cutlass_gemm.o cutlass_gemm.cu -I ~/cutlass/include/ -I ~/cutlass/tools/util/include -L ~/cutlass/build/tools/library -lcutlass -arch=sm_70 -gencode=arch=compute_70,code=sm_70
I replaced:
With:
cutlass_dgemm_nn( n-(j+nb), m-j, nb, &c_neg_one, dAT(j-nb, j+nb), lddat, dAT(j, j-nb), lddat, &c_one, dAT(j, j+nb), lddat);
This is my first time embedding CUTLASS in an application; could you please give a hint about this problem?
The GEMM function is: