NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

Segmentation fault (core dumped) by embedding CUTLASS in MAGMA #913

Closed · nnaron closed this 1 year ago

nnaron commented 1 year ago

Hello!

I am seeing the following error (repeated over many lines) when using a function defined on top of CUTLASS:

cutlass/include/cutlass/arch/mma_sm80.h:516: void cutlass::arch::Mma<cutlass::gemm::GemmShape<8, 8, 4>, 32, double, cutlass::layout::RowMajor, double, cutlass::layout::ColumnMajor, double, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>::operator()(cutlass::Array<double, 2, true> &, const cutlass::Array<double, 1, true> &, const cutlass::Array<double, 1, true> &, const cutlass::Array<double, 2, true> &) const: block: [1,1,0], thread: [96,0,0] Assertion `0 && __PRETTY_FUNCTION__` failed.

The compilation command used to produce the object file for the V100 was:

nvcc -c -o cutlass_gemm.o cutlass_gemm.cu -I ~/cutlass/include/ -I ~/cutlass/tools/util/include -L ~/cutlass/build/tools/library -lcutlass -arch=sm_70 -gencode=arch=compute_70,code=sm_70

I replaced:

                magma_dgemm( MagmaNoTrans, MagmaNoTrans,
                             n-(j+nb), m-j, nb,
                             c_neg_one, dAT(j-nb, j+nb), lddat,
                                        dAT(j,    j-nb), lddat,
                             c_one,     dAT(j,    j+nb), lddat, queues[1] );

With:

                cutlass_dgemm_nn( n-(j+nb), m-j, nb,
                                  &c_neg_one, dAT(j-nb, j+nb), lddat,
                                              dAT(j,    j-nb), lddat,
                                  &c_one,     dAT(j,    j+nb), lddat );

This is my first time embedding CUTLASS in an application; could you please give me a hint about this problem?

The GEMM function is:

#include <cutlass/gemm/device/gemm.h>

/*
 * Compute C = alpha * A * B + beta * C using CUTLASS.
 */
// Defines cutlass::gemm::device::Gemm, the generic Gemm computation template class.
cudaError_t cutlass_dgemm_nn(
  int M,
  int N,
  int K,
  double *ptr_alpha,
  double *A,
  int lda,
  double *B,
  int ldb,
  double *ptr_beta,
  double *C,
  int ldc) {

  double alpha = *ptr_alpha;
  double beta = *ptr_beta;

  // Type definition for a double-precision CUTLASS GEMM with column-major
  // input matrices; the default 128x128x8 threadblock tile size is overridden below.
  //
  // To keep the interface manageable, several helpers are defined for plausible compositions
  // including the following example for double-precision GEMM. Typical values are used as
  // default template arguments. See `cutlass/gemm/device/default_gemm_configuration.h` for more details.
  //
  // To view the full gemm device API interface, see `cutlass/gemm/device/gemm.h`

  using ElementOutput = double;
  using ElementAccumulator = double;
  using ColumnMajor = cutlass::layout::ColumnMajor;
  using RowMajor = cutlass::layout::RowMajor;
  /*
  using CutlassGemm = cutlass::gemm::device::Gemm<
    double,        // Data-type of A matrix
    ColumnMajor,  // Layout of A matrix
    double,        // Data-type of B matrix
    ColumnMajor,  // Layout of B matrix
    double,        // Data-type of C matrix
    ColumnMajor>; // Layout of C matrix

  */
  using CutlassGemm = cutlass::gemm::device::Gemm< 
    double,        // Data-type of A matrix
    ColumnMajor,  // Layout of A matrix
    double,        // Data-type of B matrix
    ColumnMajor,  // Layout of B matrix
    ElementOutput,        // Data-type of C matrix
    ColumnMajor,          // Layout of C matrix
    ElementAccumulator,                                     // ElementAccumulator
    cutlass::arch::OpClassTensorOp,            // tag indicating Tensor Cores
    cutlass::arch::Sm80,                        // tag indicating target GPU compute architecture
    cutlass::gemm::GemmShape<64, 64, 16>, // Threadblock tile shape
    cutlass::gemm::GemmShape<32, 32, 16>, // Warp tile shape
    cutlass::gemm::GemmShape<8, 8, 4>     // Instruction shape (FP64 tensor-core MMA)
    /*
    cutlass::epilogue::thread::LinearCombination<
      ElementOutput,
      1,
      ElementAccumulator,
      ElementAccumulator
    >,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
    4*/
  >; 

  // Instantiate the CUTLASS GEMM operator.
  CutlassGemm gemm_operator;
  // Construct the CUTLASS GEMM arguments object.
  //
  // One of CUTLASS's design patterns is to define gemm argument objects that are constructible
  // in host code and passed to kernels by value. These may include pointers, strides, scalars,
  // and other arguments needed by Gemm and its components.
  //
  // The benefits of this pattern are (1.) a structured, composable strategy for passing host-constructible
  // arguments to kernels and (2.) minimized initialization overhead on kernel entry.
  //
  CutlassGemm::Arguments args({M, N, K},  // GEMM problem dimensions
                              {A, lda},    // Tensor-ref for source matrix A
                              {B, ldb},    // Tensor-ref for source matrix B
                              {C, ldc},    // Tensor-ref for source matrix C
                              {C, ldc},    // Tensor-ref for destination matrix D (may be different memory than source C matrix)
                              {alpha, beta}); // Scalars used in the Epilogue
  //
  // Launch the CUTLASS GEMM kernel.
  //
  cutlass::Status status = gemm_operator(args);
  //
  // Return a cudaError_t if the CUTLASS GEMM operator returned an error code.
  //
  if (status != cutlass::Status::kSuccess) {
    return cudaErrorUnknown;
  }
  // Return success, if no errors were encountered.
  return cudaSuccess;
} 
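
(For context, a minimal standalone call to this wrapper could look like the following sketch; the sizes and device allocations are placeholders, not the MAGMA call site.)

// Hypothetical driver: allocate column-major A (MxK), B (KxN), C (MxN) on the
// device, then call the wrapper. For column-major storage, each leading
// dimension equals the number of rows.
int M = 1024, N = 1024, K = 1024;
double alpha = 1.0, beta = 0.0;
double *dA, *dB, *dC;
cudaMalloc(&dA, sizeof(double) * size_t(M) * K);
cudaMalloc(&dB, sizeof(double) * size_t(K) * N);
cudaMalloc(&dC, sizeof(double) * size_t(M) * N);
// ... initialize dA and dB ...
cudaError_t result = cutlass_dgemm_nn(M, N, K, &alpha, dA, M, dB, K, &beta, dC, M);
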
hwu36 commented 1 year ago

fp64 tensor cores require Ampere (sm80) at least
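
(In other words, the instantiation above selects an SM80 tensor-op kernel; it can run on an A100 when built for sm_80, but not on a V100. A sketch of the corresponding compile line, assuming an Ampere target:)

nvcc -c -o cutlass_gemm.o cutlass_gemm.cu -I ~/cutlass/include/ -I ~/cutlass/tools/util/include -arch=sm_80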

nnaron commented 1 year ago

> fp64 tensor cores require Ampere (sm80) at least

Thank you. I need a simple GEMM for Volta with the possibility to specify the stream. Which CUTLASS example would be best to adapt?

hwu36 commented 1 year ago

Fp16

nnaron commented 1 year ago

Sorry, I meant DGEMM (FP64); just CUDA cores are fine.

hwu36 commented 1 year ago

Then just the fp64 SIMT kernel

hwu36 commented 1 year ago

Like the ones in test/unit/gemm/device/simt_dgemm_tt_sm50.cu
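
And for the stream requirement: the device-level Gemm's call operator also takes an optional workspace pointer and CUDA stream, so a wrapper can forward one. A minimal sketch (the launch_gemm helper is hypothetical; the operator() signature is the one in cutlass/gemm/device/gemm.h in CUTLASS 2.x):

// Forward a caller-supplied CUDA stream to a CUTLASS device-level GEMM.
// In CUTLASS 2.x the call operator is declared as:
//   Status operator()(Arguments const &args, void *workspace = nullptr,
//                     cudaStream_t stream = nullptr);
template <typename Gemm>
cutlass::Status launch_gemm(Gemm &op,
                            typename Gemm::Arguments const &args,
                            cudaStream_t stream) {
  return op(args, /*workspace=*/nullptr, stream);  // kernel launches on `stream`
}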

nnaron commented 1 year ago

I have updated CUTLASS to the latest version, and now I am not able to compile the first code again.

nvcc -c -o cutlass_gemm.o cutlass_gemm.cu -I ~/cutlass_0423/include/ -I ~/cutlass_0423/tools/util/include -L ~/cutlass_0423/build/tools/library -lcutlass -arch=sm_70 -gencode=arch=compute_70,code=sm_70


 /home/hsr/cutlass_0423/include/cute/util/type_traits.hpp(43): error: namespace "std" has no member "conjunction"                                                                          

 /home/hsr/cutlass_0423/include/cute/util/type_traits.hpp(44): error: namespace "std" has no member "conjunction_v"                                                                        

 /home/hsr/cutlass_0423/include/cute/util/type_traits.hpp(46): error: namespace "std" has no member "disjunction"                                                                          

 /home/hsr/cutlass_0423/include/cute/util/type_traits.hpp(47): error: namespace "std" has no member "disjunction_v"                                                                        

 /home/hsr/cutlass_0423/include/cute/util/type_traits.hpp(49): error: namespace "std" has no member "negation"

 /home/hsr/cutlass_0423/include/cute/util/type_traits.hpp(50): error: namespace "std" has no member "negation_v"

 /home/hsr/cutlass_0423/include/cute/numeric/integral_constant.hpp(78): error: "auto" is not allowed here

 /home/hsr/cutlass_0423/include/cute/numeric/integral_constant.hpp(80): error: "auto" is not allowed here

 /home/hsr/cutlass_0423/include/cute/numeric/integral_constant.hpp(80): warning: constant "n" is not used in or cannot be deduced from the template argument list of class template "cute::is_constant<<error-constant>, cute::constant<T, >>"

About simt_dgemm_tt_sm50.cu:

How can I modify it to have an API like the cuBLAS GEMM? Something like the following:

cutlass_dgemm_nn( int M, int N, int K, double *ptr_alpha, double *A, int lda, double *B, int ldb, double *ptr_beta, double *C, int ldc )

thakkarV commented 1 year ago

If that is your full NVCC command line, can you add --std=c++17 to it? CUTLASS 3.0 requires C++17 as a minimum.
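
Concretely, the command above would become (only the standard flag is added):

nvcc -c -o cutlass_gemm.o cutlass_gemm.cu --std=c++17 -I ~/cutlass_0423/include/ -I ~/cutlass_0423/tools/util/include -L ~/cutlass_0423/build/tools/library -lcutlass -arch=sm_70 -gencode=arch=compute_70,code=sm_70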

nnaron commented 1 year ago

Thank you. Now the code compiles, but how is it possible to convert that code to the Volta architecture?

I thought that by commenting out cutlass::arch::OpClassTensorOp and changing cutlass::arch::Sm80 to cutlass::arch::Sm70 it might be possible, but it does not compile anymore.


    using CutlassGemm = cutlass::gemm::device::Gemm<
      double,        // Data-type of A matrix
      ColumnMajor,   // Layout of A matrix
      double,        // Data-type of B matrix
      ColumnMajor,   // Layout of B matrix
      ElementOutput, // Data-type of C matrix
      ColumnMajor,   // Layout of C matrix
      ElementAccumulator,                   // ElementAccumulator
      // cutlass::arch::OpClassTensorOp,    // tag indicating Tensor Cores (commented out)
      cutlass::arch::Sm70,                  // tag indicating target GPU compute architecture
      cutlass::gemm::GemmShape<64, 64, 16>, // Threadblock tile shape
      cutlass::gemm::GemmShape<32, 32, 16>, // Warp tile shape
      cutlass::gemm::GemmShape<8, 8, 4>     // Instruction shape
    >;
thakkarV commented 1 year ago

Please see the instantiations in test/unit/gemm/device/simt_dgemm_tn_sm50.cu for how to instantiate a Volta SIMT DGEMM. TL;DR: set the arch tag to Sm50 instead of Sm70 and the op class to OpClassSimt.
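
Applied to your wrapper, the instantiation would look something like this (a sketch: the tile shapes are copied from one of the test cases in that file, and the layouts are switched to column-major to match your cuBLAS-style interface; they may need tuning):

using ColumnMajor = cutlass::layout::ColumnMajor;

using CutlassGemm = cutlass::gemm::device::Gemm<
    double, ColumnMajor,                  // Data-type and layout of A
    double, ColumnMajor,                  // Data-type and layout of B
    double, ColumnMajor,                  // Data-type and layout of C
    double,                               // Accumulator type
    cutlass::arch::OpClassSimt,           // CUDA cores (SIMT), not Tensor Cores
    cutlass::arch::Sm50,                  // SIMT arch tag; these kernels also run on Volta
    cutlass::gemm::GemmShape<128, 32, 8>, // Threadblock tile shape
    cutlass::gemm::GemmShape<32, 16, 8>,  // Warp tile shape
    cutlass::gemm::GemmShape<1, 1, 1>     // SIMT instruction shape
>;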

nnaron commented 1 year ago

Sorry for asking, but I have difficulty understanding the code in test/unit/gemm/device/simt_dgemm_tn_sm50.cu.

For example, the following part is a macro. Is that correct?


////////////////////////////////////////////////////////////////////////////////
// Elements / Thread:   4 x   4
//    Threads / Warp:   8 x   4
//     Warps / Block:   4 x   2
//       Threadblock: 128 x  32 x  8
CUTLASS_TEST_L2(SM50_device_dgemm_tt, 128x32x8_32x16x1_4x4_8x4_4x2, {
    using precision = double;
    using ThreadblockShape = cutlass::gemm::GemmShape<128, 32, 8>;
    using WarpShape = cutlass::gemm::GemmShape<32, 16, 8>;

    static int const kEpilogueElementsPerAccess = 1;
    using InstructionShape = cutlass::gemm::GemmShape<1, 1, 1>;
    using EpilogueOutputOp = cutlass::epilogue::thread::LinearCombination<
        precision, kEpilogueElementsPerAccess, precision, precision>;

    using Gemm = cutlass::gemm::device::Gemm<
        precision, cutlass::layout::RowMajor,
        precision, cutlass::layout::RowMajor,
        precision, cutlass::layout::RowMajor,
        precision,
        cutlass::arch::OpClassSimt,
        cutlass::arch::Sm50,
        ThreadblockShape, WarpShape, InstructionShape,
        EpilogueOutputOp,
        cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
        2 // Stages
    >;
    EXPECT_TRUE(test::gemm::device::TestAllGemm<Gemm>());
} )

And is this the line that executes it? If so, where is the input data?

EXPECT_TRUE(test::gemm::device::TestAllGemm<Gemm>());

hwu36 commented 1 year ago

grep for TestAllGemm in that directory and you can find the testbed.
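
(And to drive the kernel with your own data instead of the unit-test testbed, here is a minimal sketch, assuming device-resident column-major buffers and the Sm50/OpClassSimt CutlassGemm type shown earlier:)

// Sketch: direct invocation with user-provided device buffers (no testbed).
// dA is MxK, dB is KxN, dC is MxN, all column-major and already on the device.
cudaError_t cutlass_dgemm_nn(int M, int N, int K,
                             double alpha, double const *dA, int lda,
                             double const *dB, int ldb,
                             double beta, double *dC, int ldc,
                             cudaStream_t stream = nullptr) {
  CutlassGemm gemm_operator;
  CutlassGemm::Arguments args({M, N, K},      // GEMM problem dimensions
                              {dA, lda},      // TensorRef for source matrix A
                              {dB, ldb},      // TensorRef for source matrix B
                              {dC, ldc},      // TensorRef for source matrix C
                              {dC, ldc},      // TensorRef for destination matrix D
                              {alpha, beta}); // Epilogue scalars
  cutlass::Status status = gemm_operator(args, /*workspace=*/nullptr, stream);
  return (status == cutlass::Status::kSuccess) ? cudaSuccess : cudaErrorUnknown;
}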