```python
from setuptools import setup
from torch.utils import cpp_extension
```
Try taking a look at the Python example 02_pytorch_extension_grouped_gemm. It emits a setup.py file that can be used with PyTorch; you can follow the pattern in that file.
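For reference, a minimal setup.py along those lines might look like the sketch below. This is an assumption-laden sketch, not the example's actual file: the extension name, source file (`cutlass_gemm.cu`), and CUTLASS include path are placeholders you would adapt to your project.

```python
# Hypothetical setup.py for building a CUTLASS-based CUDA extension with
# PyTorch's cpp_extension helpers. Names and paths below are placeholders.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="cutlass_gemm",  # placeholder package name
    ext_modules=[
        CUDAExtension(
            name="cutlass_gemm",
            sources=["cutlass_gemm.cu"],  # placeholder source file
            include_dirs=["third_party/cutlass/include"],  # assumed CUTLASS header path
            extra_compile_args={
                "cxx": ["-O3", "-std=c++17"],
                # sm_90a enables the Hopper-specific instructions (e.g. WGMMA)
                # that CUTLASS's Hopper kernels rely on.
                "nvcc": ["-O3", "-std=c++17", "--gpu-architecture=sm_90a"],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```

With this in place, `pip install -e .` (or `python setup.py develop`) builds the extension so it can be imported from Python alongside PyTorch tensors.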
I want to use CUTLASS functions with PyTorch tensors in Python. I have previously used pybind11 to compile CUDA programs that can be called from Python. However, it seems that CUTLASS requires CMake for compilation and import (https://github.com/NVIDIA/cutlass/tree/main/examples/60_cutlass_import). I barely know CMake. Is using CMake as in that example, together with a CMakeExtension in setup.py (https://github.com/pybind/cmake_example/blob/master/setup.py), the only way, or is there something I am missing?
I added "--gpu-architecture=sm_90a" as an nvcc flag and compiled with pybind just as before (by directly including the source code as a header file), but saw severe performance degradation: about 85 TFLOPS for a half-precision matmul on a Hopper GPU. Can incorrect compilation degrade performance like this?
Thanks!
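One way to check whether the sm_90a flag actually took effect is to inspect which architectures are embedded in the built shared object with `cuobjdump` (shipped with the CUDA toolkit). A hedged sketch; the extension path `my_extension.so` is a placeholder for your compiled module:

```python
# Sketch: list the GPU architectures compiled into a built extension.
# The .so path below is a placeholder -- point it at your actual module.
import re
import subprocess


def archs_in_listing(listing: str) -> set:
    """Extract sm_XX architecture tags from `cuobjdump --list-elf` output."""
    return set(re.findall(r"sm_\d+a?", listing))


if __name__ == "__main__":
    out = subprocess.run(
        ["cuobjdump", "--list-elf", "my_extension.so"],  # placeholder path
        capture_output=True, text=True, check=True,
    ).stdout
    # If the sm_90a flag took effect, the set should contain "sm_90a".
    print(archs_in_listing(out))
```

If the listing shows only `sm_90` (or an older architecture), the Hopper-specific kernels were not actually built, which would be consistent with a large performance drop.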