Closed OrenLeung closed 2 months ago
I have no clue how the PyTorch integration works here, but CUTLASS is not a prebuilt kernel library like cuBLAS, nor does it ship any heuristics. It requires an expert who knows which kernel will work best to instance it and run it. Failing that, you can also build a large number of kernels from the library and autotune them using the CUTLASS profiler; however, even in that case you are only going to build an extremely small subset of the millions of possible kernels CUTLASS supports. You are also likely to miss out on many tuning knobs, like the rasterization remapping we just released with 3.5.1. Out-of-the-box comparisons of CUTLASS with anything else (cuBLAS, Triton) are not straightforward and require deep knowledge of GPU architecture, of CUTLASS itself, and of whatever you are comparing it against in order to ensure a faithful comparison.
That said, @jackkosaian, our Python interface should certainly be picking a better default here, I'd think? For
M, N, K = 8192, 8192, 8192
CUTLASS should be hitting >= 1.5 PFLOP/s
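For scale, the back-of-envelope arithmetic behind that figure (my own sanity check, not a measured result):

```python
# FLOPs in one M x N x K GEMM: one multiply and one add per MAC.
M = N = K = 8192
flops = 2 * M * N * K          # ~1.10e12 FLOPs per GEMM

# At the ~1.5 PFLOP/s quoted above, a single GEMM at this shape
# should complete in well under a millisecond.
seconds_at_1p5_pflops = flops / 1.5e15
print(f"{flops:.3e} FLOPs, {seconds_at_1p5_pflops * 1e3:.2f} ms per GEMM")
```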
Hi Vijay @thakkarV ,
I appreciate your quick reply.
I do appreciate the flexibility of CUTLASS. For example, I am trying to run an e5m2-by-e5m2 GEMM, which cuBLAS does not support as that combination is never used in ML.
> In lieu of this, you can also build a ton of kernels in the library and autotune them using the cutlass profiler; however, even in that case you are only going to build an extremely small subset of the millions of possible kernels CUTLASS supports.
Thanks for the tip about the profiler; I will try using it to build INT8 and e5m2 kernels.
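For reference, a build-and-profile loop for this might look like the following (a sketch, not verified commands for this exact setup: the kernel filter string and the architecture flag are assumptions to adjust for your GPU and data types):

```shell
# Build the profiler with a set of candidate kernels. The filter string is
# an assumption -- narrow it to the kernel names you actually care about.
cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_LIBRARY_KERNELS="*e5m2*"
make cutlass_profiler -j

# Profile every built kernel at the problem size; the CSV output can then
# be sorted by achieved GFLOP/s to find the best configuration.
./tools/profiler/cutlass_profiler --operation=Gemm \
  --m=8192 --n=8192 --k=8192 \
  --output=gemm_8192.csv
```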
Hi,
I am trying to benchmark the difference in TFLOP/s between CUTLASS and cuBLAS (through PyTorch).
I am following the way of calling a GEMM op shown in your Python example link.
Unfortunately, I see that CUTLASS only reaches 321 TFLOP/s on FP8 versus 1296 TFLOP/s with cuBLAS. Do y'all have any suggestions on how to improve the performance? I have attached the repro script below.
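To keep the comparison apples-to-apples, both paths should be timed with the same harness and converted to TFLOP/s with the same formula. A minimal sketch (the warmup and iteration counts are arbitrary choices; on GPU you would additionally synchronize, e.g. with `torch.cuda.synchronize()` or CUDA events, around the timed region):

```python
import time

def gemm_tflops(fn, m, n, k, warmup=10, iters=100):
    """Time a GEMM callable and return achieved TFLOP/s (2*M*N*K FLOPs per call)."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    return 2 * m * n * k * iters / elapsed / 1e12

# Hypothetical usage with the PyTorch path (a, b are FP8 tensors on GPU):
# tflops = gemm_tflops(lambda: torch._scaled_mm(a, b), 8192, 8192, 8192)
```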
Results
Setup
`pip install python-cutlass`
cuBLAS 12.5.3.2
`docker run -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --privileged --gpus all -v $(pwd):/workspace nvcr.io/nvidia/pytorch:24.07-py3`
Repro Script