Closed by mathfirst 11 months ago
which architecture are you using?
CUTLASS requires data to be stored in a special swizzled layout to avoid shared-memory bank conflicts, which is not what your pseudocode does. Maybe you can just use wmma in your code?
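For reference, here is a minimal sketch of the wmma approach suggested above: a single warp computing one 16×16 tile of C = AB with the `nvcuda::wmma` API, assuming half-precision inputs, float accumulation, and leading dimensions of 16. A real kernel would tile over larger matrices and loop over K.

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 tile of C = A * B
// (half inputs, float accumulator, single K-step of 16).
__global__ void wmma_gemm_16x16(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);             // C tile starts at zero
    wmma::load_matrix_sync(a_frag, a, 16);         // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // C += A * B on tensor cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```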
Thanks for your nice reply. I use an A100 GPU. Yes, your suggestion is good and I'll try it. I was wondering: in terms of speedup, is CUTLASS much faster than using wmma directly, or are they comparable?
CUTLASS is faster but more complex; that is the tradeoff you need to make.
@mathfirst, FYI, we plan to release cuBLASDx in the next few months for executing GEMMs in a CUDA kernel. You can check out cuFFTDx to get an idea of the API and intent. https://docs.nvidia.com/cuda/cufftdx/index.html
@mnicely That sounds great. I tried wmma, but it is not as fast as expected; it may need some optimization. Looking forward to cuBLASDx! I will check out cuFFTDx as you suggested.
Closing this for now. Keep an eye out for cuBLASDx, as I think it will be the best solution.
cuBLASDX examples have been posted https://github.com/NVIDIA/CUDALibrarySamples/tree/master/MathDx
Thanks for your hard work!
I would like to use CUTLASS to perform matrix multiplication within a CUDA kernel. Specifically, before the matrix multiplication I need to do some processing while loading the input matrices A (m×k) and B (k×n) into shared memory, then perform the matrix multiplication C = AB (m×n), and after C is obtained I need to do some further processing on C. The code snippet is something like this.
Actually, the question is how to use CUTLASS within a CUDA kernel. I am new to CUTLASS and it seems hard to use; can anybody show me a minimal example for my question? Any guidance will be appreciated.
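To make the intended pattern concrete, here is a hedged sketch of the kernel structure described above: stage A and B in shared memory, multiply, then post-process C before writing it out. The tile sizes, the naive inner product, and the epilogue placeholders are all illustrative; the GEMM loop is exactly the part a library like CUTLASS or wmma would replace with a tensor-core implementation.

```cuda
// Sketch of the described fused pattern (illustrative tile sizes; the
// pre/post-processing steps are placeholders for "do something").
template <int M, int N, int K>
__global__ void fused_gemm(const float* A, const float* B, float* C) {
    __shared__ float sA[M][K];
    __shared__ float sB[K][N];

    // 1. Preprocess while loading A and B into shared memory.
    for (int i = threadIdx.x; i < M * K; i += blockDim.x)
        sA[i / K][i % K] = A[i];           // per-element preprocessing goes here
    for (int i = threadIdx.x; i < K * N; i += blockDim.x)
        sB[i / N][i % N] = B[i];
    __syncthreads();

    // 2. Naive shared-memory GEMM: the part to replace with CUTLASS/wmma.
    for (int i = threadIdx.x; i < M * N; i += blockDim.x) {
        int row = i / N, col = i % N;
        float acc = 0.0f;
        for (int kk = 0; kk < K; ++kk)
            acc += sA[row][kk] * sB[kk][col];
        // 3. Post-process the result before writing C out.
        C[i] = acc;                        // epilogue op goes here
    }
}
```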