mrakgr opened this issue 2 months ago
Of course, just #include "cutlass/gemm/kernel/gemm_universal.h"
doesn't work for me. The IDE doesn't see the class inside the file.
I just recompiled Cutlass with SM90 support. Here is how it looks in the IDE.
GroupProblemShape has a type error, and these kinds of issues aren't present in the non-90 examples. More problematically, everything below #if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED) is grayed out.
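My understanding (which may be wrong) is that these guards key off the CUDA toolkit version and the target architecture the compiler is invoked with, so the IDE only treats the SM90 code as live if it parses the headers with the same flags. Compiling from the command line, I'd expect something like this to be needed (paths are relative to a CUTLASS checkout; my_kernel.cu is a placeholder):

```shell
# Header-only use of CUTLASS: point nvcc at the include directories and
# target the Hopper architecture so the SM90 guard macros evaluate true.
nvcc -std=c++17 -arch=sm_90a \
     -I cutlass/include -I cutlass/tools/util/include \
     my_kernel.cu -o my_kernel
```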
If I were to remove those macro checks...
Please have mercy on me here.
CC @ANIKET-SHIVAM
@mrakgr, just to understand your setup: are you using the CMake configuration flow built into the IDE, or are you only using the IDE to view/edit the files while building from elsewhere, like PowerShell? I ask because the CUTLASS device-side code relies on many definitions that are computed and generated in CMake. To ensure things work properly, I would start by copying an existing example and getting it working with the custom kernel you need, then modify that example's CMakeLists.txt to produce a library containing the kernel, which you can import into your other, larger project (or something along those lines).
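To make that concrete, the usual flow (per the CUTLASS quickstart; `90a` assumes a Hopper target, and the example target name is one from the repo) lets CMake compute those definitions for you:

```shell
# From the root of a CUTLASS checkout.
mkdir build && cd build

# CUTLASS_NVCC_ARCHS controls which device architectures are compiled for.
cmake .. -DCUTLASS_NVCC_ARCHS=90a

# Build a single example target rather than the whole tree, e.g.:
make 48_hopper_warp_specialized_gemm -j
```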
My actual setup is that I am using my own language, Spiral, which compiles to Python + CUDA. You can see some examples in that tutorial to get a sense of what it looks like. My stance towards complex C++ build setups is that I don't like them and want nothing to do with them if possible. But that isn't possible here, because Cutlass foists CMake on me, leaving me no choice but to interact with it. I've only been using VS because I am on Windows and am trying to get a handle on the library, which would be a lot harder to do from Spiral directly.
> I ask because the CUTLASS device-side code relies on many definitions that are computed and generated in CMake.
If that is true, it might make Cutlass unusable for me. I have a fully fused ML library, and all the matmults need to be called directly on the device instead of from the host. I've written my own, but it's not as good as the Cutlass one, so my goal is to replace it. I want to find a way to use Cutlass as a header-only library, but if that is impossible, I'll give up and wait for the next generation of consumer NVIDIA cards to come out. They have warp-group matrix multiplication instructions operating on shared memory, which I expect should make it easier to write a fully performant matmult kernel.
Even if it's somehow possible to get it to compile by finding all the right compiler options, I have a limited complexity budget. I am willing to set one or two extra compile-time options when including the Cutlass library in Spiral code, but absolutely nothing like the enormous CMake files I see in its repo.
As an aside, I did try out cuBLASDx and got it to work, but the performance was absolutely horrible. I am just hoping that I can either find a way to bring in Cutlass, or that on next gen cards, matmults will be easier to implement.
Also, I should mention one aspect of my setup. All the tensor/matrix dimensions, layouts and their strides are known at compile time, so in theory, Cutlass should have everything it needs to select the optimal kernel at compile time without the need for build options.
In the ML library's use case, I have poker games that run directly on the GPU and call the ML library functions. Since the games are register-hungry, I've been running one block per SM, and the ML library and the matmult need to be performant under that condition.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
Following up on #1291, I am finally making the time to properly understand Cutlass. I can build and run the examples in the repo, but I am having a lot of difficulty understanding the type errors I get when I try to import the GemmUniversal classes. For example, I've isolated this piece of code from one of the tests and am trying to bring it into another project without success. Even when I seemingly have all the necessary includes, I see something like this...
Probably I shouldn't be trying to import test harness code, but the trouble is that there are no clean examples of how to use the universal kernel.
https://github.com/NVIDIA/cutlass/blob/main/media/docs/quickstart.md#launching-a-gemm-kernel-using-cutlass-30-or-newer
The quickstart one is clean but doesn't typecheck for me.
There are some examples, like 48_hopper_warp_specialized_gemm, which do show how to use it, but when I get rid of the device arch checking macros, I just get a ton of type errors. Those kinds of examples aren't usable as-is. I am going to try recompiling the Cutlass library for 9.0 devices just to see if something changes.
But otherwise, could I get some help to start me off with Cutlass? Thanks.