jeromeku closed this 3 weeks ago
Sure, we are exploring CuTe, and we believe it's the best way to use TMA.
The main reason we are still sticking to a custom implementation is that we haven't figured out how to use TMA for sparse memory loading (e.g., in paged attention prefill). We also have some ongoing effort on supporting AMD GPUs, and I suppose porting CUTLASS 3.0 code to ROCm might be hard (please correct me if I'm wrong).
I'll gradually replace some of the existing code with higher level abstractions in the next few months, and yes we welcome your contributions.
@yzh119 Thanks for the response!
Is there an easy way to tweak the source install such that only a few of the kernels are compiled (e.g., prefill + decode only, or some subset thereof)? I realize that certain environment variables can be set to limit the template instantiations, but I haven't found a coarser-grained way of building only select kernel categories.
@yzh119 nevermind -- ended up stripping out certain kernels, then using define flags and `torch.utils.cpp_extension.load` to JIT-compile specific kernel instantiations.
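For reference, that approach can be sketched roughly as follows. Note this is a hypothetical illustration: the source paths (`csrc/*.cu`) and macro names (`ENABLE_PREFILL`, etc.) are made up for the example and are not FlashInfer's actual build layout.

```python
# Sketch: JIT-compile only a subset of kernel categories by passing
# per-category -D define flags to torch.utils.cpp_extension.load.
# Paths and macro names below are hypothetical, for illustration only.

def build_load_kwargs(kernels):
    """Return keyword arguments for torch.utils.cpp_extension.load,
    enabling only the requested kernel categories via -D flags."""
    sources = [f"csrc/{k}.cu" for k in kernels]            # hypothetical paths
    defines = [f"-DENABLE_{k.upper()}" for k in kernels]   # hypothetical macros
    return dict(
        name="flashinfer_subset",
        sources=sources,
        extra_cuda_cflags=defines + ["-O3"],
        verbose=True,
    )

kwargs = build_load_kwargs(["prefill", "decode"])
print(kwargs["extra_cuda_cflags"])
# -> ['-DENABLE_PREFILL', '-DENABLE_DECODE', '-O3']

# With PyTorch and a CUDA toolchain installed, the module would then be
# built and loaded in one step:
#   from torch.utils.cpp_extension import load
#   module = load(**kwargs)
```

The upside of this route is that unguarded template instantiations in the excluded `.cu` files never get compiled at all, which keeps build times down compared to a full source install.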
@yzh119
Awesome work -- really appreciate the clean and clearly documented code, and the well-written blog posts.
Wondering if you've experimented with building the various attention implementations using Cutlass, and specifically the `CuTe` primitives (introduced with the 3.x release)? It could help with modularizing the code and perhaps make it more extensible.

Is there a guide on how best to leverage the various PTX wrappers (`mma.cuh`, `cp_async.cuh`, etc.) and other building blocks you've written when authoring new (or improving existing) attention kernels?

I am admittedly not at your level in writing raw CUDA -- more well-versed in Triton -- but would love to contribute.