jeromeku closed this 3 weeks ago
Sure, we are exploring CuTe, and we believe it's the best way to use TMA.
The main reason we are still sticking to a custom implementation is that we haven't figured out how to use TMA for sparse memory loading (e.g., in paged attention prefill). We also have some ongoing effort on supporting AMD GPUs, and I suppose porting CUTLASS 3.0 code to ROCm might be hard (please correct me if I'm wrong).
I'll gradually replace some of the existing code with higher level abstractions in the next few months, and yes we welcome your contributions.
@yzh119 Thanks for the response!
Is there an easy way to tweak the source install such that only a few of the kernels are compiled (e.g., prefill + decode only, or some subset thereof)? I realize that certain environment variables can be set to limit the template instantiations, but I haven't found a coarser-grained way of building only select kernel categories.
@yzh119 nevermind -- ended up stripping out certain kernels, then using define flags and `torch.utils.cpp_extension.load` to JIT-compile specific kernel instantiations.
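For reference, that approach can be sketched roughly as follows. Note this is a hypothetical illustration: the source paths (`csrc/*.cu`) and macro names (`ENABLE_PREFILL`, etc.) are made up for the example and are not FlashInfer's actual build layout.

```python
# Sketch: JIT-compile only a subset of kernel categories by passing
# per-category -D define flags to torch.utils.cpp_extension.load.
# Paths and macro names below are hypothetical, for illustration only.

def build_load_kwargs(kernels):
    """Return keyword arguments for torch.utils.cpp_extension.load,
    enabling only the requested kernel categories via -D flags."""
    sources = [f"csrc/{k}.cu" for k in kernels]            # hypothetical paths
    defines = [f"-DENABLE_{k.upper()}" for k in kernels]   # hypothetical macros
    return dict(
        name="flashinfer_subset",
        sources=sources,
        extra_cuda_cflags=defines + ["-O3"],
        verbose=True,
    )

kwargs = build_load_kwargs(["prefill", "decode"])
print(kwargs["extra_cuda_cflags"])
# -> ['-DENABLE_PREFILL', '-DENABLE_DECODE', '-O3']

# With PyTorch and a CUDA toolchain installed, the module would then be
# built and loaded in one step:
#   from torch.utils.cpp_extension import load
#   module = load(**kwargs)
```

The upside of this route is that unguarded template instantiations in the excluded `.cu` files never get compiled at all, which keeps build times down compared to a full source install.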
@yzh119
Awesome work -- really appreciate the clean and clearly documented code, and the well-written blog posts.
Wondering if you've experimented with building the various attention implementations using Cutlass, and specifically the `CuTe` primitives (introduced with the 3.x release)? It could help with modularizing the code and perhaps make it more extensible.

Is there a guide on how best to leverage the various PTX wrappers (`mma.cuh`, `cp_async.cuh`, etc.) and other building blocks you've written when authoring new (or improving existing) attention kernels?

I am admittedly not at your level in writing raw CUDA -- more well-versed in Triton -- but would love to contribute.