flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

Feature request: support non-contiguous tensors for attention #311

Closed: Yard1 closed this issue 2 months ago

Yard1 commented 3 months ago

Currently, vLLM computes the QKV projection as a single matmul, producing a fused tensor that is then split into q, k and v. Because the splits are views into that fused output, the resulting tensors are non-contiguous. It would be great if FlashInfer's attention kernels supported this case (and so avoided the copies currently required to make the tensors contiguous) by accepting a stride parameter for each tensor, similar to how Flash Attention does it. This applies to both the Paged kernels (query only) and the Ragged kernels (query, key and value).
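
For illustration, a minimal PyTorch sketch (shapes and weights are made up, not vLLM's actual code) of how the fused projection yields non-contiguous views and why a copy is currently needed:

```python
import torch

num_tokens, num_heads, head_dim = 4, 8, 64
hidden = num_heads * head_dim

# Fused QKV projection: one matmul producing [num_tokens, 3 * hidden].
x = torch.randn(num_tokens, hidden)
w_qkv = torch.randn(3 * hidden, hidden)
qkv = x @ w_qkv.t()

# Splitting along the last dim returns views into the fused buffer,
# so each view's row stride is 3 * hidden rather than hidden.
q, k, v = qkv.split(hidden, dim=-1)
print(q.is_contiguous())  # False
print(q.stride())         # (3 * hidden, 1) == (1536, 1)

# Without stride support in the attention kernels, each view has to be
# materialized first, which costs an extra copy per tensor:
q = q.contiguous()
```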

yzh119 commented 3 months ago

I think it's doable by adding stride fields to https://github.com/flashinfer-ai/flashinfer/blob/3d43dc9dc1a2ae804eaa7e40b4555e471fd03fe3/include/flashinfer/layout.cuh#L66 https://github.com/flashinfer-ai/flashinfer/blob/3d43dc9dc1a2ae804eaa7e40b4555e471fd03fe3/include/flashinfer/page.cuh#L72
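
As a sketch of what such stride fields would buy (names here are hypothetical, not FlashInfer's actual layout API): once element offsets are computed from explicit strides instead of an assumed densely packed [num_tokens, num_heads, head_dim] buffer, the views produced by the fused QKV projection can be addressed in place:

```python
import torch

def elem_offset(token, head, feat, stride_n, stride_h):
    # Hypothetical stride-aware indexing: what a layout struct carrying
    # explicit stride fields would compute, instead of assuming a
    # densely packed [num_tokens, num_heads, head_dim] buffer.
    return token * stride_n + head * stride_h + feat

num_tokens, num_heads, head_dim = 4, 8, 64
hidden = num_heads * head_dim

# Non-contiguous q view obtained by splitting a fused QKV buffer.
qkv = torch.randn(num_tokens, 3 * hidden)
q = qkv[:, :hidden].view(num_tokens, num_heads, head_dim)

stride_n, stride_h, _ = q.stride()  # (3 * hidden, head_dim, 1)
t, h, f = 2, 5, 17

# The strided offset addresses the element inside the fused buffer
# directly, so no .contiguous() copy is needed.
assert q[t, h, f].item() == qkv.view(-1)[elem_offset(t, h, f, stride_n, stride_h)].item()
```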

yzh119 commented 3 months ago

This is now at the top of my TODO list; I'll make this feature available in the v0.0.9 release.

yzh119 commented 2 months ago

Done in #404.