ByronHsu opened this issue 2 weeks ago
Hi @ByronHsu, thanks for your suggestions. I think 1 & 3 are easy to support.
For 1, we can add `sin_cache` and `cos_cache` as optional fields to the rope APIs. For long context there might be numerical issues with an f16 sin/cos cache, so we should also support an f32 sin/cos cache (our current on-the-fly sin/cos computation uses f32).
For 3, yes we can add another `rope_dim` field for partial rope.
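For reference, a minimal PyTorch sketch illustrating 1 and 3 (an illustration only, not flashinfer's implementation or the proposed API): precomputing an f32 cos/sin cache, and applying rope only to the first `rope_dim` dimensions of each head while passing the rest through.

```python
import torch

# Illustration only: not flashinfer's kernels or the proposed signatures.

def build_cos_sin_cache(max_position: int, rotary_dim: int, base: float = 10000.0) -> torch.Tensor:
    # Keep the cache in f32 to avoid fp16 precision issues at long context lengths.
    inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim))
    freqs = torch.outer(torch.arange(max_position, dtype=torch.float32), inv_freq)
    return torch.cat([freqs.cos(), freqs.sin()], dim=-1)   # [max_position, rotary_dim]

def apply_partial_rope(x: torch.Tensor, cos_sin: torch.Tensor, rope_dim: int) -> torch.Tensor:
    # x: [num_tokens, num_heads, head_size]; cos_sin: [num_tokens, rope_dim] rows gathered
    # from the cache above. Only the first `rope_dim` dims of each head are rotated
    # (neox rotate-half style); the remaining dims pass through unchanged.
    cos, sin = cos_sin.unsqueeze(1).chunk(2, dim=-1)        # each [num_tokens, 1, rope_dim // 2]
    x_rot, x_pass = x[..., :rope_dim], x[..., rope_dim:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    rotated = torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```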
Can you give a concrete example of 2?
Okay, I think I understand 2 now. For example, if `batch_size=3`, `indptr=[0, 1, 5, 10]`, and `offsets=[4, 6, 3]`, then an equivalent `positions` would be `[4, 6, 7, 8, 9, 3, 4, 5, 6, 7]`. Is that true?
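A small PyTorch sketch of that expansion, using the numbers above (illustration only):

```python
import torch

# Expand per-request `offsets` into per-token `positions` using the CSR-style `indptr`.
indptr = torch.tensor([0, 1, 5, 10])
offsets = torch.tensor([4, 6, 3])
seq_lens = (indptr[1:] - indptr[:-1]).tolist()            # [1, 4, 5]
positions = torch.cat(
    [torch.arange(off, off + n) for off, n in zip(offsets.tolist(), seq_lens)]
)
print(positions)  # -> [4, 6, 7, 8, 9, 3, 4, 5, 6, 7]
```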
Yes exactly! Thank you for the prompt response! All sounds good to me.
One comment: Can we separate the new API from flashinfer's current four rope functions and provide exactly the same interface as vLLM? Several reasons:
- `apply_rope_inplace` only implements the formula from the original paper, but in reality there are many more variants.
- Maybe we can call this one `apply_rope_inplace_with_cache`, which does not calculate rope on the fly and supports my proposed features:
```python
def apply_rope_inplace_with_cache(
    positions: torch.Tensor,
    query: torch.Tensor,
    key: torch.Tensor,
    head_size: int,
    cos_sin_cache: torch.Tensor,
    is_neox: bool,
) -> None:
    ...
```
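For clarity, here is a pure-PyTorch sketch of the semantics I have in mind for such a call (an illustration only, assuming vLLM's `cos_sin_cache` layout with the cos half followed by the sin half, and covering only the neox rotate-half style):

```python
import torch

def _reference_rope_with_cache(
    positions: torch.Tensor,      # [num_tokens]
    query: torch.Tensor,          # [num_tokens, num_heads * head_size]
    key: torch.Tensor,            # [num_tokens, num_kv_heads * head_size]
    head_size: int,
    cos_sin_cache: torch.Tensor,  # [max_position, head_size], cos half then sin half
    is_neox: bool = True,
) -> None:
    assert is_neox, "this sketch only covers the neox (rotate-half) style"
    # Gather the per-token cos/sin rows by position instead of recomputing them.
    cos, sin = cos_sin_cache[positions].chunk(2, dim=-1)   # each [num_tokens, head_size // 2]
    cos, sin = cos.unsqueeze(1), sin.unsqueeze(1)          # broadcast over heads
    for t in (query, key):
        x = t.view(t.shape[0], -1, head_size)              # [num_tokens, heads, head_size]
        x1, x2 = x.chunk(2, dim=-1)
        x.copy_(torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1))
```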
I did a global search and found that `ops.batch_rotary_embedding` is not used in SGLang (and it looks like it isn't used in vLLM either), so we can safely skip the 4th feature. Thanks!
As part of SGLang Issue #1487, SGLang plans to move vLLM to optional dependencies and use flashinfer as the main dependency.
I am working on moving rope to flashinfer. My plan is to reuse most of the existing vLLM rope code but replace `ops.rotary_embedding` and `ops.batch_rotary_embedding` with flashinfer's kernels, which can be found here. However, I've noticed some gaps between the vLLM and flashinfer implementations:

1. flashinfer computes sin/cos on the fly, while vLLM applies rope from a precomputed sin/cos cache.
2. flashinfer's rope APIs take `indptr` + `offsets`, so we would have to convert `positions` back to `offsets` + `indptr`. Can we support `positions` directly?
3. vLLM supports partial rope (a rotary dimension smaller than the head size).
4. flashinfer has no equivalent of `ops.batch_rotary_embedding`.

In general, we can prioritize the first three issues and consider the fourth as a stretch goal.
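For context on point 2, here is a sketch of the conversion we would otherwise have to do on the fly (hypothetical helper; it assumes each request's positions are consecutive and the per-request sequence lengths are known):

```python
import torch

# Derive flashinfer-style (indptr, offsets) from vLLM-style flattened `positions`,
# assuming consecutive per-request positions and known per-request lengths.
def positions_to_indptr_offsets(positions: torch.Tensor, seq_lens: torch.Tensor):
    indptr = torch.zeros(seq_lens.numel() + 1, dtype=torch.int64)
    indptr[1:] = torch.cumsum(seq_lens, dim=0)
    offsets = positions[indptr[:-1]]          # first position of each request
    return indptr, offsets

indptr, offsets = positions_to_indptr_offsets(
    torch.tensor([4, 6, 7, 8, 9, 3, 4, 5, 6, 7]), torch.tensor([1, 4, 5])
)
print(indptr, offsets)  # -> indptr: [0, 1, 5, 10], offsets: [4, 6, 3]
```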