flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

Qwen1.5-32B failed: BatchPrefillWithPagedKVCachePyTorchWrapper failed to dispatch group_size 5 #254

Closed: QwertyJack closed this issue 1 month ago

QwertyJack commented 1 month ago

I am trying to serve Qwen1.5-32B-Chat with SGLang, but it fails with:

```
  ...
  File "/home/jack/.conda/envs/sglang/lib/python3.11/site-packages/sglang/srt/layers/radix_attention.py", line 92, in prefill_forward_flashinfer
    o = input_metadata.prefill_wrapper.forward(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/.conda/envs/sglang/lib/python3.11/site-packages/flashinfer/prefill.py", line 498, in forward
    return self._wrapper.forward(
           ^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: BatchPrefillWithPagedKVCachePyTorchWrapper::Forward(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, bool, unsigned int, bool, float, float, float, bool)::<lambda()>::<lambda()>::<lambda()>::<lambda()>::<lambda()> failed to dispatch group_size 5
```

Env: RTX 3090, CUDA 12.3, Python 3.11, torch 2.3.0, SGLang 0.1.16, flashinfer-0.0.4+cu121torch2.3-cp311-cp311-linux_x86_64.whl

Btw, Qwen1.5-14B-Chat works like a charm ;)

Any chance that we can get Qwen1.5-32B supported?

Thanks in advance!
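
For context on the error (this aside is editorial, not from the original report): the group size in the message is the GQA ratio num_attention_heads / num_key_value_heads. Qwen1.5-32B uses grouped-query attention with 40 query heads and 8 KV heads, giving group size 5, while flashinfer 0.0.4 only shipped kernels compiled for a fixed set of group sizes, so the dispatch fails. Qwen1.5-14B uses plain multi-head attention (group size 1), which is why it works. A quick check of the ratio for any model, assuming the standard Hugging Face config fields:

```python
from transformers import AutoConfig

# GQA group size = query heads per KV head; flashinfer dispatches kernels on this value.
cfg = AutoConfig.from_pretrained("Qwen/Qwen1.5-32B-Chat")
group_size = cfg.num_attention_heads // cfg.num_key_value_heads
print(cfg.num_attention_heads, cfg.num_key_value_heads, group_size)  # 40 8 5
```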

yzh119 commented 1 month ago

@QwertyJack Thank you for the feedback.

With https://github.com/flashinfer-ai/flashinfer/pull/301 merged, flashinfer's prefill kernels now support any group size, and the decode kernels support group sizes 1-8.
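
For anyone verifying a build that includes that PR, here is a rough sketch of driving the batch prefill wrapper directly with a group-size-5 head configuration. The begin_forward/forward names follow the 0.0.x-era Python API; the exact signatures and all tensor shapes below are illustrative assumptions, not taken from this thread:

```python
import torch
import flashinfer

# 128 MB workspace buffer for the wrapper's internal scheduling metadata.
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda:0")
wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, "NHD")

# Qwen1.5-32B-like head configuration: 40 / 8 -> group size 5.
num_qo_heads, num_kv_heads, head_dim, page_size = 40, 8, 128, 16
batch_size, qlen, total_pages = 2, 7, 8

# Ragged query layout: request i owns q rows qo_indptr[i]:qo_indptr[i+1].
qo_indptr = torch.tensor([0, qlen, 2 * qlen], dtype=torch.int32, device="cuda:0")
# Each request owns 4 KV pages; the last page holds 1 valid token.
kv_indptr = torch.tensor([0, 4, 8], dtype=torch.int32, device="cuda:0")
kv_indices = torch.arange(total_pages, dtype=torch.int32, device="cuda:0")
kv_last_page_len = torch.tensor([1, 1], dtype=torch.int32, device="cuda:0")

wrapper.begin_forward(
    qo_indptr, kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
)
q = torch.randn(batch_size * qlen, num_qo_heads, head_dim,
                dtype=torch.float16, device="cuda:0")
# Paged KV cache in NHD layout: (pages, 2, page_size, num_kv_heads, head_dim).
kv_data = torch.randn(total_pages, 2, page_size, num_kv_heads, head_dim,
                      dtype=torch.float16, device="cuda:0")
o = wrapper.forward(q, kv_data, causal=True)
wrapper.end_forward()
```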