Closed QwertyJack closed 1 month ago
I am trying to use SGLang to serve Qwen1.5-32B-Chat, but it complains:

```
...
  File "/home/jack/.conda/envs/sglang/lib/python3.11/site-packages/sglang/srt/layers/radix_attention.py", line 92, in prefill_forward_flashinfer
    o = input_metadata.prefill_wrapper.forward(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/.conda/envs/sglang/lib/python3.11/site-packages/flashinfer/prefill.py", line 498, in forward
    return self._wrapper.forward(
           ^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: BatchPrefillWithPagedKVCachePyTorchWrapper::Forward(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, bool, unsigned int, bool, float, float, float, bool)::<lambda()>::<lambda()>::<lambda()>::<lambda()>::<lambda()> failed to dispatch group_size 5
```
Env: RTX 3090, CUDA 12.3, Python 3.11, torch 2.3.0, SGLang 0.1.16, flashinfer-0.0.4+cu121torch2.3-cp311-cp311-linux_x86_64.whl
Btw, Qwen1.5-14B-Chat works like a charm ;)
Any chance that we can get Qwen1.5-32B supported?
Thanks in advance!
@QwertyJack Thank you for the feedback.
With https://github.com/flashinfer-ai/flashinfer/pull/301 merged, flashinfer's prefill kernels now support arbitrary group sizes, and its decode kernels support group sizes 1-8.
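For context, the `group_size` in the error is the GQA ratio of query heads to key/value heads. A minimal sketch of how it is derived, assuming the head counts below (taken as an assumption from the models' published configs, not from this thread):

```python
# Sketch: how the "group_size 5" in the RuntimeError arises.
# Head counts are assumptions based on the models' HF configs.
def gqa_group_size(num_attention_heads: int, num_key_value_heads: int) -> int:
    """Ratio of query heads to shared key/value heads (GQA group size)."""
    assert num_attention_heads % num_key_value_heads == 0
    return num_attention_heads // num_key_value_heads

# Qwen1.5-32B-Chat: 40 query heads sharing 8 KV heads -> group size 5,
# which flashinfer kernels at the time could not dispatch.
print(gqa_group_size(40, 8))   # 5

# Qwen1.5-14B-Chat: plain MHA (40 query heads, 40 KV heads) -> group size 1,
# which is why the 14B model works fine.
print(gqa_group_size(40, 40))  # 1
```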