Closed Qubitium closed 2 months ago
This is related to #35. Yi has a GQA group size (num_qo_heads/num_kv_heads) of 7, which is not compiled in our kernels. I'm refactoring the code so that we don't need a specialized kernel for each group size, and the issue will be resolved then.
Sorry about the confusing error message: it's a dispatching issue, but it is not related to the data type.
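To make the dispatch failure concrete, here is a minimal Python sketch of the group-size check described above. It is not flashinfer's actual dispatch code: `COMPILED_GROUP_SIZES`, `gqa_group_size`, and `can_dispatch` are hypothetical names, the set of compiled sizes is an assumed example, and the head counts (56 query heads and 8 KV heads, chosen to match the group size of 7 mentioned above) are assumptions that should be checked against the model config.

```python
# Hypothetical set of GQA group sizes that have a precompiled kernel.
# This is an illustrative assumption, not the list flashinfer actually ships.
COMPILED_GROUP_SIZES = (1, 4, 8)


def gqa_group_size(num_qo_heads: int, num_kv_heads: int) -> int:
    """Return num_qo_heads / num_kv_heads, the number of query heads per KV head."""
    if num_kv_heads <= 0 or num_qo_heads % num_kv_heads != 0:
        raise ValueError(
            f"num_qo_heads={num_qo_heads} must be a positive multiple of "
            f"num_kv_heads={num_kv_heads}"
        )
    return num_qo_heads // num_kv_heads


def can_dispatch(num_qo_heads: int, num_kv_heads: int,
                 compiled=COMPILED_GROUP_SIZES) -> bool:
    """True if a kernel specialized for this group size is assumed to exist."""
    return gqa_group_size(num_qo_heads, num_kv_heads) in compiled


if __name__ == "__main__":
    # Assumed head counts; 56 / 8 gives the group size of 7 reported above.
    print(gqa_group_size(56, 8), can_dispatch(56, 8))
    # An assumed configuration with group size 8, which falls in the example set.
    print(gqa_group_size(32, 4), can_dispatch(32, 4))
```

Under these assumptions, a group size of 7 falls outside the compiled set, so dispatch fails regardless of the data type, which is why the `dtype Half` wording in the error is misleading.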
We have tested sglang with flashinfer 0.0.2 and flashinfer 0.0.3-dev (https://github.com/flashinfer-ai/flashinfer/commit/238563fb8fa5f3e5906bb951c3ee84659ed9265a), and both crash inside flashinfer with the following stack trace on an A100.
Model: Yi-34B, OS: Ubuntu 22.04, GPU: A100 80GB
Yi-6B and Yi-9B have no such issue. Yi is a llama2-based architecture, if I am not mistaken.
@yzh119 Since the stack trace is vague to me (`BatchPrefillWithPagedKVCache failed to dispatch with dtype Half`), I am first reporting the bug here. If you think this is sglang related, I will move the bug to sglang. Thanks!