-
# Summary
We recently landed support for grouped query attention via the `enable_gqa` flag on SDPA; however, this is only enabled on the flash attention backend. This leads to a weird situation where it c…
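For reference, a minimal sketch of what the `enable_gqa` call looks like (assuming PyTorch ≥ 2.5; the batch size, head counts, and sequence length are illustrative, not from the issue):

```python
import torch
import torch.nn.functional as F

# 8 query heads share 2 key/value heads: (B, H, S, D) layouts.
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 2, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 2, 128, 64, device="cuda", dtype=torch.float16)

# enable_gqa broadcasts the 2 KV heads across the 8 query heads;
# without it the mismatched head counts are rejected.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
```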
-
For Llama3-70B with TP8, we have 8 q-heads and 1 k-head.
With 4000 shared prefix tokens and batch size 8, cascade decoding is much slower than the baseline (26 us vs 19 us). But if we set k-heads to 8…
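For context, the per-rank head counts above follow from sharding the Llama-3-70B attention configuration (64 query heads, 8 KV heads) across 8 tensor-parallel ranks; a quick sketch of the arithmetic:

```python
# Back-of-the-envelope per-rank head counts (config numbers from Llama-3-70B).
num_q_heads, num_kv_heads, tp = 64, 8, 8
q_heads_per_rank = num_q_heads // tp    # 8 query heads per GPU
kv_heads_per_rank = num_kv_heads // tp  # 1 KV head per GPU
print(q_heads_per_rank, kv_heads_per_rank)  # -> 8 1
```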
-
Thank you for your solid work. I would like to ask if the current version is suitable for GQA architecture models, such as LLaMA-2-70B and LLaMA-3.
-
### Describe the issue
Hi,
I'm coming from https://github.com/vllm-project/vllm/issues/6701.
I am wondering when IPEX 2.3.110 will be released.
-
This issue occurs in the Llama 2 FP16 and INT4 weight models, as well as in a trimmed model that returns after the first GQA node.
-
When I try to train on the GQA_200 dataset using the following command, I get the error `AttributeError: module 'pysgg.data.datasets' has no attribute 'GQADataset'`, and I can't find any file about GQADa…
-
### 📚 The doc issue
I don't think it's possible to get the structure of the dataset as depicted in the diagram below.
### Suggest a potential alternative/fix
I don't k…
-
Great job!
We found that Quest is implemented on a previous version of flashinfer, and some common features are not currently supported:
* bsz > 1
* GQA
* CUDA graph
Is there any plan to update t…
-
### 🐛 Describe the bug
Hi AMD Team,
On MI300X with PyTorch nightly, grouped query attention runs into numeric errors. I have confirmed that this script does not produce numeric errors on H100.
C…
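The script itself is truncated above; as a rough sketch of the kind of numeric check involved (not the original script — shapes, dtype, and the repeat-based reference are assumptions), one can compare SDPA's GQA path against a naive expanded-KV reference:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(2, 8, 256, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 2, 256, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 2, 256, 64, device="cuda", dtype=torch.bfloat16)

# GQA path under test.
out_gqa = F.scaled_dot_product_attention(q, k, v, enable_gqa=True)

# Naive reference: expand each KV head to match the 8 query heads, run plain SDPA.
k_rep = k.repeat_interleave(4, dim=1)
v_rep = v.repeat_interleave(4, dim=1)
out_ref = F.scaled_dot_product_attention(q, k_rep, v_rep)

# Should be near zero on a healthy backend.
print((out_gqa - out_ref).abs().max())
```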
-
Hi! I'm trying to replicate your implementation with Llama 2 13B and 7B, but curiously the runtimes didn't make sense (Llama 2 with GQA is slower than Llama 2 WITHOUT GQA). There is a little difference between my code …
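As a rough sketch of how one might time the two cases in isolation (assumed setup, not the author's code; shapes and head counts are illustrative), comparing SDPA with grouped KV heads against full multi-head attention at the same query shape:

```python
import torch
import torch.nn.functional as F

def time_sdpa(n_kv_heads, iters=50):
    # Fixed query shape; only the number of KV heads changes.
    q = torch.randn(4, 32, 1024, 128, device="cuda", dtype=torch.float16)
    k = torch.randn(4, n_kv_heads, 1024, 128, device="cuda", dtype=torch.float16)
    v = torch.randn(4, n_kv_heads, 1024, 128, device="cuda", dtype=torch.float16)
    gqa = n_kv_heads != 32
    for _ in range(10):  # warm-up
        F.scaled_dot_product_attention(q, k, v, enable_gqa=gqa)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v, enable_gqa=gqa)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

print("MHA (32 KV heads):", time_sdpa(32), "ms")
print("GQA (8 KV heads): ", time_sdpa(8), "ms")
```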