Open obhalerao97 opened 8 months ago

Hello, have you checked what happens when n_heads != n_kv_heads? How does this affect the RoPE rotation and the multi-head attention, which now becomes GQA?

Yes, the code should work when n_heads != n_kv_heads. I have tested it with the codellama-34b model, which uses GQA.

Sorry, please ignore the previous comment; I mistook this for a different repo. I am no longer maintaining this repo. All new development is happening at https://github.com/ankan-ban/llama_cu_awq, which supports GQA.

Okay, thank you! Is there a version of the code that can process a batch of strings?
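To clarify the GQA question raised in this thread: in grouped-query attention, RoPE is still applied per-head to both Q and K (K simply has n_kv_heads heads instead of n_heads), and the only change to attention is that each group of n_heads / n_kv_heads query heads shares one KV head. A minimal sketch of that head mapping, assuming a CodeLlama-34B-style config of 64 query heads and 8 KV heads (the function name is illustrative, not from the repo):

```python
def kv_head_for_query_head(q_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Map a query head index to the KV head it shares under GQA.

    With n_heads == n_kv_heads this degenerates to plain MHA
    (every query head gets its own KV head).
    """
    assert n_heads % n_kv_heads == 0, "n_heads must be a multiple of n_kv_heads"
    group_size = n_heads // n_kv_heads  # query heads per KV head
    return q_head // group_size

# Example: 64 query heads grouped onto 8 KV heads.
# Query heads 0..7 use KV head 0, heads 8..15 use KV head 1, and so on.
print([kv_head_for_query_head(h, 64, 8) for h in range(0, 64, 8)])
# → [0, 1, 2, 3, 4, 5, 6, 7]
```

So the attention kernel only needs this index change when gathering K and V; the RoPE rotation code is unaffected apart from iterating over n_kv_heads heads for K.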