ankan-ban / llama2.cu

Inference Llama 2 in one file of pure Cuda
MIT License

Probable case not considered #6

Open · obhalerao97 opened this issue 8 months ago

obhalerao97 commented 8 months ago

Hello, have you checked what happens when n_heads != n_kv_heads? How does this affect the RoPE rotation and the MHA, which in that case becomes GQA?
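For context, a minimal sketch of the indexing GQA implies, assuming the usual grouping where each block of n_heads / n_kv_heads consecutive query heads shares one KV head (the kernel name and layout here are hypothetical, not code from this repo). RoPE itself is unchanged per head; it is simply applied to n_heads query heads but only n_kv_heads key heads. With n_heads == n_kv_heads this reduces to plain MHA:

```cuda
// Hypothetical sketch, not code from this repo: how GQA maps query heads
// onto a smaller set of shared KV heads when computing attention scores.
__global__ void gqa_score_sketch(const float* q,        // [n_heads, head_size]
                                 const float* k_cache,  // [n_kv_heads, head_size], one timestep
                                 float* scores,         // [n_heads]
                                 int n_heads, int n_kv_heads, int head_size)
{
    int h = blockIdx.x;                      // one block per query head
    if (h >= n_heads) return;
    int kv_mul  = n_heads / n_kv_heads;      // query heads per KV head
    int kv_head = h / kv_mul;                // the shared KV head for head h
    const float* qh = q + h * head_size;
    const float* kh = k_cache + kv_head * head_size;
    if (threadIdx.x == 0) {                  // single thread, for clarity only
        float s = 0.f;
        for (int i = 0; i < head_size; i++)
            s += qh[i] * kh[i];
        scores[h] = s * rsqrtf((float)head_size);  // scaled dot product
    }
}
```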

ankan-ban commented 8 months ago

Yes, the code should work when n_heads != n_kv_heads. I have tested it with the codellama-34b model, which uses GQA.

ankan-ban commented 8 months ago

Sorry, please ignore the previous comment. I mistook this for a different repo. I am no longer maintaining this repo. All new development is happening at https://github.com/ankan-ban/llama_cu_awq, which supports GQA.

obhalerao97 commented 8 months ago

Okay, thank you! Is there a version of the code that can process a batch of strings?
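The batching question is left open in the thread. Purely as an illustration of what it would involve (names and memory layout are hypothetical, not taken from this repo or llama_cu_awq), one common approach is to give the activations and the KV cache a leading batch dimension and let the launch grid cover one (sequence, query head) pair per block:

```cuda
// Hypothetical sketch, not code from this repo: batched attention scores
// with per-sequence KV caches and GQA head sharing.
__global__ void batched_scores_sketch(
    const float* q,       // [batch, n_heads, head_size]
    const float* k_cache, // [batch, max_seq, n_kv_heads, head_size]
    float* scores,        // [batch, n_heads, max_seq]
    const int* seq_lens,  // current length of each sequence
    int n_heads, int n_kv_heads, int head_size, int max_seq)
{
    int b = blockIdx.y;                            // which sequence
    int h = blockIdx.x;                            // which query head
    int kv_head = h / (n_heads / n_kv_heads);      // shared KV head (GQA)
    const float* qh = q + ((size_t)b * n_heads + h) * head_size;
    // each thread strides over this sequence's cached timesteps
    for (int t = threadIdx.x; t < seq_lens[b]; t += blockDim.x) {
        const float* kt = k_cache +
            (((size_t)b * max_seq + t) * n_kv_heads + kv_head) * head_size;
        float s = 0.f;
        for (int i = 0; i < head_size; i++)
            s += qh[i] * kt[i];
        scores[((size_t)b * n_heads + h) * max_seq + t] =
            s * rsqrtf((float)head_size);
    }
}
```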