b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on an AI cluster at home using any device. Distribute the workload, divide RAM usage, and increase inference speed.

Support for Gemma 2? #115

Open sdmorrey opened 3 months ago

sdmorrey commented 3 months ago

What would be required to support Gemma 2? I'd be happy to chip in and help with the code; I just need a bit of insight into what would need to be changed.

b4rtaz commented 3 months ago

Hello @sdmorrey,

You should check the llama2-tasks.cpp and grok1-tasks.cpp files. DL builds a different task list for each architecture. Tasks are reused where possible; in grok1-tasks.cpp you can see the implementation of the tasks that differ from the ones the Llama model uses.
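To illustrate the idea, here is a minimal sketch of a per-architecture task list. All names here (`Ctx`, `TaskFn`, `llamaLayerTasks`, and the task stubs) are hypothetical; the real task structure in llama2-tasks.cpp is richer and carries weights, buffers, and worker synchronization:

```cpp
// Hypothetical sketch: an architecture is an ordered list of task
// functions executed for each transformer layer. The real code in
// llama2-tasks.cpp / grok1-tasks.cpp follows this general shape.
#include <cstdio>
#include <vector>

struct Ctx { int layer; };              // stand-in for DL's runtime state
using TaskFn = void (*)(Ctx&);

// Stub tasks; in DL these would run (distributed) tensor ops.
void rmsNorm(Ctx& c)   { std::printf("layer %d: rmsnorm\n", c.layer); }
void attention(Ctx& c) { std::printf("layer %d: attention\n", c.layer); }
void ffn(Ctx& c)       { std::printf("layer %d: ffn\n", c.layer); }

// Llama-style layer: pre-attention norm -> attention -> pre-FFN norm -> FFN
std::vector<TaskFn> llamaLayerTasks() {
    return { rmsNorm, attention, rmsNorm, ffn };
}

int main() {
    Ctx ctx{0};
    for (TaskFn t : llamaLayerTasks()) t(ctx);
    return 0;
}
```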

I see Gemma 2 has more norm layers. The RoPE layer seems to be implemented already (FalconRopeCommand). The tokenizer (and the converter) is probably what would need more work, but I'm not sure.
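To make the "more norm layers" point concrete: Gemma 2 normalizes both before and after the attention and FFN blocks (pre- and post-norms), so its per-layer task list would gain extra norm steps relative to Llama. Reusing the hypothetical types from the sketch above:

```cpp
// Hypothetical: Gemma 2 sandwiches each block between a pre-norm and a
// post-norm, so its per-layer task list gains two norm steps vs. Llama.
std::vector<TaskFn> gemma2LayerTasks() {
    return {
        rmsNorm,    // pre-attention norm
        attention,
        rmsNorm,    // post-attention norm (extra vs. Llama)
        rmsNorm,    // pre-FFN norm
        ffn,
        rmsNorm,    // post-FFN norm (extra vs. Llama)
    };
}
```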

unclemusclez commented 3 months ago

+1 for Gemma 2