b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.

Support for other models (Ollama models) #68

Open testing0mon21 opened 1 month ago

testing0mon21 commented 1 month ago

@b4rtaz Hey, thank you for your wonderful work. Could you please share some details on how to add support for a new model? For example, how to convert Ollama models like Command R+, StarCoder, or Llama 3 70B to the Distributed Llama format?

https://ollama.com/library/command-r-plus https://ollama.com/library/llama3:70b https://ollama.com/library/starcoder2

b4rtaz commented 1 month ago

Hello @testing0mon21,

From your list, only Llama 3 is supported at the moment.

To convert Llama 3 you have two options: you can take the original Meta weight files and convert them with the convert-llama.py script (here is the tutorial), or you can download the .safetensors weights from Hugging Face and convert them with convert-hf.py.
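For intuition, here is a minimal sketch of the first step any safetensors-based conversion has to do: open the Hugging Face shard and walk its tensors. This is not the actual convert-hf.py code; the shard file name is a placeholder, and the comment about the output format is an assumption about what the converter does afterwards.

```python
# Illustrative only: inspect the tensors in a Hugging Face .safetensors shard.
# Not the actual convert-hf.py logic; the file name below is hypothetical.
from safetensors import safe_open

shard_path = "model-00001-of-00004.safetensors"  # hypothetical shard name

with safe_open(shard_path, framework="pt") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        # A converter like convert-hf.py would map tensors such as these into
        # Distributed Llama's own binary layout (possibly quantizing them).
        print(name, tuple(tensor.shape), tensor.dtype)
```

Listing the tensor names and shapes like this is also a quick way to check whether a given checkpoint follows the Llama-style layout the converter expects.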

testing0mon21 commented 1 month ago

Did I understand correctly that for other architectures it would be difficult to implement the same thing you implemented for Llama? @b4rtaz

b4rtaz commented 1 month ago

I think this depends on the specific architecture. Some architectures will be easy, some will not. Adding a new architecture is always non-zero effort. Currently DL supports: llama, mixtral, and grok1.
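Part of that effort is simply that a converter must recognize each architecture and know its tensor naming and layout. The sketch below only illustrates that idea: it reads the "model_type" field that Hugging Face checkpoints carry in config.json and rejects anything unknown. The SUPPORTED set just mirrors the list in the comment above; none of these names or functions come from the distributed-llama codebase.

```python
# Hypothetical sketch: gate conversion on the architecture declared in a
# Hugging Face config.json. Not taken from distributed-llama.
import json

SUPPORTED = {"llama", "mixtral", "grok-1"}  # placeholder names from the comment above

def check_architecture(config_path: str) -> str:
    """Read config.json and return the model type, or raise if unsupported."""
    with open(config_path) as f:
        model_type = json.load(f).get("model_type", "unknown")
    if model_type not in SUPPORTED:
        raise ValueError(f"Architecture '{model_type}' is not supported yet")
    return model_type
```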