Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
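As a rough illustration of the idea, here is a minimal, self-contained C++ sketch (my own illustration, not the project's actual code) of a row-wise tensor-parallel matrix-vector multiply: each worker stores only its slice of the weights, so per-device RAM drops roughly by the worker count, and the partial outputs are concatenated at the end.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical sketch: split a (rows x cols) weight matrix across
// `workerCount` workers by rows. Each worker keeps only its slice,
// so per-device RAM for the weights shrinks by ~workerCount.
std::vector<float> matmulSlice(const std::vector<float>& weightSlice,
                               const std::vector<float>& input,
                               int sliceRows, int cols) {
    std::vector<float> out(sliceRows, 0.0f);
    for (int r = 0; r < sliceRows; r++)
        for (int c = 0; c < cols; c++)
            out[r] += weightSlice[r * cols + c] * input[c];
    return out;
}

int main() {
    const int rows = 8, cols = 4, workerCount = 2;
    const int sliceRows = rows / workerCount;
    std::vector<float> input(cols, 1.0f);

    // Each "worker" holds only rows/workerCount rows of the matrix;
    // here the loop stands in for devices running in parallel.
    std::vector<float> output;
    for (int w = 0; w < workerCount; w++) {
        std::vector<float> slice(sliceRows * cols, (float)(w + 1)); // dummy weights
        std::vector<float> part = matmulSlice(slice, input, sliceRows, cols);
        output.insert(output.end(), part.begin(), part.end()); // concat partial results
    }
    printf("output size: %zu\n", output.size()); // == rows
    return 0;
}
```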
This branch contains experimental adjustments to support Grok-1. These adjustments break support for the Llama 2 model, so before this PR can be merged I need to make more changes to support both models at the same time.
How to run Grok-1?
Clone this repository (grok-1 branch):
git clone https://github.com/b4rtaz/distributed-llama.git
Build the project:
make main
Merge the Grok-1 Q40 model parts into a single file:
cat dllama-grok-1-q40.binaa dllama-grok-1-q40.binab dllama-grok-1-q40.binac dllama-grok-1-q40.binad dllama-grok-1-q40.binae dllama-grok-1-q40.binaf dllama-grok-1-q40.binag dllama-grok-1-q40.binah dllama-grok-1-q40.binai > dllama-grok-1-q40-final.bin
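The merge only works if every part is present and complete. A hypothetical C++17 check (not part of Distributed Llama; the file names match the command above) that compares the merged size against the sum of the parts:

```cpp
#include <cstdio>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical check: verify that the merged file is exactly as large as
// the sum of its parts, which catches a missing or truncated part before
// a long model load fails.
int main() {
    std::vector<std::string> parts;
    for (char suffix = 'a'; suffix <= 'i'; suffix++) // .binaa .. .binai
        parts.push_back(std::string("dllama-grok-1-q40.bina") + suffix);

    uintmax_t partsTotal = 0;
    for (const auto& p : parts) {
        if (!fs::exists(p)) { printf("missing part: %s\n", p.c_str()); return 1; }
        partsTotal += fs::file_size(p);
    }
    uintmax_t merged = fs::file_size("dllama-grok-1-q40-final.bin");
    printf("parts: %ju bytes, merged: %ju bytes, %s\n",
           partsTotal, merged, partsTotal == merged ? "OK" : "MISMATCH");
    return merged == partsTotal ? 0 : 1;
}
```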
Start a worker node on each worker device:
./main worker --port 9999 --nthreads 8
Start the inference on the root device, pointing --workers at the running workers:
./main inference --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 128 --nthreads 8 --tokenizer tokenizers/grok-1-tokenizer.t --model dllama-grok-1-q40-final.bin --workers 10.0.0.1:9999
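The address and port in --workers must match the --port each worker was started with. A rough C++ sketch of just that pairing on the root side, assuming plain TCP on POSIX (the real Distributed Llama protocol is more involved than this):

```cpp
#include <arpa/inet.h>
#include <cstdio>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical illustration of the root side connecting to a worker that
// was started with `./main worker --port 9999`.
int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9999);                    // worker's --port
    inet_pton(AF_INET, "10.0.0.1", &addr.sin_addr); // worker's address from --workers

    if (connect(fd, (sockaddr*)&addr, sizeof(addr)) != 0) {
        perror("connect"); // worker not reachable: check the address, port, firewall
        return 1;
    }
    printf("connected to worker 10.0.0.1:9999\n");
    close(fd);
    return 0;
}
```

If the root cannot reach a worker, check that the port is open between the devices before suspecting the model files.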
Test

I successfully started the inference of Grok-1 on 4 x 16 vCPUs, 64 GB RAM (4 x Google Cloud n2d-standard-16) and achieved 1.8 tokens/second. 🎉

I also ran it on 2 x c3d-highcpu-90 (90 vCPUs, 45 cores, 177 GB memory); only 2 x 64 cores were used during the test due to Distributed Llama limitations. Achieved 4.3 tokens/second. 🎉