b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.

grok-1 support. #18

Closed by b4rtaz 2 months ago

b4rtaz commented 3 months ago

This branch contains experimental adjustments to support Grok-1. Because of these adjustments, this version no longer supports the Llama 2 model, so before merging this PR I need to make more changes to support both models at the same time.

How to run Grok-1?

  1. Clone this repository (grok-1 branch): git clone -b grok-1 https://github.com/b4rtaz/distributed-llama.git
  2. Build Distributed Llama: make main
  3. Download quantized (Q40) weights from https://huggingface.co/b4rtaz/grok-1-dllama (180GB).
  4. Merge split models files: cat dllama-grok-1-q40.binaa dllama-grok-1-q40.binab dllama-grok-1-q40.binac dllama-grok-1-q40.binad dllama-grok-1-q40.binae dllama-grok-1-q40.binaf dllama-grok-1-q40.binag dllama-grok-1-q40.binah dllama-grok-1-q40.binai > dllama-grok-1-q40-final.bin
  5. Run a worker on each worker node: ./main worker --port 9999 --nthreads 8
  6. Run the root node: ./main inference --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 128 --nthreads 8 --tokenizer tokenizers/grok-1-tokenizer.t --model dllama-grok-1-q40-final.bin --workers 10.0.0.1:9999 (the full flow is sketched as a script below)
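
For convenience, here is the same flow as a minimal shell-script sketch for the root node. It assumes the split-file names from step 4, that the downloaded files sit in the working directory, and a single worker at 10.0.0.1:9999 (all taken from the example above); adjust hosts, paths, and thread counts for your setup.

```sh
set -e

# Steps 1-2: clone the grok-1 branch and build the main binary
git clone -b grok-1 https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make main

# Step 3: download the Q40 split files (~180 GB) from
# https://huggingface.co/b4rtaz/grok-1-dllama into this directory first.

# Step 4: merge the split files; the glob expands alphabetically,
# which matches the .binaa ... .binai order from the example above
cat dllama-grok-1-q40.bina? > dllama-grok-1-q40-final.bin

# Step 5: on every worker machine (not here), start a worker first:
#   ./main worker --port 9999 --nthreads 8

# Step 6: run inference from the root node
./main inference \
  --weights-float-type q40 --buffer-float-type q80 \
  --prompt "Hello world" --steps 128 --nthreads 8 \
  --tokenizer tokenizers/grok-1-tokenizer.t \
  --model dllama-grok-1-q40-final.bin \
  --workers 10.0.0.1:9999
```

With more workers you would list additional host:port pairs for --workers (see the repository README for the exact syntax).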

Test

I successfully ran Grok-1 inference on 4 x Google Cloud n2d-standard-16 instances (16 vCPUs, 64 GB RAM each). Achieved 1.8 tokens/second. 🎉

[Screenshot: inference output, 2024-04-04 23:52]
b4rtaz commented 3 months ago

Test

2 x c3d-highcpu-90 (90 vCPUs, 45 cores, 177 GB memory each). Due to Distributed Llama limitations, only 2 x 64 cores were used during the test. Achieved 4.3 tokens/second. 🎉

[Screenshot: inference output, 2024-04-05 21:57]