b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.

grok-1 support. #18

Closed by b4rtaz 2 months ago

b4rtaz commented 3 months ago

This branch contains experimental adjustments to support Grok-1. Because of these adjustments, this version no longer supports the Llama 2 model, so before merging this PR I need to make more changes to support both models at the same time.

How to run Grok-1?

  1. Clone this repository (grok-1 branch): git clone -b grok-1 https://github.com/b4rtaz/distributed-llama.git
  2. Build Distributed Llama: make main
  3. Download quantized (Q40) weights from https://huggingface.co/b4rtaz/grok-1-dllama (180GB).
  4. Merge split models files: cat dllama-grok-1-q40.binaa dllama-grok-1-q40.binab dllama-grok-1-q40.binac dllama-grok-1-q40.binad dllama-grok-1-q40.binae dllama-grok-1-q40.binaf dllama-grok-1-q40.binag dllama-grok-1-q40.binah dllama-grok-1-q40.binai > dllama-grok-1-q40-final.bin
  5. Run a worker on each worker node: ./main worker --port 9999 --nthreads 8
  6. Run the root node: ./main inference --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 128 --nthreads 8 --tokenizer tokenizers/grok-1-tokenizer.t --model dllama-grok-1-q40-final.bin --workers 10.0.0.1:9999 (the full flow is sketched as a script below)
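
For convenience, here is the same flow as a minimal shell-script sketch for the root node. It assumes the split-file names from step 4, that the downloaded files sit in the working directory, and a single worker at 10.0.0.1:9999 (all taken from the example above); adjust hosts, paths, and thread counts for your setup.

```sh
set -e

# Steps 1-2: clone the grok-1 branch and build the main binary
git clone -b grok-1 https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make main

# Step 3: download the Q40 split files (~180 GB) from
# https://huggingface.co/b4rtaz/grok-1-dllama into this directory first.

# Step 4: merge the split files; the glob expands alphabetically,
# which matches the .binaa ... .binai order from the example above
cat dllama-grok-1-q40.bina? > dllama-grok-1-q40-final.bin

# Step 5: on every worker machine (not here), start a worker first:
#   ./main worker --port 9999 --nthreads 8

# Step 6: run inference from the root node
./main inference \
  --weights-float-type q40 --buffer-float-type q80 \
  --prompt "Hello world" --steps 128 --nthreads 8 \
  --tokenizer tokenizers/grok-1-tokenizer.t \
  --model dllama-grok-1-q40-final.bin \
  --workers 10.0.0.1:9999
```

With more workers you would list additional host:port pairs for --workers (see the repository README for the exact syntax).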

Test

I successfully ran Grok-1 inference on 4 x Google Cloud n2d-standard-16 instances (16 vCPUs, 64 GB RAM each). Achieved 1.8 tokens/second. 🎉

[Screenshot: inference output, 2024-04-04 23:52]
b4rtaz commented 3 months ago

Test

2 x c3d-highcpu-90 (90 vCPUs, 45 cores, 177 GB memory each). Due to Distributed Llama limitations, only 2 x 64 cores were used during the test. Achieved 4.3 tokens/second. 🎉

[Screenshot: inference output, 2024-04-05 21:57]