b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.

feat: splitting multihead attention into all nodes. #46

Closed · b4rtaz closed this 1 month ago

b4rtaz commented 1 month ago

Test

Model: Llama 3 8B Q40
Buffer: Q80
Setup: 4 x Raspberry Pi 5 8GB + TP-Link LS1008G switch

Transfer size / token

| Devices | 0.3.0 | This PR | Percentage change |
|---|---|---|---|
| 2 x Raspberry Pi 5 | S 646 kB + R 476 kB = 1122 kB | S 578 kB + R 442 kB = 1020 kB | -9.09% |
| 4 x Raspberry Pi 5 | S 2295 kB + R 714 kB = 3009 kB | S 2193 kB + R 663 kB = 2856 kB | -5.08% |
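(For orientation, the percentage change follows from the per-token totals, assuming S and R denote the kilobytes sent and received by the root node: for 2 devices, (1020 - 1122) / 1122 ≈ -9.09%; for 4 devices, (2856 - 3009) / 3009 ≈ -5.08%.)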

Avg tokens / second

| Devices | Metric | 0.3.0 | This PR | Percentage change |
|---|---|---|---|---|
| 2 x Raspberry Pi 5 | Avg generation time | 444.27 ms | 381.81 ms | |
| 2 x Raspberry Pi 5 | Avg inference time | 362.73 ms | 349.94 ms | -3.53% |
| 2 x Raspberry Pi 5 | Avg transfer time | 80.11 ms | 30.31 ms* | |
| 4 x Raspberry Pi 5 | Avg generation time | 331.47 ms | 359.44 ms | |
| 4 x Raspberry Pi 5 | Avg inference time | 267.62 ms | 258.00 ms | -3.59% |
| 4 x Raspberry Pi 5 | Avg transfer time | 62.34 ms | 99.69 ms | |

* I think the switch is completely non-deterministic: it achieves a different speed at different times, so I recommend comparing only the avg inference time.
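(As a rough sanity check, the avg generation time is approximately the avg inference time plus the avg transfer time: for 2 devices, 362.73 ms + 80.11 ms ≈ 442.84 ms vs. the measured 444.27 ms on 0.3.0, and 349.94 ms + 30.31 ms ≈ 380.25 ms vs. 381.81 ms on this PR. That is why the inference time is the steadier basis for comparison when the switch adds noisy transfer delays.)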

b4rtaz commented 1 month ago

To merge this PR I need to fix the Mixtral & Grok architectures.

b4rtaz commented 1 month ago

I changed the implementation a bit; now there is no synchronization between `llamaQuantizeMultiheadAtt` and `llamaAtt`.

Transfer size / token

| Devices | 0.3.0 | This PR v2 | Percentage change |
|---|---|---|---|
| 2 devices | S 646 kB + R 476 kB = 1122 kB | S 510 kB + R 442 kB = 952 kB | -15.15% |
| 4 devices | S 2295 kB + R 714 kB = 3009 kB | S 1887 kB + R 867 kB = 2754 kB | -8.47% |
| 8 devices | S 5771 kB + R 833 kB = 6604 kB | S 4819 kB + R 1487 kB = 6306 kB | -4.51% |

The final state of the attention synchronization looks like this for a single block:

```
root --- xb  ---> node
root <-- xbv ---- node
merge att
```

The previous implementation:

```
root --- xb  --> node
root <-- q  ---- node
root <-- k  ---- node
root <-- v  ---- node
root --- xb ---> node
root <-- xb2 --- node
merge att
```
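
To make the difference concrete, here is a minimal, compilable C++ sketch of the two per-block synchronization patterns above. The helper names (`sendToWorkers`, `receiveFromWorkers`, `mergeAttention`) are illustrative stand-ins, not the project's real task functions; they only log the transfer each step represents.

```cpp
// Sketch only: hypothetical stand-ins for the root <-> worker transfers,
// not the real distributed-llama API.
#include <cstdio>

struct Buffer { const char* name; };

// No-op stubs that just print the transfer direction they represent.
static void sendToWorkers(const Buffer& b)      { std::printf("root --- %s ---> node\n", b.name); }
static void receiveFromWorkers(const Buffer& b) { std::printf("root <-- %s ---- node\n", b.name); }
static void mergeAttention()                    { std::printf("merge att\n"); }

// Previous implementation: six transfers per block before the merge.
static void attBlockOld() {
    Buffer xb{"xb"}, q{"q"}, k{"k"}, v{"v"}, xb2{"xb2"};
    sendToWorkers(xb);
    receiveFromWorkers(q);
    receiveFromWorkers(k);
    receiveFromWorkers(v);
    sendToWorkers(xb);
    receiveFromWorkers(xb2);
    mergeAttention();
}

// This PR: every node computes its slice of the attention heads locally,
// so only xb goes out and only the attention output slice (xbv) comes back.
static void attBlockNew() {
    Buffer xb{"xb"}, xbv{"xbv"};
    sendToWorkers(xb);
    receiveFromWorkers(xbv);
    mergeAttention();
}

int main() {
    std::printf("-- previous --\n"); attBlockOld();
    std::printf("-- this PR  --\n"); attBlockNew();
    return 0;
}
```

Going from six transfers per block down to two is where the per-token transfer savings in the tables above come from.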
DifferentialityDevelopment commented 1 month ago

Not sure why, but I pulled the latest code and now it won't generate any tokens; it gets stuck here: `float* logits = inference->infer(token, pos);`

I thought it might be the changes I was working on, since I was cleaning up `server.cpp`, but then I tried it on main and I get the same behavior.

```
sudo nice -n -20 ./main inference --steps 10 --prompt "Hello World!" --model ~/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer ~/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --workers 192.168.1.3:9990
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 2
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
🕒 ropeCache: 16384 kB
⏩ Loaded 6175568 kB
```

Then nothing happens: CPU usage goes up to around 70%, but no tokens are generated. Any idea what might be happening?

b4rtaz commented 1 month ago

@DifferentialityDevelopment have you pulled up to this commit? I accidentally disabled memory allocation.

DifferentialityDevelopment commented 1 month ago

No, I think it might have been my bad; I just realized I forgot to rebuild the worker with the latest code.