b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.

feat: splitting multihead attention into all nodes. #46

Closed · b4rtaz closed this 1 month ago

b4rtaz commented 1 month ago

Test

Model: Llama 3 8B Q40
Buffer: Q80
Setup: 4 x Raspberry Pi 5 8GB + TP-Link LS1008G switch

Transfer size / token

| Devices | 0.3.0 | This PR | Percentage change |
|---|---|---|---|
| 2 x Raspberry Pi 5 | S 646 kB + R 476 kB = 1122 kB | S 578 kB + R 442 kB = 1020 kB | -9.09% |
| 4 x Raspberry Pi 5 | S 2295 kB + R 714 kB = 3009 kB | S 2193 kB + R 663 kB = 2856 kB | -5.08% |
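(For orientation, the percentage change follows from the per-token totals, assuming S and R denote the kilobytes sent and received by the root node: for 2 devices, (1020 - 1122) / 1122 ≈ -9.09%; for 4 devices, (2856 - 3009) / 3009 ≈ -5.08%.)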

Avg tokens / second

| Devices | Metric | 0.3.0 | This PR | Percentage change |
|---|---|---|---|---|
| 2 x Raspberry Pi 5 | Avg generation time | 444.27 ms | 381.81 ms | |
| 2 x Raspberry Pi 5 | Avg inference time | 362.73 ms | 349.94 ms | -3.53% |
| 2 x Raspberry Pi 5 | Avg transfer time | 80.11 ms | 30.31 ms* | |
| 4 x Raspberry Pi 5 | Avg generation time | 331.47 ms | 359.44 ms | |
| 4 x Raspberry Pi 5 | Avg inference time | 267.62 ms | 258.00 ms | -3.59% |
| 4 x Raspberry Pi 5 | Avg transfer time | 62.34 ms | 99.69 ms | |

* I think the switch is completely non-deterministic: it achieves a different speed at different times, so I recommend comparing only the avg inference time.
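(As a rough sanity check, the avg generation time is approximately the avg inference time plus the avg transfer time: for 2 devices, 362.73 ms + 80.11 ms ≈ 442.84 ms vs. the measured 444.27 ms on 0.3.0, and 349.94 ms + 30.31 ms ≈ 380.25 ms vs. 381.81 ms on this PR. That is why the inference time is the steadier basis for comparison when the switch adds noisy transfer delays.)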

b4rtaz commented 1 month ago

To merge this PR I need to fix the Mixtral & Grok architectures.

b4rtaz commented 1 month ago

I changed the implementation a bit; now there is no synchronization between `llamaQuantizeMultiheadAtt` and `llamaAtt`.

Transfer size / token

| Devices | 0.3.0 | This PR v2 | Percentage change |
|---|---|---|---|
| 2 devices | S 646 kB + R 476 kB = 1122 kB | S 510 kB + R 442 kB = 952 kB | -15.15% |
| 4 devices | S 2295 kB + R 714 kB = 3009 kB | S 1887 kB + R 867 kB = 2754 kB | -8.47% |
| 8 devices | S 5771 kB + R 833 kB = 6604 kB | S 4819 kB + R 1487 kB = 6306 kB | -4.51% |

The final state of the attention synchronization looks like this for a single block:

```
root --- xb  ---> node
root <-- xbv ---- node
merge att
```

The previous implementation:

```
root --- xb  --> node
root <-- q  ---- node
root <-- k  ---- node
root <-- v  ---- node
root --- xb ---> node
root <-- xb2 --- node
merge att
```
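
To make the difference concrete, here is a minimal, compilable C++ sketch of the two per-block synchronization patterns above. The helper names (`sendToWorkers`, `receiveFromWorkers`, `mergeAttention`) are illustrative stand-ins, not the project's real task functions; they only log the transfer each step represents.

```cpp
// Sketch only: hypothetical stand-ins for the root <-> worker transfers,
// not the real distributed-llama API.
#include <cstdio>

struct Buffer { const char* name; };

// No-op stubs that just print the transfer direction they represent.
static void sendToWorkers(const Buffer& b)      { std::printf("root --- %s ---> node\n", b.name); }
static void receiveFromWorkers(const Buffer& b) { std::printf("root <-- %s ---- node\n", b.name); }
static void mergeAttention()                    { std::printf("merge att\n"); }

// Previous implementation: six transfers per block before the merge.
static void attBlockOld() {
    Buffer xb{"xb"}, q{"q"}, k{"k"}, v{"v"}, xb2{"xb2"};
    sendToWorkers(xb);
    receiveFromWorkers(q);
    receiveFromWorkers(k);
    receiveFromWorkers(v);
    sendToWorkers(xb);
    receiveFromWorkers(xb2);
    mergeAttention();
}

// This PR: every node computes its slice of the attention heads locally,
// so only xb goes out and only the attention output slice (xbv) comes back.
static void attBlockNew() {
    Buffer xb{"xb"}, xbv{"xbv"};
    sendToWorkers(xb);
    receiveFromWorkers(xbv);
    mergeAttention();
}

int main() {
    std::printf("-- previous --\n"); attBlockOld();
    std::printf("-- this PR  --\n"); attBlockNew();
    return 0;
}
```

Going from six transfers per block down to two is where the per-token transfer savings in the tables above come from.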
DifferentialityDevelopment commented 1 month ago

Not sure why, but I pulled the latest code and now it won't generate any tokens; it gets stuck here: `float* logits = inference->infer(token, pos);`

I thought it might be the changes I was working on, since I was cleaning up `server.cpp`, but then I tried it on main and I get the same behavior.

```
sudo nice -n -20 ./main inference --steps 10 --prompt "Hello World!" --model ~/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer ~/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --workers 192.168.1.3:9990
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 2
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
🕒 ropeCache: 16384 kB
⏩ Loaded 6175568 kB
```

Then nothing happens: CPU usage goes up to around 70%, but no tokens are generated. Any idea what might be happening?

b4rtaz commented 1 month ago

@DifferentialityDevelopment have you pulled up to this commit? I accidentally disabled memory allocation.

DifferentialityDevelopment commented 1 month ago

No, I think it might have been my bad; I just realized I forgot to rebuild the worker with the latest code.