Closed · b4rtaz closed this 1 month ago
To merge this PR I need to fix the Mixtral & Grok architectures.

I changed the implementation a bit; now there is no synchronization between `llamaQuantizeMultiheadAtt` and `llamaAtt`.
| Devices | 0.3.0 | This PR v2 | Percentage change |
|---|---|---|---|
| 2 devices | S 646 kB + R 476 kB = 1122 kB | S 510 kB + R 442 kB = 952 kB | -15.15% |
| 4 devices | S 2295 kB + R 714 kB = 3009 kB | S 1887 kB + R 867 kB = 2754 kB | -8.47% |
| 8 devices | S 5771 kB + R 833 kB = 6604 kB | S 4819 kB + R 1487 kB = 6306 kB | -4.51% |
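For example, the 2-device change follows from the totals in the table (reading S and R as data sent and received): (952 − 1122) / 1122 ≈ −15.15%.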
The final state of the attention synchronization looks like this for a single block:

```
root --- xb ---> node
root <-- xbv --- node
merge att
```
The previous implementation:

```
root --- xb ---> node
root <-- q ----- node
root <-- k ----- node
root <-- v ----- node
root --- xb ---> node
root <-- xb2 --- node
merge att
```
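To make the difference concrete, here is a minimal C++ sketch of the two schedules. It is not the repo's actual code: `sendToNodes`/`recvFromNodes` are hypothetical stand-ins for the root's network transfers, and only the order and count of transfers is meant to match the diagrams above.

```cpp
// Sketch only: hypothetical transport helpers standing in for the root's
// socket transfers (stubs here; real code would do the network I/O).
#include <cstddef>
#include <vector>

static void sendToNodes(const float* /*buf*/, std::size_t /*n*/) {}
static void recvFromNodes(float* /*buf*/, std::size_t /*n*/) {}

// This PR: one round trip per block. Each node computes q, k, v and its
// attention slice locally, so only xb goes out and xbv comes back.
void attBlockNew(float* xb, float* xbv, std::size_t dim) {
    sendToNodes(xb, dim);     // root --- xb ---> node
    recvFromNodes(xbv, dim);  // root <-- xbv --- node
    // root merges the per-node attention outputs ("merge att")
}

// Previous implementation: three extra transfers (q, k, v) plus a second
// xb round trip, because attention was assembled on the root.
void attBlockOld(float* xb, float* q, float* k, float* v, float* xb2,
                 std::size_t dim) {
    sendToNodes(xb, dim);     // root --- xb ---> node
    recvFromNodes(q, dim);    // root <-- q ----- node
    recvFromNodes(k, dim);    // root <-- k ----- node
    recvFromNodes(v, dim);    // root <-- v ----- node
    sendToNodes(xb, dim);     // root --- xb ---> node
    recvFromNodes(xb2, dim);  // root <-- xb2 --- node
    // root merges ("merge att")
}

int main() {
    const std::size_t dim = 4096;
    std::vector<float> xb(dim), xbv(dim);
    attBlockNew(xb.data(), xbv.data(), dim); // 2 transfers instead of 6
}
```

The win per block is two transfers instead of six, which matches the shrinking transfer totals in the table above.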
Not sure why, but I pulled the latest code and now it won't generate any tokens; it gets stuck here: `float* logits = inference->infer(token, pos);`

I thought it might be the changes I was working on, since I was cleaning up server.cpp, but then I tried it on main and got the same behavior.

```
sudo nice -n -20 ./main inference --steps 10 --prompt "Hello World!" --model ~/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer ~/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --workers 192.168.1.3:9990
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 2
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
🕒 ropeCache: 16384 kB
⏩ Loaded 6175568 kB
```

Then nothing happens: CPU usage goes up to around 70%, but no tokens are generated. Any idea what might be happening?
@DifferentialityDevelopment have you pulled this commit? I accidentally disabled memory allocation.
No, I think it might have been my bad; I just realized I forgot to rebuild the worker with the latest code.
Test
Model: Llama 3 8B Q40
Buffer: Q80
Setup: 4 × Raspberry Pi 5 8GB + TP-Link LS1008G switch
[Chart: Transfer size / token]
[Chart: Avg tokens / second]
* I think the switch used is completely non-deterministic; it achieves different speeds at different times. So I recommend comparing only the avg inference time.