b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.

Use AVX2 to speedup matmulQ40 #53

Closed DifferentialityDevelopment closed 1 month ago

DifferentialityDevelopment commented 1 month ago

Hi @b4rtaz

I managed to get a significant speedup on my machine with the following changes.

I added AVX2 instructions to speed up matmulQ40 in funcs.cpp.

From my initial testing it definitely appears to be faster.
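For anyone curious what this kind of kernel looks like, here is a minimal sketch of an AVX2 dot product for one Q40 row against an f32 input. It is not the committed patch: it assumes a llama.cpp-style Q4_0 block layout (32 weights per block, an fp16 scale plus 16 bytes of packed nibbles), assumes F16C is available for the scale conversion, and the `BlockQ40` / `dotQ40F32` names and the nibble ordering are illustrative rather than the real funcs.cpp definitions. A full matmulQ40 would loop this over output rows, typically split across the worker threads.

```cpp
// Illustrative only: AVX2 dot product of one Q40-quantized weight row with an f32 input.
// Assumed block layout (llama.cpp-style Q4_0): 32 weights per block, fp16 scale + 16 nibble bytes.
// Build with e.g. -mavx2 -mf16c.
#include <immintrin.h>
#include <stdint.h>

#define QK40 32

typedef struct {
    uint16_t d;            // fp16 scale of the block
    uint8_t qs[QK40 / 2];  // 32 x 4-bit quants, two per byte
} BlockQ40;                // hypothetical name, not the real funcs.cpp type

static inline float fp16ToFp32(uint16_t h) {
    return _mm_cvtss_f32(_mm_cvtph_ps(_mm_set1_epi16((short)h)));  // needs F16C
}

// n must be a multiple of QK40
float dotQ40F32(const BlockQ40* row, const float* x, int n) {
    const __m128i lowMask = _mm_set1_epi8(0x0F);
    const __m128i offset  = _mm_set1_epi8(8);   // Q4_0 zero point
    __m256 acc = _mm256_setzero_ps();

    for (int i = 0; i < n / QK40; i++) {
        const __m256 vd = _mm256_set1_ps(fp16ToFp32(row[i].d));

        // Load 16 bytes = 32 nibbles. Assumed ordering: low nibbles hold quants 0..15, high nibbles 16..31.
        const __m128i bytes = _mm_loadu_si128((const __m128i*)row[i].qs);
        const __m128i lo = _mm_sub_epi8(_mm_and_si128(bytes, lowMask), offset);
        const __m128i hi = _mm_sub_epi8(_mm_and_si128(_mm_srli_epi16(bytes, 4), lowMask), offset);

        // Sign-extend groups of 8 quants to int32, convert to float, apply the block scale.
        const __m256 w0 = _mm256_mul_ps(vd, _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(lo)));
        const __m256 w1 = _mm256_mul_ps(vd, _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(_mm_srli_si128(lo, 8))));
        const __m256 w2 = _mm256_mul_ps(vd, _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(hi)));
        const __m256 w3 = _mm256_mul_ps(vd, _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(_mm_srli_si128(hi, 8))));

        // Multiply-accumulate against the corresponding 32 input floats.
        const float* xb = x + i * QK40;
        acc = _mm256_add_ps(acc, _mm256_mul_ps(w0, _mm256_loadu_ps(xb + 0)));
        acc = _mm256_add_ps(acc, _mm256_mul_ps(w1, _mm256_loadu_ps(xb + 8)));
        acc = _mm256_add_ps(acc, _mm256_mul_ps(w2, _mm256_loadu_ps(xb + 16)));
        acc = _mm256_add_ps(acc, _mm256_mul_ps(w3, _mm256_loadu_ps(xb + 24)));
    }

    // Horizontal sum of the 8 accumulator lanes.
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
```

A faster variant would keep the multiply-accumulate in the integer domain with `_mm256_maddubs_epi16` (the approach llama.cpp's Q4_0 kernels take), but the float version above is easier to follow.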

With 1 worker:

sudo nice -n -20 ./main inference --steps 20 --prompt "Hello World! " --model ~/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer ~/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --workers 192.168.1.3:9990

[sudo] password for azamorn:
Using AVX2 instructions
πŸ’‘ arch: llama2
πŸ’‘ dim: 4096
πŸ’‘ hiddenDim: 14336
πŸ’‘ nLayers: 32
πŸ’‘ nHeads: 32
πŸ’‘ nKvHeads: 8
πŸ’‘ vocabSize: 128256
πŸ’‘ seqLen: 2048
πŸ’‘ nSlices: 2
πŸ’‘ ropeTheta: 500000.0
πŸ“„ bosId: 128000
πŸ“„ eosId: 128001
πŸ•’ ropeCache: 16384 kB
⏩ Loaded 6175568 kB
πŸ”Ά G 358 ms I 147 ms T 211 ms S 1917438 kB R 442 kB Hello
πŸ”Ά G 352 ms I 133 ms T 219 ms S 510 kB R 442 kB World
πŸ”Ά G 344 ms I 143 ms T 200 ms S 510 kB R 442 kB !
πŸ”Ά G 369 ms I 145 ms T 224 ms S 510 kB R 442 kB
πŸ”Ά G 339 ms I 140 ms T 198 ms S 510 kB R 442 kB I
πŸ”Ά G 347 ms I 148 ms T 198 ms S 510 kB R 442 kB 'm
πŸ”Ά G 368 ms I 150 ms T 218 ms S 510 kB R 442 kB a
πŸ”Ά G 361 ms I 137 ms T 223 ms S 510 kB R 442 kB bot
πŸ”Ά G 380 ms I 137 ms T 242 ms S 510 kB R 442 kB .
πŸ”Ά G 365 ms I 143 ms T 221 ms S 510 kB R 442 kB
πŸ”Ά G 356 ms I 139 ms T 217 ms S 510 kB R 442 kB I
πŸ”Ά G 356 ms I 145 ms T 211 ms S 510 kB R 442 kB 'm
πŸ”Ά G 364 ms I 143 ms T 221 ms S 510 kB R 442 kB here
πŸ”Ά G 375 ms I 136 ms T 239 ms S 510 kB R 442 kB to
πŸ”Ά G 345 ms I 132 ms T 212 ms S 510 kB R 442 kB help
πŸ”Ά G 367 ms I 140 ms T 227 ms S 510 kB R 442 kB you
πŸ”Ά G 343 ms I 134 ms T 208 ms S 510 kB R 442 kB with
πŸ”Ά G 352 ms I 144 ms T 208 ms S 510 kB R 442 kB any
πŸ”Ά G 362 ms I 145 ms T 217 ms S 510 kB R 442 kB questions
πŸ”Ά G 344 ms I 143 ms T 200 ms S 510 kB R 442 kB you
Generated tokens: 20
Avg tokens / second: 2.80
Avg generation time: 357.35 ms
Avg inference time: 141.20 ms
Avg transfer time: 215.70 ms

Without a worker:

sudo nice -n -20 ./main inference --steps 20 --prompt "Hello World! " --model ~/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer ~/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8

Using AVX2 instructions
πŸ’‘ arch: llama2
πŸ’‘ dim: 4096
πŸ’‘ hiddenDim: 14336
πŸ’‘ nLayers: 32
πŸ’‘ nHeads: 32
πŸ’‘ nKvHeads: 8
πŸ’‘ vocabSize: 128256
πŸ’‘ seqLen: 2048
πŸ’‘ nSlices: 1
πŸ’‘ ropeTheta: 500000.0
πŸ“„ bosId: 128000
πŸ“„ eosId: 128001
πŸ•’ ropeCache: 32768 kB
⏩ Loaded 6175568 kB
πŸ”Ά G 232 ms I 232 ms T 0 ms S 0 kB R 0 kB Hello
πŸ”Ά G 256 ms I 255 ms T 1 ms S 0 kB R 0 kB World
πŸ”Ά G 235 ms I 234 ms T 1 ms S 0 kB R 0 kB !
πŸ”Ά G 223 ms I 222 ms T 1 ms S 0 kB R 0 kB
πŸ”Ά G 230 ms I 229 ms T 0 ms S 0 kB R 0 kB I
πŸ”Ά G 244 ms I 243 ms T 0 ms S 0 kB R 0 kB am
πŸ”Ά G 235 ms I 233 ms T 1 ms S 0 kB R 0 kB an
πŸ”Ά G 232 ms I 231 ms T 0 ms S 0 kB R 0 kB AI
πŸ”Ά G 228 ms I 227 ms T 1 ms S 0 kB R 0 kB designed
πŸ”Ά G 227 ms I 225 ms T 1 ms S 0 kB R 0 kB to
πŸ”Ά G 232 ms I 230 ms T 1 ms S 0 kB R 0 kB generate
πŸ”Ά G 227 ms I 225 ms T 1 ms S 0 kB R 0 kB text
πŸ”Ά G 225 ms I 224 ms T 0 ms S 0 kB R 0 kB based
πŸ”Ά G 229 ms I 228 ms T 0 ms S 0 kB R 0 kB on
πŸ”Ά G 232 ms I 230 ms T 1 ms S 0 kB R 0 kB the
πŸ”Ά G 227 ms I 225 ms T 1 ms S 0 kB R 0 kB input
πŸ”Ά G 228 ms I 227 ms T 0 ms S 0 kB R 0 kB I
πŸ”Ά G 228 ms I 226 ms T 1 ms S 0 kB R 0 kB receive
πŸ”Ά G 228 ms I 228 ms T 0 ms S 0 kB R 0 kB .
πŸ”Ά G 226 ms I 224 ms T 1 ms S 0 kB R 0 kB I
Generated tokens: 20
Avg tokens / second: 4.33
Avg generation time: 231.20 ms
Avg inference time: 229.90 ms
Avg transfer time: 0.60 ms

So it does seem to be working correctly, and it's definitely much faster than without the AVX2 path.
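As an aside, the "Using AVX2 instructions" banner in the logs suggests the AVX2 path is chosen at build time. Purely as an illustration, and not necessarily how funcs.cpp actually gates it, such a compile-time check typically looks like the sketch below: building with `-mavx2` defines `__AVX2__` and pulls in the vectorized kernels, otherwise the scalar fallback is used.

```cpp
// Illustrative only, not the actual funcs.cpp logic: report which code path was compiled in.
#include <cstdio>

int main() {
#if defined(__AVX2__)
    // Defined when the compiler targets AVX2 (e.g. -mavx2); the AVX2 kernels get built.
    printf("Using AVX2 instructions\n");
#else
    // Portable scalar fallback path.
    printf("Using scalar fallback\n");
#endif
    return 0;
}
```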

For reference, previously I was getting:

With worker:
Avg tokens / second: 2.60
Avg generation time: 384.90 ms
Avg inference time: 184.65 ms
Avg transfer time: 199.60 ms

Without worker:
Avg tokens / second: 3.69
Avg generation time: 271.15 ms
Avg inference time: 269.80 ms
Avg transfer time: 0.90 ms

So with a worker it went up from 2.60 to 2.80 t/s (~8% faster), and without a worker from 3.69 to 4.33 t/s (~17% faster).

b4rtaz commented 1 month ago

Nice!!! Could you create a PR without the server changes? I'll merge it immediately.

DifferentialityDevelopment commented 1 month ago

> Nice!!! Could you create a PR without the server changes? I'll merge it immediately.

Whoops, my bad! Sure, will do. For simplicity's sake I'm going to close this PR and open another one.