b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
MIT License

Use AVX2 to speedup matmulQ40 #54

Closed DifferentialityDevelopment closed 1 month ago

DifferentialityDevelopment commented 1 month ago

Hi @b4rtaz

I managed to get a significant speed-up on my machine with the following changes:

I added AVX2 instructions to speed up matmulQ40 in funcs.cpp.

From my initial testing it definitely appears to be faster.
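For anyone curious what a kernel like this looks like, here is a rough sketch of the general idea, not the exact code in the PR. It assumes a llama.cpp-style Q40 block (32 weights per block: one fp16 scale plus 16 bytes of packed 4-bit values) and a plain float input vector; the struct and function names are illustrative.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Assumed Q40 block layout (llama.cpp-style Q4_0): 32 weights per block,
// one fp16 scale plus 16 bytes of packed 4-bit values. The real structs in
// funcs.cpp may differ.
struct BlockQ40 {
    uint16_t d;      // fp16 scale
    uint8_t qs[16];  // 32 quantized values, two per byte
};

// fp16 -> fp32 via F16C (available on the same CPUs that have AVX2).
static inline float fp16ToFp32(uint16_t h) {
    return _mm_cvtss_f32(_mm_cvtph_ps(_mm_set1_epi16((short)h)));
}

// Dot product of one Q40 row (n weights, n % 32 == 0) with a float vector.
// Illustrative only; build with -mavx2 -mfma -mf16c (or -march=native).
float dotQ40F32_AVX2(const BlockQ40* row, const float* x, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    const __m128i lowMask = _mm_set1_epi8(0x0F);
    const __m128i off = _mm_set1_epi8(8);  // Q4_0 stores nibbles with a +8 offset

    for (size_t i = 0; i < n / 32; i++) {
        const __m256 d = _mm256_set1_ps(fp16ToFp32(row[i].d));

        // Unpack 32 nibbles into signed bytes in [-8, 7]:
        // low nibbles are elements 0..15, high nibbles are elements 16..31.
        const __m128i packed = _mm_loadu_si128((const __m128i*)row[i].qs);
        const __m128i lo = _mm_sub_epi8(_mm_and_si128(packed, lowMask), off);
        const __m128i hi = _mm_sub_epi8(_mm_and_si128(_mm_srli_epi16(packed, 4), lowMask), off);

        // Widen 8 bytes at a time to int32, convert to float, and accumulate d * q * x.
        const __m256i q[4] = {
            _mm256_cvtepi8_epi32(lo),
            _mm256_cvtepi8_epi32(_mm_srli_si128(lo, 8)),
            _mm256_cvtepi8_epi32(hi),
            _mm256_cvtepi8_epi32(_mm_srli_si128(hi, 8)),
        };
        for (int j = 0; j < 4; j++) {
            const __m256 qf = _mm256_cvtepi32_ps(q[j]);
            const __m256 xf = _mm256_loadu_ps(x + i * 32 + j * 8);
            acc = _mm256_fmadd_ps(_mm256_mul_ps(d, qf), xf, acc);
        }
    }

    // Horizontal sum of the 8 accumulator lanes.
    __m128 sum = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    sum = _mm_hadd_ps(sum, sum);
    sum = _mm_hadd_ps(sum, sum);
    return _mm_cvtss_f32(sum);
}
```

The actual matmul splits rows across threads and, depending on --buffer-float-type, may work on a quantized input buffer instead of raw floats, so treat the above only as an outline of the nibble unpacking and FMA accumulation.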

With 1 worker:

sudo nice -n -20 ./main inference --steps 20 --prompt "Hello World! " --model ~/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer ~/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --workers 192.168.1.3:9990
[sudo] password for azamorn:
Using AVX2 instructions
πŸ’‘ arch: llama2
πŸ’‘ dim: 4096
πŸ’‘ hiddenDim: 14336
πŸ’‘ nLayers: 32
πŸ’‘ nHeads: 32
πŸ’‘ nKvHeads: 8
πŸ’‘ vocabSize: 128256
πŸ’‘ seqLen: 2048
πŸ’‘ nSlices: 2
πŸ’‘ ropeTheta: 500000.0
πŸ“„ bosId: 128000
πŸ“„ eosId: 128001
πŸ•’ ropeCache: 16384 kB
⏩ Loaded 6175568 kB
πŸ”Ά G 358 ms I 147 ms T 211 ms S 1917438 kB R 442 kB Hello
πŸ”Ά G 352 ms I 133 ms T 219 ms S 510 kB R 442 kB  World
πŸ”Ά G 344 ms I 143 ms T 200 ms S 510 kB R 442 kB !
πŸ”Ά G 369 ms I 145 ms T 224 ms S 510 kB R 442 kB
πŸ”Ά G 339 ms I 140 ms T 198 ms S 510 kB R 442 kB  I
πŸ”Ά G 347 ms I 148 ms T 198 ms S 510 kB R 442 kB 'm
πŸ”Ά G 368 ms I 150 ms T 218 ms S 510 kB R 442 kB  a
πŸ”Ά G 361 ms I 137 ms T 223 ms S 510 kB R 442 kB  bot
πŸ”Ά G 380 ms I 137 ms T 242 ms S 510 kB R 442 kB .
πŸ”Ά G 365 ms I 143 ms T 221 ms S 510 kB R 442 kB
πŸ”Ά G 356 ms I 139 ms T 217 ms S 510 kB R 442 kB  I
πŸ”Ά G 356 ms I 145 ms T 211 ms S 510 kB R 442 kB 'm
πŸ”Ά G 364 ms I 143 ms T 221 ms S 510 kB R 442 kB  here
πŸ”Ά G 375 ms I 136 ms T 239 ms S 510 kB R 442 kB  to
πŸ”Ά G 345 ms I 132 ms T 212 ms S 510 kB R 442 kB  help
πŸ”Ά G 367 ms I 140 ms T 227 ms S 510 kB R 442 kB  you
πŸ”Ά G 343 ms I 134 ms T 208 ms S 510 kB R 442 kB  with
πŸ”Ά G 352 ms I 144 ms T 208 ms S 510 kB R 442 kB  any
πŸ”Ά G 362 ms I 145 ms T 217 ms S 510 kB R 442 kB  questions
πŸ”Ά G 344 ms I 143 ms T 200 ms S 510 kB R 442 kB  you
Generated tokens:    20
Avg tokens / second: 2.80
Avg generation time: 357.35 ms
Avg inference time:  141.20 ms
Avg transfer time:   215.70 ms

Without a worker:

sudo nice -n -20 ./main inference --steps 20 --prompt "Hello World! " --model ~/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer ~/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8
Using AVX2 instructions
πŸ’‘ arch: llama2
πŸ’‘ dim: 4096
πŸ’‘ hiddenDim: 14336
πŸ’‘ nLayers: 32
πŸ’‘ nHeads: 32
πŸ’‘ nKvHeads: 8
πŸ’‘ vocabSize: 128256
πŸ’‘ seqLen: 2048
πŸ’‘ nSlices: 1
πŸ’‘ ropeTheta: 500000.0
πŸ“„ bosId: 128000
πŸ“„ eosId: 128001
πŸ•’ ropeCache: 32768 kB
⏩ Loaded 6175568 kB
πŸ”Ά G 232 ms I 232 ms T 0 ms S 0 kB R 0 kB Hello
πŸ”Ά G 256 ms I 255 ms T 1 ms S 0 kB R 0 kB  World
πŸ”Ά G 235 ms I 234 ms T 1 ms S 0 kB R 0 kB !
πŸ”Ά G 223 ms I 222 ms T 1 ms S 0 kB R 0 kB
πŸ”Ά G 230 ms I 229 ms T 0 ms S 0 kB R 0 kB  I
πŸ”Ά G 244 ms I 243 ms T 0 ms S 0 kB R 0 kB  am
πŸ”Ά G 235 ms I 233 ms T 1 ms S 0 kB R 0 kB  an
πŸ”Ά G 232 ms I 231 ms T 0 ms S 0 kB R 0 kB  AI
πŸ”Ά G 228 ms I 227 ms T 1 ms S 0 kB R 0 kB  designed
πŸ”Ά G 227 ms I 225 ms T 1 ms S 0 kB R 0 kB  to
πŸ”Ά G 232 ms I 230 ms T 1 ms S 0 kB R 0 kB  generate
πŸ”Ά G 227 ms I 225 ms T 1 ms S 0 kB R 0 kB  text
πŸ”Ά G 225 ms I 224 ms T 0 ms S 0 kB R 0 kB  based
πŸ”Ά G 229 ms I 228 ms T 0 ms S 0 kB R 0 kB  on
πŸ”Ά G 232 ms I 230 ms T 1 ms S 0 kB R 0 kB  the
πŸ”Ά G 227 ms I 225 ms T 1 ms S 0 kB R 0 kB  input
πŸ”Ά G 228 ms I 227 ms T 0 ms S 0 kB R 0 kB  I
πŸ”Ά G 228 ms I 226 ms T 1 ms S 0 kB R 0 kB  receive
πŸ”Ά G 228 ms I 228 ms T 0 ms S 0 kB R 0 kB .
πŸ”Ά G 226 ms I 224 ms T 1 ms S 0 kB R 0 kB  I
Generated tokens:    20
Avg tokens / second: 4.33
Avg generation time: 231.20 ms
Avg inference time:  229.90 ms
Avg transfer time:   0.60 ms

So it does seem to be working correctly at least, and it's definitely much faster than without it.

For reference, previously I was getting:

With worker:
Avg tokens / second: 2.60
Avg generation time: 384.90 ms
Avg inference time:  184.65 ms
Avg transfer time:   199.60 ms

Without worker:
Avg tokens / second: 3.69
Avg generation time: 271.15 ms
Avg inference time:  269.80 ms
Avg transfer time:   0.90 ms

So with a worker it went up from 2.6 to 2.8 t/s (7% faster), and without a worker it went up from 3.69 to 4.33 t/s (17% faster).
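A simple way to sanity-check a kernel like this is to compare its output against a plain scalar version of the same math on random inputs. Below is a rough sketch of such a reference, again illustrative rather than the repo's actual code, reusing the assumed BlockQ40 layout and fp16ToFp32 helper from the sketch above:

```cpp
#include <cstddef>

// Portable reference for cross-checking the AVX2 kernel; reuses the assumed
// BlockQ40 layout and fp16ToFp32 helper from the sketch above.
float dotQ40F32_scalar(const BlockQ40* row, const float* x, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n / 32; i++) {
        const float d = fp16ToFp32(row[i].d);
        for (int j = 0; j < 16; j++) {
            const int lo = (row[i].qs[j] & 0x0F) - 8;  // low nibble  -> element j
            const int hi = (row[i].qs[j] >> 4) - 8;    // high nibble -> element j + 16
            sum += d * (float)lo * x[i * 32 + j];
            sum += d * (float)hi * x[i * 32 + j + 16];
        }
    }
    return sum;
}
```

The two versions should agree to within normal floating-point rounding differences caused by the reordered accumulation.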

DifferentialityDevelopment commented 1 month ago

This project has taught me that I definitely need a faster networking setup. I'm looking at connecting my machines with SFP+ through a switch that has 4 or more SFP+ ports.

b4rtaz commented 1 month ago

Merged. Great job!

b4rtaz commented 1 month ago

Confirmed the speed-up.

Setup: GitHub Codespaces, 4 cores of an AMD EPYC 7763 64-Core Processor, 16 GB RAM

Version 0.5.0:

@b4rtaz ➜ /workspaces/distributed-llama (main) $ ./main inference --model ./dllama_meta-llama-3-8b_q40.bin --tokenizer dllama_meta-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4
...
⏩ Loaded 6175568 kB
πŸ”Ά G  755 ms I  755 ms T    0 ms S      0 kB R      0 kB Hello
πŸ”Ά G  730 ms I  730 ms T    0 ms S      0 kB R      0 kB  world
πŸ”Ά G  759 ms I  758 ms T    0 ms S      0 kB R      0 kB !
πŸ”Ά G  819 ms I  811 ms T    7 ms S      0 kB R      0 kB  <
πŸ”Ά G  717 ms I  715 ms T    1 ms S      0 kB R      0 kB br
πŸ”Ά G  874 ms I  862 ms T   11 ms S      0 kB R      0 kB >

πŸ”Ά G  710 ms I  708 ms T    0 ms S      0 kB R      0 kB  <
πŸ”Ά G  833 ms I  827 ms T    5 ms S      0 kB R      0 kB h
πŸ”Ά G  764 ms I  762 ms T    1 ms S      0 kB R      0 kB 1
πŸ”Ά G  726 ms I  725 ms T    0 ms S      0 kB R      0 kB  align
πŸ”Ά G  864 ms I  857 ms T    6 ms S      0 kB R      0 kB ="
πŸ”Ά G  813 ms I  808 ms T    4 ms S      0 kB R      0 kB center
πŸ”Ά G  698 ms I  697 ms T    0 ms S      0 kB R      0 kB ">
πŸ”Ά G  746 ms I  739 ms T    6 ms S      0 kB R      0 kB Hi
πŸ”Ά G  710 ms I  709 ms T    0 ms S      0 kB R      0 kB  οΏ½
πŸ”Ά G  717 ms I  714 ms T    2 ms S      0 kB R      0 kB 
Generated tokens:    16
Avg tokens / second: 1.31
Avg generation time: 764.69 ms
Avg inference time:  761.06 ms
Avg transfer time:   2.69 ms

Your PR:

@b4rtaz ➜ /workspaces/distributed-llama (main) $ ./main inference --model ./dllama_meta-llama-3-8b_q40.bin --tokenizer dllama_meta-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4
...
⏩ Loaded 6175568 kB
πŸ”Ά G  568 ms I  567 ms T    1 ms S      0 kB R      0 kB Hello
πŸ”Ά G  642 ms I  642 ms T    0 ms S      0 kB R      0 kB  world
πŸ”Ά G  579 ms I  578 ms T    0 ms S      0 kB R      0 kB !
πŸ”Ά G  566 ms I  565 ms T    0 ms S      0 kB R      0 kB  

πŸ”Ά G  646 ms I  643 ms T    1 ms S      0 kB R      0 kB I
πŸ”Ά G  563 ms I  562 ms T    0 ms S      0 kB R      0 kB  am
πŸ”Ά G  818 ms I  785 ms T   32 ms S      0 kB R      0 kB  a
πŸ”Ά G  593 ms I  585 ms T    7 ms S      0 kB R      0 kB  computer
πŸ”Ά G  761 ms I  737 ms T   23 ms S      0 kB R      0 kB  science
πŸ”Ά G  579 ms I  579 ms T    0 ms S      0 kB R      0 kB  student
πŸ”Ά G  566 ms I  564 ms T    1 ms S      0 kB R      0 kB  in
πŸ”Ά G  625 ms I  623 ms T    0 ms S      0 kB R      0 kB  China
πŸ”Ά G  618 ms I  616 ms T    1 ms S      0 kB R      0 kB .
πŸ”Ά G  573 ms I  573 ms T    0 ms S      0 kB R      0 kB  I
πŸ”Ά G  690 ms I  657 ms T   32 ms S      0 kB R      0 kB  have
πŸ”Ά G  668 ms I  646 ms T   22 ms S      0 kB R      0 kB  been
Generated tokens:    16
Avg tokens / second: 1.59
Avg generation time: 628.44 ms
Avg inference time:  620.12 ms
Avg transfer time:   7.50 ms