DifferentialityDevelopment closed this 1 month ago
This project has taught me that I definitely need a faster networking setup. I'm looking at connecting my machines over SFP+ to a switch with 4 or more SFP+ ports.
Merged. Great job!
Confirmed the speed-up.
Setup: GitHub Codespaces, 4-core AMD EPYC 7763 64-Core Processor, 16 GB RAM
0.5.0:
@b4rtaz ➜ /workspaces/distributed-llama (main) $ ./main inference --model ./dllama_meta-llama-3-8b_q40.bin --tokenizer dllama_meta-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4
...
⏩ Loaded 6175568 kB
🔶 G 755 ms I 755 ms T 0 ms S 0 kB R 0 kB Hello
🔶 G 730 ms I 730 ms T 0 ms S 0 kB R 0 kB world
🔶 G 759 ms I 758 ms T 0 ms S 0 kB R 0 kB !
🔶 G 819 ms I 811 ms T 7 ms S 0 kB R 0 kB <
🔶 G 717 ms I 715 ms T 1 ms S 0 kB R 0 kB br
🔶 G 874 ms I 862 ms T 11 ms S 0 kB R 0 kB >
🔶 G 710 ms I 708 ms T 0 ms S 0 kB R 0 kB <
🔶 G 833 ms I 827 ms T 5 ms S 0 kB R 0 kB h
🔶 G 764 ms I 762 ms T 1 ms S 0 kB R 0 kB 1
🔶 G 726 ms I 725 ms T 0 ms S 0 kB R 0 kB align
🔶 G 864 ms I 857 ms T 6 ms S 0 kB R 0 kB ="
🔶 G 813 ms I 808 ms T 4 ms S 0 kB R 0 kB center
🔶 G 698 ms I 697 ms T 0 ms S 0 kB R 0 kB ">
🔶 G 746 ms I 739 ms T 6 ms S 0 kB R 0 kB Hi
🔶 G 710 ms I 709 ms T 0 ms S 0 kB R 0 kB �
🔶 G 717 ms I 714 ms T 2 ms S 0 kB R 0 kB
Generated tokens: 16
Avg tokens / second: 1.31
Avg generation time: 764.69 ms
Avg inference time: 761.06 ms
Avg transfer time: 2.69 ms
Your PR:
@b4rtaz ➜ /workspaces/distributed-llama (main) $ ./main inference --model ./dllama_meta-llama-3-8b_q40.bin --tokenizer dllama_meta-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4
...
⏩ Loaded 6175568 kB
🔶 G 568 ms I 567 ms T 1 ms S 0 kB R 0 kB Hello
🔶 G 642 ms I 642 ms T 0 ms S 0 kB R 0 kB world
🔶 G 579 ms I 578 ms T 0 ms S 0 kB R 0 kB !
🔶 G 566 ms I 565 ms T 0 ms S 0 kB R 0 kB
🔶 G 646 ms I 643 ms T 1 ms S 0 kB R 0 kB I
🔶 G 563 ms I 562 ms T 0 ms S 0 kB R 0 kB am
🔶 G 818 ms I 785 ms T 32 ms S 0 kB R 0 kB a
🔶 G 593 ms I 585 ms T 7 ms S 0 kB R 0 kB computer
🔶 G 761 ms I 737 ms T 23 ms S 0 kB R 0 kB science
🔶 G 579 ms I 579 ms T 0 ms S 0 kB R 0 kB student
🔶 G 566 ms I 564 ms T 1 ms S 0 kB R 0 kB in
🔶 G 625 ms I 623 ms T 0 ms S 0 kB R 0 kB China
🔶 G 618 ms I 616 ms T 1 ms S 0 kB R 0 kB .
🔶 G 573 ms I 573 ms T 0 ms S 0 kB R 0 kB I
🔶 G 690 ms I 657 ms T 32 ms S 0 kB R 0 kB have
🔶 G 668 ms I 646 ms T 22 ms S 0 kB R 0 kB been
Generated tokens: 16
Avg tokens / second: 1.59
Avg generation time: 628.44 ms
Avg inference time: 620.12 ms
Avg transfer time: 7.50 ms
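That works out to roughly (1.59 - 1.31) / 1.31 ≈ 21% more tokens per second on this setup, with average inference time dropping from about 761 ms to 620 ms per token.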
Hi @b4rtaz,
I managed to get a significant speed-up on my machine with the following changes.
I added AVX2 instructions to speed up matmulQ40 in funcs.cpp.
From my initial testing it definitely appears to be faster.
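To illustrate the general idea, here is a simplified sketch of such a kernel, not the exact code in the PR: it assumes a llama.cpp-style Q4_0 block (32 weights per block, a plain float scale instead of fp16, 16 packed bytes), a float input vector rather than the q80 buffer, and a made-up helper name dotQ40F32_avx2. The AVX2 path unpacks the 4-bit nibbles, subtracts the offset of 8, converts to float, and accumulates with FMA:

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Hypothetical block layout; the real BlockQ40 in distributed-llama may differ
// (e.g. an fp16 scale stored as uint16_t).
struct BlockQ40 {
    float d;         // block scale
    uint8_t qs[16];  // 32 x 4-bit quantized weights (byte i: low nibble = w[i], high nibble = w[i+16])
};

// Dot product of one quantized row with a float vector of length n (n % 32 == 0).
static float dotQ40F32_avx2(const BlockQ40* row, const float* x, size_t n) {
    const __m256i lowMask = _mm256_set1_epi8(0x0F);
    const __m256i offset = _mm256_set1_epi8(8);
    __m256 acc = _mm256_setzero_ps();

    for (size_t b = 0; b < n / 32; b++) {
        // Unpack 32 nibbles into 32 signed bytes in [-8, 7]:
        // lower 128 bits = weights 0..15 (low nibbles), upper 128 bits = weights 16..31 (high nibbles).
        const __m128i packed = _mm_loadu_si128((const __m128i*)row[b].qs);
        __m256i bytes = _mm256_set_m128i(_mm_srli_epi16(packed, 4), packed);
        bytes = _mm256_and_si256(bytes, lowMask);
        bytes = _mm256_sub_epi8(bytes, offset);

        const __m256 scale = _mm256_set1_ps(row[b].d);
        for (int part = 0; part < 2; part++) {
            // Take 16 bytes at a time, sign-extend to int32, convert to float, scale and FMA with x.
            const __m128i half = (part == 0)
                ? _mm256_castsi256_si128(bytes)
                : _mm256_extracti128_si256(bytes, 1);
            const __m256i lo32 = _mm256_cvtepi8_epi32(half);
            const __m256i hi32 = _mm256_cvtepi8_epi32(_mm_srli_si128(half, 8));
            const float* xb = x + b * 32 + part * 16;
            acc = _mm256_fmadd_ps(_mm256_mul_ps(scale, _mm256_cvtepi32_ps(lo32)),
                                  _mm256_loadu_ps(xb), acc);
            acc = _mm256_fmadd_ps(_mm256_mul_ps(scale, _mm256_cvtepi32_ps(hi32)),
                                  _mm256_loadu_ps(xb + 8), acc);
        }
    }

    // Horizontal sum of the 8 accumulator lanes.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    return _mm_cvtss_f32(lo);
}

A kernel like this needs to be compiled with -mavx2 -mfma, which is why the logs below print "Using AVX2 instructions" when the path is enabled.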
With 1 worker:
sudo nice -n -20 ./main inference --steps 20 --prompt "Hello World! " --model ~/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer ~/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --workers 192.168.1.3:9990
[sudo] password for azamorn:
Using AVX2 instructions
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 2
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
📄 ropeCache: 16384 kB
⏩ Loaded 6175568 kB
🔶 G 358 ms I 147 ms T 211 ms S 1917438 kB R 442 kB Hello
🔶 G 352 ms I 133 ms T 219 ms S 510 kB R 442 kB World
🔶 G 344 ms I 143 ms T 200 ms S 510 kB R 442 kB !
🔶 G 369 ms I 145 ms T 224 ms S 510 kB R 442 kB
🔶 G 339 ms I 140 ms T 198 ms S 510 kB R 442 kB I
🔶 G 347 ms I 148 ms T 198 ms S 510 kB R 442 kB 'm
🔶 G 368 ms I 150 ms T 218 ms S 510 kB R 442 kB a
🔶 G 361 ms I 137 ms T 223 ms S 510 kB R 442 kB bot
🔶 G 380 ms I 137 ms T 242 ms S 510 kB R 442 kB .
🔶 G 365 ms I 143 ms T 221 ms S 510 kB R 442 kB
🔶 G 356 ms I 139 ms T 217 ms S 510 kB R 442 kB I
🔶 G 356 ms I 145 ms T 211 ms S 510 kB R 442 kB 'm
🔶 G 364 ms I 143 ms T 221 ms S 510 kB R 442 kB here
🔶 G 375 ms I 136 ms T 239 ms S 510 kB R 442 kB to
🔶 G 345 ms I 132 ms T 212 ms S 510 kB R 442 kB help
🔶 G 367 ms I 140 ms T 227 ms S 510 kB R 442 kB you
🔶 G 343 ms I 134 ms T 208 ms S 510 kB R 442 kB with
🔶 G 352 ms I 144 ms T 208 ms S 510 kB R 442 kB any
🔶 G 362 ms I 145 ms T 217 ms S 510 kB R 442 kB questions
🔶 G 344 ms I 143 ms T 200 ms S 510 kB R 442 kB you
Generated tokens: 20
Avg tokens / second: 2.80
Avg generation time: 357.35 ms
Avg inference time: 141.20 ms
Avg transfer time: 215.70 ms
Without a worker:
sudo nice -n -20 ./main inference --steps 20 --prompt "Hello World! " --model ~/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer ~/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8
Using AVX2 instructions
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 1
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
📄 ropeCache: 32768 kB
⏩ Loaded 6175568 kB
🔶 G 232 ms I 232 ms T 0 ms S 0 kB R 0 kB Hello
🔶 G 256 ms I 255 ms T 1 ms S 0 kB R 0 kB World
🔶 G 235 ms I 234 ms T 1 ms S 0 kB R 0 kB !
🔶 G 223 ms I 222 ms T 1 ms S 0 kB R 0 kB
🔶 G 230 ms I 229 ms T 0 ms S 0 kB R 0 kB I
🔶 G 244 ms I 243 ms T 0 ms S 0 kB R 0 kB am
🔶 G 235 ms I 233 ms T 1 ms S 0 kB R 0 kB an
🔶 G 232 ms I 231 ms T 0 ms S 0 kB R 0 kB AI
🔶 G 228 ms I 227 ms T 1 ms S 0 kB R 0 kB designed
🔶 G 227 ms I 225 ms T 1 ms S 0 kB R 0 kB to
🔶 G 232 ms I 230 ms T 1 ms S 0 kB R 0 kB generate
🔶 G 227 ms I 225 ms T 1 ms S 0 kB R 0 kB text
🔶 G 225 ms I 224 ms T 0 ms S 0 kB R 0 kB based
🔶 G 229 ms I 228 ms T 0 ms S 0 kB R 0 kB on
🔶 G 232 ms I 230 ms T 1 ms S 0 kB R 0 kB the
🔶 G 227 ms I 225 ms T 1 ms S 0 kB R 0 kB input
🔶 G 228 ms I 227 ms T 0 ms S 0 kB R 0 kB I
🔶 G 228 ms I 226 ms T 1 ms S 0 kB R 0 kB receive
🔶 G 228 ms I 228 ms T 0 ms S 0 kB R 0 kB .
🔶 G 226 ms I 224 ms T 1 ms S 0 kB R 0 kB I
Generated tokens: 20
Avg tokens / second: 4.33
Avg generation time: 231.20 ms
Avg inference time: 229.90 ms
Avg transfer time: 0.60 ms
So it does seem to be working correctly, and it's definitely much faster than without the AVX2 path.
For reference, previously I was getting:
With worker:
Avg tokens / second: 2.60
Avg generation time: 384.90 ms
Avg inference time: 184.65 ms
Avg transfer time: 199.60 ms
Without worker:
Avg tokens / second: 3.69
Avg generation time: 271.15 ms
Avg inference time: 269.80 ms
Avg transfer time: 0.90 ms
So with a worker it went up to 2.80 from 2.60 t/s (about 8% faster), and without a worker it went up to 4.33 from 3.69 t/s (about 17% faster).