Closed: DifferentialityDevelopment closed this 1 month ago
Nice!!! Could you create a PR without the server changes? I'll merge it immediately.
Whoops, my bad! Sure, will do. For simplicity's sake I'm going to close this PR and open another.
Hi @b4rtaz
I managed to get a significant speedup on my machine with the following change:
I added AVX2 instructions to speed up matmulQ40 in funcs.cpp.
From my initial testing, it definitely appears to be faster.
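For context, here's a minimal sketch of the kind of AVX2 kernel this relies on, assuming a llama.cpp-style Q40/Q80 block layout (32 quants per block with an fp16 scale). The struct and function names below are illustrative, not the actual code in funcs.cpp, and it needs `-mavx2 -mfma -mf16c` to build:

```cpp
#include <immintrin.h>
#include <cstdint>

// Assumed llama.cpp-style block layouts (32 quants per block, fp16 scale).
// Illustrative only, not the exact structs from funcs.cpp.
struct BlockQ40 { uint16_t d; uint8_t qs[16]; }; // 32 x 4-bit weights
struct BlockQ80 { uint16_t d; int8_t qs[32]; };  // 32 x 8-bit activations

// fp16 -> fp32 via F16C.
static inline float f16ToF32(uint16_t h) {
    return _mm_cvtss_f32(_mm_cvtph_ps(_mm_set1_epi16((short)h)));
}

// Dot product of n blocks of Q40 weights with Q80 activations.
float dotQ40Q80(const BlockQ40* w, const BlockQ80* x, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i++) {
        // Unpack 16 packed bytes into 32 nibbles: low nibbles become
        // elements 0..15, high nibbles elements 16..31 (assumed layout).
        __m128i raw = _mm_loadu_si128((const __m128i*)w[i].qs);
        __m256i q = _mm256_and_si256(
            _mm256_set1_epi8(0x0F),
            _mm256_inserti128_si256(_mm256_castsi128_si256(raw),
                                    _mm_srli_epi16(raw, 4), 1));
        q = _mm256_sub_epi8(q, _mm256_set1_epi8(8)); // Q40's implicit -8 offset

        __m256i a = _mm256_loadu_si256((const __m256i*)x[i].qs);
        // maddubs wants its first operand unsigned, so use |q| and move
        // q's sign onto the activations: |q| * sign(q)*a == q*a.
        __m256i prod16 = _mm256_maddubs_epi16(_mm256_sign_epi8(q, q),
                                              _mm256_sign_epi8(a, q));
        __m256i prod32 = _mm256_madd_epi16(prod16, _mm256_set1_epi16(1));

        float d = f16ToF32(w[i].d) * f16ToF32(x[i].d); // combined block scale
        acc = _mm256_fmadd_ps(_mm256_set1_ps(d), _mm256_cvtepi32_ps(prod32), acc);
    }
    // Horizontal sum of the 8 float lanes.
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
```

The `_mm256_maddubs_epi16` / `_mm256_madd_epi16` pair accumulates 32 byte products per iteration, which is where most of the win over scalar code comes from.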
With 1 worker:

```
sudo nice -n -20 ./main inference --steps 20 --prompt "Hello World! " --model ~/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer ~/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --workers 192.168.1.3:9990
[sudo] password for azamorn:
Using AVX2 instructions
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 2
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
📄 ropeCache: 16384 kB
⏩ Loaded 6175568 kB
🔶 G 358 ms I 147 ms T 211 ms S 1917438 kB R 442 kB Hello
🔶 G 352 ms I 133 ms T 219 ms S 510 kB R 442 kB World
🔶 G 344 ms I 143 ms T 200 ms S 510 kB R 442 kB !
🔶 G 369 ms I 145 ms T 224 ms S 510 kB R 442 kB
🔶 G 339 ms I 140 ms T 198 ms S 510 kB R 442 kB I
🔶 G 347 ms I 148 ms T 198 ms S 510 kB R 442 kB 'm
🔶 G 368 ms I 150 ms T 218 ms S 510 kB R 442 kB a
🔶 G 361 ms I 137 ms T 223 ms S 510 kB R 442 kB bot
🔶 G 380 ms I 137 ms T 242 ms S 510 kB R 442 kB .
🔶 G 365 ms I 143 ms T 221 ms S 510 kB R 442 kB
🔶 G 356 ms I 139 ms T 217 ms S 510 kB R 442 kB I
🔶 G 356 ms I 145 ms T 211 ms S 510 kB R 442 kB 'm
🔶 G 364 ms I 143 ms T 221 ms S 510 kB R 442 kB here
🔶 G 375 ms I 136 ms T 239 ms S 510 kB R 442 kB to
🔶 G 345 ms I 132 ms T 212 ms S 510 kB R 442 kB help
🔶 G 367 ms I 140 ms T 227 ms S 510 kB R 442 kB you
🔶 G 343 ms I 134 ms T 208 ms S 510 kB R 442 kB with
🔶 G 352 ms I 144 ms T 208 ms S 510 kB R 442 kB any
🔶 G 362 ms I 145 ms T 217 ms S 510 kB R 442 kB questions
🔶 G 344 ms I 143 ms T 200 ms S 510 kB R 442 kB you
Generated tokens: 20
Avg tokens / second: 2.80
Avg generation time: 357.35 ms
Avg inference time: 141.20 ms
Avg transfer time: 215.70 ms
```
Without a worker:

```
sudo nice -n -20 ./main inference --steps 20 --prompt "Hello World! " --model ~/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer ~/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8
Using AVX2 instructions
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 1
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
📄 ropeCache: 32768 kB
⏩ Loaded 6175568 kB
🔶 G 232 ms I 232 ms T 0 ms S 0 kB R 0 kB Hello
🔶 G 256 ms I 255 ms T 1 ms S 0 kB R 0 kB World
🔶 G 235 ms I 234 ms T 1 ms S 0 kB R 0 kB !
🔶 G 223 ms I 222 ms T 1 ms S 0 kB R 0 kB
🔶 G 230 ms I 229 ms T 0 ms S 0 kB R 0 kB I
🔶 G 244 ms I 243 ms T 0 ms S 0 kB R 0 kB am
🔶 G 235 ms I 233 ms T 1 ms S 0 kB R 0 kB an
🔶 G 232 ms I 231 ms T 0 ms S 0 kB R 0 kB AI
🔶 G 228 ms I 227 ms T 1 ms S 0 kB R 0 kB designed
🔶 G 227 ms I 225 ms T 1 ms S 0 kB R 0 kB to
🔶 G 232 ms I 230 ms T 1 ms S 0 kB R 0 kB generate
🔶 G 227 ms I 225 ms T 1 ms S 0 kB R 0 kB text
🔶 G 225 ms I 224 ms T 0 ms S 0 kB R 0 kB based
🔶 G 229 ms I 228 ms T 0 ms S 0 kB R 0 kB on
🔶 G 232 ms I 230 ms T 1 ms S 0 kB R 0 kB the
🔶 G 227 ms I 225 ms T 1 ms S 0 kB R 0 kB input
🔶 G 228 ms I 227 ms T 0 ms S 0 kB R 0 kB I
🔶 G 228 ms I 226 ms T 1 ms S 0 kB R 0 kB receive
🔶 G 228 ms I 228 ms T 0 ms S 0 kB R 0 kB .
🔶 G 226 ms I 224 ms T 1 ms S 0 kB R 0 kB I
Generated tokens: 20
Avg tokens / second: 4.33
Avg generation time: 231.20 ms
Avg inference time: 229.90 ms
Avg transfer time: 0.60 ms
```
So it does seem to be working correctly, and it's definitely much faster than without the AVX2 path.
For reference, previously I was getting:

With worker:

```
Avg tokens / second: 2.60
Avg generation time: 384.90 ms
Avg inference time: 184.65 ms
Avg transfer time: 199.60 ms
```

Without worker:

```
Avg tokens / second: 3.69
Avg generation time: 271.15 ms
Avg inference time: 269.80 ms
Avg transfer time: 0.90 ms
```
So with a worker it went up from 2.60 to 2.80 t/s (~8% faster), and without a worker from 3.69 to 4.33 t/s (~17% faster).
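(Speedup is just new/old: 2.80 / 2.60 ≈ 1.08 and 4.33 / 3.69 ≈ 1.17.)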