Closed · ikawrakow closed this 1 month ago
This is a remarkable change @ikawrakow. I'm very happy to see that the best quantized formats will now go the fastest. For prompt processing, I'm consistently seeing speedups between 1.2x and 2.0x on x86-64 machines. You even managed to make token generation go faster (which I've found much more difficult), in some cases by as much as 1.33x! Here are my measurements, on three different computers, for three different models.
Before: `89c189e9f8212c45621254bce0599e4b49568a4d` After: `ddb9a8c55281c029961cb0d06a5b43676cbb6ac8`
Prompt processing (tokens/second):

MODEL | quant | microprocessor | before (t/s) | after (t/s) | speedup |
---|---|---|---|---|---|
TinyLLaMA 1.1B | Q2_K | Intel i9-9900 | 204 | 340 | 1.66x |
TinyLLaMA 1.1B | Q3_K_S | Intel i9-9900 | 160 | 317 | 1.98x |
TinyLLaMA 1.1B | Q3_K_M | Intel i9-9900 | 174 | 309 | 1.77x |
TinyLLaMA 1.1B | Q4_0 | Intel i9-9900 | 167 | - | - |
TinyLLaMA 1.1B | Q5_K_M | Intel i9-9900 | 147 | 280 | 1.90x |
TinyLLaMA 1.1B | Q8_0 | Intel i9-9900 | 219 | - | - |
TinyLLaMA 1.1B | F16 | Intel i9-9900 | 251 | - | - |
TinyLLaMA 1.1B | BF16 | Intel i9-9900 | 222 | - | - |
TinyLLaMA 1.1B | Q2_K | Intel i9-14900K | 300 | 600 | 2.00x |
TinyLLaMA 1.1B | Q3_K_S | Intel i9-14900K | 289 | 606 | 2.10x |
TinyLLaMA 1.1B | Q3_K_M | Intel i9-14900K | 316 | 606 | 1.92x |
TinyLLaMA 1.1B | Q4_0 | Intel i9-14900K | 418 | - | - |
TinyLLaMA 1.1B | Q5_K_M | Intel i9-14900K | 275 | 570 | 2.07x |
TinyLLaMA 1.1B | Q8_0 | Intel i9-14900K | 467 | - | - |
TinyLLaMA 1.1B | F16 | Intel i9-14900K | 405 | - | - |
TinyLLaMA 1.1B | BF16 | Intel i9-14900K | 97 | - | - |
TinyLLaMA 1.1B | Q2_K | Ryzen 7995WX | 1350 | 1667 | 1.23x |
TinyLLaMA 1.1B | Q3_K_S | Ryzen 7995WX | 1181 | 1648 | 1.39x |
TinyLLaMA 1.1B | Q3_K_M | Ryzen 7995WX | 1248 | 1636 | 1.31x |
TinyLLaMA 1.1B | Q4_0 | Ryzen 7995WX | 1379 | - | - |
TinyLLaMA 1.1B | Q5_K_M | Ryzen 7995WX | 961 | 1626 | 1.69x |
TinyLLaMA 1.1B | F16 | Ryzen 7995WX | 1230 | - | - |
TinyLLaMA 1.1B | BF16 | Ryzen 7995WX | 1800 | - | - |
LLaMA 3 8B | Q4_0 | Intel i9-9900 | 27 | - | - |
LLaMA 3 8B | Q4_K_M | Intel i9-9900 | 28 | 41 | 1.46x |
LLaMA 3 8B | Q4_0 | Intel i9-14900K | 62 | - | - |
LLaMA 3 8B | Q4_K_M | Intel i9-14900K | 57 | 90 | 1.57x |
LLaMA 3 8B | F16 | Intel i9-14900K | 59 | - | - |
LLaMA 3 8B | Q3_K_S | Ryzen 7995WX | 225 | 416 | 1.84x |
LLaMA 3 8B | Q4_0 | Ryzen 7995WX | 278 | - | - |
LLaMA 3 8B | Q4_K_S | Ryzen 7995WX | 188 | 386 | 2.05x |
LLaMA 3 8B | F16 | Ryzen 7995WX | 357 | - | - |
LLaMA 3 8B | BF16 | Ryzen 7995WX | 508 | - | - |
LLaMA 3 70B | Q2_K | Ryzen 7995WX | 31 | 51 | 1.65x |
LLaMA 3 70B | Q3_K_S | Ryzen 7995WX | 23 | 44 | 1.91x |
LLaMA 3 70B | Q4_0 | Ryzen 7995WX | 31 | - | - |
LLaMA 3 70B | F16 | Ryzen 7995WX | 42 | - | - |
LLaMA 3 70B | BF16 | Ryzen 7995WX | 65 | - | - |
Token generation (tokens/second):

MODEL | quant | microprocessor | before (t/s) | after (t/s) | speedup |
---|---|---|---|---|---|
TinyLLaMA 1.1B | Q2_K | Intel i9-9900 | 48 | 57 | 1.18x |
TinyLLaMA 1.1B | Q3_K_S | Intel i9-9900 | 44 | 50 | 1.13x |
TinyLLaMA 1.1B | Q3_K_M | Intel i9-9900 | 42 | 47 | 1.11x |
TinyLLaMA 1.1B | Q4_0 | Intel i9-9900 | 34 | - | - |
TinyLLaMA 1.1B | Q5_K_M | Intel i9-9900 | 32 | 35 | 1.09x |
TinyLLaMA 1.1B | Q8_0 | Intel i9-9900 | 25 | - | - |
TinyLLaMA 1.1B | F16 | Intel i9-9900 | 15 | - | - |
TinyLLaMA 1.1B | BF16 | Intel i9-9900 | 15 | - | - |
TinyLLaMA 1.1B | Q2_K | Intel i9-14900K | 102 | 129 | 1.26x |
TinyLLaMA 1.1B | Q3_K_S | Intel i9-14900K | 99 | 125 | 1.26x |
TinyLLaMA 1.1B | Q3_K_M | Intel i9-14900K | 96 | 113 | 1.17x |
TinyLLaMA 1.1B | Q4_0 | Intel i9-14900K | 86 | - | - |
TinyLLaMA 1.1B | Q5_K_M | Intel i9-14900K | 74 | 83 | 1.12x |
TinyLLaMA 1.1B | Q8_0 | Intel i9-14900K | 64 | - | - |
TinyLLaMA 1.1B | F16 | Intel i9-14900K | 41 | - | - |
TinyLLaMA 1.1B | BF16 | Intel i9-14900K | 68 | - | - |
TinyLLaMA 1.1B | Q2_K | Ryzen 7995WX | 129 | 160 | 1.24x |
TinyLLaMA 1.1B | Q3_K_S | Ryzen 7995WX | 123 | 158 | 1.28x |
TinyLLaMA 1.1B | Q3_K_M | Ryzen 7995WX | 122 | 160 | 1.31x |
TinyLLaMA 1.1B | Q4_0 | Ryzen 7995WX | 129 | - | - |
TinyLLaMA 1.1B | Q5_K_M | Ryzen 7995WX | 109 | 147 | 1.34x |
TinyLLaMA 1.1B | F16 | Ryzen 7995WX | 88 | - | - |
TinyLLaMA 1.1B | BF16 | Ryzen 7995WX | 79 | - | - |
LLaMA 3 8B | Q4_0 | Intel i9-9900 | 6 | - | - |
LLaMA 3 8B | Q4_K_M | Intel i9-9900 | 6 | 6 | 1.00x |
LLaMA 3 8B | Q4_0 | Intel i9-14900K | 16 | - | - |
LLaMA 3 8B | Q4_K_M | Intel i9-14900K | 15 | 16 | 1.06x |
LLaMA 3 8B | F16 | Intel i9-14900K | 6 | - | - |
LLaMA 3 8B | Q3_K_S | Ryzen 7995WX | 34 | 46 | 1.35x |
LLaMA 3 8B | Q4_0 | Ryzen 7995WX | 37 | - | - |
LLaMA 3 8B | Q4_K_S | Ryzen 7995WX | 32 | 42 | 1.31x |
LLaMA 3 8B | F16 | Ryzen 7995WX | 19 | - | - |
LLaMA 3 8B | BF16 | Ryzen 7995WX | 20 | - | - |
LLaMA 3 70B | Q2_K | Ryzen 7995WX | 6 | 8 | 1.33x |
LLaMA 3 70B | Q3_K_S | Ryzen 7995WX | 6 | 7 | 1.16x |
LLaMA 3 70B | Q4_0 | Ryzen 7995WX | 5 | - | - |
LLaMA 3 70B | F16 | Ryzen 7995WX | 2 | - | - |
LLaMA 3 70B | BF16 | Ryzen 7995WX | 2 | - | - |
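The speedup column in both tables is simply the ratio of the after and before throughputs; a quick sketch (the `speedup` helper is illustrative, not part of the PR):

```python
def speedup(before_tps: float, after_tps: float) -> float:
    """Ratio of throughputs in tokens/second; > 1.0 means the PR is faster."""
    return after_tps / before_tps

# Spot-check two rows from the tables above.
print(f"{speedup(188, 386):.2f}x")  # LLaMA 3 8B Q4_K_S on Ryzen 7995WX -> 2.05x
print(f"{speedup(6, 8):.2f}x")      # LLaMA 3 70B Q2_K on Ryzen 7995WX -> 1.33x
```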
@ikawrakow thank you for this major contribution to the project!
Looks good to me. Once I get a release out, how would you like to announce it to the world? I would like to write a blog post. If you write your own, then I'm happy to tweet that.
I'm not much into blogging, so if you like writing about this, please go ahead.
As discussed elsewhere, here is a PR that improves AVX2 prompt processing for k-quants and `IQ4_XS` by a large margin. I did not manage to get the speed gains via tinyBLAS, so I just added a call in `llamafile_sgemm()` to a separate function that performs the matrix multiplication.

The table shows a comparison between prompt processing speed on master and with this PR. Not having the `llama-bench` tool here, and not knowing a better way to measure performance, I just used the `perplexity` tool to measure the time for a batch of 512 tokens to get these values. Tested on a 16-core Ryzen-7950X CPU with a 7B LLaMA model.

For reference, here is what I measure on my system for `fp16` and quants not affected by this PR:

I.e., all k-quants and `IQ4_XS` are now faster than `fp16`!

The speedup in this PR is in most cases better than what I reported here, due to some additional refinements I have added since that post, but a few percent slower than what I get in my private `llama.cpp` fork (with `Q2_K_S` having the most noticeable difference, as I get 178 t/s there). Being new to `llamafile`, I'm not sure what is causing such performance differences for the exact same matrix multiplication implementation.

The same approach as here results in huge performance gains for the other i-quants (`IQ2_XXS`, `IQ2_XS`, `IQ2_S`, `IQ3_XXS`, `IQ3_S`). But having modified these quants in my repository in ways that make them incompatible with mainline `llama.cpp` i-quants, I have left that part for a future PR.

The Ryzen-7950X implements various parts of the `AVX512` specification. To make sure that this PR also provides a speedup on non-`AVX512` CPUs, I tested on an older 32-core Ryzen-5975WX as well. Here I get the following performance for `fp16` and unaffected quants:

For k-quants and `IQ4_XS` we have:
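The per-batch timing described above converts to the throughput numbers in the tables by dividing the batch size by the measured wall time; a minimal sketch (the 3.2 s timing is an illustrative value, not a measurement from the PR):

```python
def tokens_per_second(batch_tokens: int, batch_seconds: float) -> float:
    """Throughput implied by the wall time of one processed batch."""
    return batch_tokens / batch_seconds

# E.g. if the perplexity tool takes 3.2 s for a 512-token batch:
print(f"{tokens_per_second(512, 3.2):.0f} t/s")  # -> 160 t/s
```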