Closed · ikawrakow closed this 1 month ago
This is a remarkable change @ikawrakow. I'm very happy to see that the best quantized formats will now go the fastest. For prompt processing, I'm consistently seeing speedups between 1.2x and 2.0x on x86-64 machines. You even managed to make token generation go faster (which I've found much more difficult), in some cases by as much as 1.33x! Here are my measurements, on three different computers, for three different models.
Before: `89c189e9f8212c45621254bce0599e4b49568a4d` After: `ddb9a8c55281c029961cb0d06a5b43676cbb6ac8`
Prompt processing (tokens/second):

MODEL | quant | microprocessor | before (t/s) | after (t/s) | speedup |
---|---|---|---|---|---|
TinyLLaMA 1.1B | Q2_K | Intel i9-9900 | 204 | 340 | 1.66x |
TinyLLaMA 1.1B | Q3_K_S | Intel i9-9900 | 160 | 317 | 1.98x |
TinyLLaMA 1.1B | Q3_K_M | Intel i9-9900 | 174 | 309 | 1.77x |
TinyLLaMA 1.1B | Q4_0 | Intel i9-9900 | 167 | - | - |
TinyLLaMA 1.1B | Q5_K_M | Intel i9-9900 | 147 | 280 | 1.90x |
TinyLLaMA 1.1B | Q8_0 | Intel i9-9900 | 219 | - | - |
TinyLLaMA 1.1B | F16 | Intel i9-9900 | 251 | - | - |
TinyLLaMA 1.1B | BF16 | Intel i9-9900 | 222 | - | - |
TinyLLaMA 1.1B | Q2_K | Intel i9-14900K | 300 | 600 | 2.00x |
TinyLLaMA 1.1B | Q3_K_S | Intel i9-14900K | 289 | 606 | 2.10x |
TinyLLaMA 1.1B | Q3_K_M | Intel i9-14900K | 316 | 606 | 1.92x |
TinyLLaMA 1.1B | Q4_0 | Intel i9-14900K | 418 | - | - |
TinyLLaMA 1.1B | Q5_K_M | Intel i9-14900K | 275 | 570 | 2.07x |
TinyLLaMA 1.1B | Q8_0 | Intel i9-14900K | 467 | - | - |
TinyLLaMA 1.1B | F16 | Intel i9-14900K | 405 | - | - |
TinyLLaMA 1.1B | BF16 | Intel i9-14900K | 97 | - | - |
TinyLLaMA 1.1B | Q2_K | Ryzen 7995WX | 1350 | 1667 | 1.23x |
TinyLLaMA 1.1B | Q3_K_S | Ryzen 7995WX | 1181 | 1648 | 1.39x |
TinyLLaMA 1.1B | Q3_K_M | Ryzen 7995WX | 1248 | 1636 | 1.31x |
TinyLLaMA 1.1B | Q4_0 | Ryzen 7995WX | 1379 | - | - |
TinyLLaMA 1.1B | Q5_K_M | Ryzen 7995WX | 961 | 1626 | 1.69x |
TinyLLaMA 1.1B | F16 | Ryzen 7995WX | 1230 | - | - |
TinyLLaMA 1.1B | BF16 | Ryzen 7995WX | 1800 | - | - |
LLaMA 3 8B | Q4_0 | Intel i9-9900 | 27 | - | - |
LLaMA 3 8B | Q4_K_M | Intel i9-9900 | 28 | 41 | 1.46x |
LLaMA 3 8B | Q4_0 | Intel i9-14900K | 62 | - | - |
LLaMA 3 8B | Q4_K_M | Intel i9-14900K | 57 | 90 | 1.57x |
LLaMA 3 8B | F16 | Intel i9-14900K | 59 | - | - |
LLaMA 3 8B | Q3_K_S | Ryzen 7995WX | 225 | 416 | 1.84x |
LLaMA 3 8B | Q4_0 | Ryzen 7995WX | 278 | - | - |
LLaMA 3 8B | Q4_K_S | Ryzen 7995WX | 188 | 386 | 2.05x |
LLaMA 3 8B | F16 | Ryzen 7995WX | 357 | - | - |
LLaMA 3 8B | BF16 | Ryzen 7995WX | 508 | - | - |
LLaMA 3 70B | Q2_K | Ryzen 7995WX | 31 | 51 | 1.65x |
LLaMA 3 70B | Q3_K_S | Ryzen 7995WX | 23 | 44 | 1.91x |
LLaMA 3 70B | Q4_0 | Ryzen 7995WX | 31 | - | - |
LLaMA 3 70B | F16 | Ryzen 7995WX | 42 | - | - |
LLaMA 3 70B | BF16 | Ryzen 7995WX | 65 | - | - |
Token generation (tokens/second):

MODEL | quant | microprocessor | before (t/s) | after (t/s) | speedup |
---|---|---|---|---|---|
TinyLLaMA 1.1B | Q2_K | Intel i9-9900 | 48 | 57 | 1.18x |
TinyLLaMA 1.1B | Q3_K_S | Intel i9-9900 | 44 | 50 | 1.13x |
TinyLLaMA 1.1B | Q3_K_M | Intel i9-9900 | 42 | 47 | 1.11x |
TinyLLaMA 1.1B | Q4_0 | Intel i9-9900 | 34 | - | - |
TinyLLaMA 1.1B | Q5_K_M | Intel i9-9900 | 32 | 35 | 1.09x |
TinyLLaMA 1.1B | Q8_0 | Intel i9-9900 | 25 | - | - |
TinyLLaMA 1.1B | F16 | Intel i9-9900 | 15 | - | - |
TinyLLaMA 1.1B | BF16 | Intel i9-9900 | 15 | - | - |
TinyLLaMA 1.1B | Q2_K | Intel i9-14900K | 102 | 129 | 1.26x |
TinyLLaMA 1.1B | Q3_K_S | Intel i9-14900K | 99 | 125 | 1.26x |
TinyLLaMA 1.1B | Q3_K_M | Intel i9-14900K | 96 | 113 | 1.17x |
TinyLLaMA 1.1B | Q4_0 | Intel i9-14900K | 86 | - | - |
TinyLLaMA 1.1B | Q5_K_M | Intel i9-14900K | 74 | 83 | 1.12x |
TinyLLaMA 1.1B | Q8_0 | Intel i9-14900K | 64 | - | - |
TinyLLaMA 1.1B | F16 | Intel i9-14900K | 41 | - | - |
TinyLLaMA 1.1B | BF16 | Intel i9-14900K | 68 | - | - |
TinyLLaMA 1.1B | Q2_K | Ryzen 7995WX | 129 | 160 | 1.24x |
TinyLLaMA 1.1B | Q3_K_S | Ryzen 7995WX | 123 | 158 | 1.28x |
TinyLLaMA 1.1B | Q3_K_M | Ryzen 7995WX | 122 | 160 | 1.31x |
TinyLLaMA 1.1B | Q4_0 | Ryzen 7995WX | 129 | - | - |
TinyLLaMA 1.1B | Q5_K_M | Ryzen 7995WX | 109 | 147 | 1.34x |
TinyLLaMA 1.1B | F16 | Ryzen 7995WX | 88 | - | - |
TinyLLaMA 1.1B | BF16 | Ryzen 7995WX | 79 | - | - |
LLaMA 3 8B | Q4_0 | Intel i9-9900 | 6 | - | - |
LLaMA 3 8B | Q4_K_M | Intel i9-9900 | 6 | 6 | 1.00x |
LLaMA 3 8B | Q4_0 | Intel i9-14900K | 16 | - | - |
LLaMA 3 8B | Q4_K_M | Intel i9-14900K | 15 | 16 | 1.06x |
LLaMA 3 8B | F16 | Intel i9-14900K | 6 | - | - |
LLaMA 3 8B | Q3_K_S | Ryzen 7995WX | 34 | 46 | 1.35x |
LLaMA 3 8B | Q4_0 | Ryzen 7995WX | 37 | - | - |
LLaMA 3 8B | Q4_K_S | Ryzen 7995WX | 32 | 42 | 1.31x |
LLaMA 3 8B | F16 | Ryzen 7995WX | 19 | - | - |
LLaMA 3 8B | BF16 | Ryzen 7995WX | 20 | - | - |
LLaMA 3 70B | Q2_K | Ryzen 7995WX | 6 | 8 | 1.33x |
LLaMA 3 70B | Q3_K_S | Ryzen 7995WX | 6 | 7 | 1.16x |
LLaMA 3 70B | Q4_0 | Ryzen 7995WX | 5 | - | - |
LLaMA 3 70B | F16 | Ryzen 7995WX | 2 | - | - |
LLaMA 3 70B | BF16 | Ryzen 7995WX | 2 | - | - |
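The speedup column in both tables is simply the ratio of the after and before throughputs; a quick sketch (the `speedup` helper is illustrative, not part of the PR):

```python
def speedup(before_tps: float, after_tps: float) -> float:
    """Ratio of throughputs in tokens/second; > 1.0 means the PR is faster."""
    return after_tps / before_tps

# Spot-check two rows from the tables above.
print(f"{speedup(188, 386):.2f}x")  # LLaMA 3 8B Q4_K_S on Ryzen 7995WX -> 2.05x
print(f"{speedup(6, 8):.2f}x")      # LLaMA 3 70B Q2_K on Ryzen 7995WX -> 1.33x
```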
@ikawrakow thank you for this major contribution to the project!
Looks good to me. Once I get a release out, how would you like to announce it to the world? I would like to write a blog post. If you write your own, then I'm happy to tweet that.
I'm not much into blogging, so if you like writing about this, please go ahead.
As discussed elsewhere, here is a PR that improves AVX2 prompt processing for k-quants and `IQ4_XS` by a large margin. I did not manage to get the speed gains via tinyBLAS, so I just added a call in `llamafile_sgemm()` to a separate function that performs the matrix multiplication.

The table shows a comparison between prompt processing speed on master and with this PR. Not having the `llama-bench` tool here, and not knowing a better way to measure performance, I just used the `perplexity` tool to measure the time for a batch of 512 tokens to get these values. Tested on a 16-core Ryzen-7950X CPU with a 7B LLaMA model.

For reference, here is what I measure on my system for `fp16` and quants not affected by this PR:

I.e., all k-quants and `IQ4_XS` are now faster than `fp16`!

The speedup in this PR is in most cases better than what I reported here, due to some additional refinements I have added since that post, but a few percent slower than what I get in my private `llama.cpp` fork (with `Q2_K_S` having the most noticeable difference, as I get 178 t/s there). Being new to `llamafile`, I'm not sure what is causing such performance differences for the exact same matrix multiplication implementation.

The same approach as here results in huge performance gains for the other i-quants (`IQ2_XXS`, `IQ2_XS`, `IQ2_S`, `IQ3_XXS`, `IQ3_S`). But having modified these quants in my repository in ways that make them incompatible with mainline `llama.cpp` i-quants, I have left that part for a future PR.

The Ryzen-7950X implements various parts of the `AVX512` specification. To make sure that this PR also provides a speedup on non-`AVX512` CPUs, I tested on an older 32-core Ryzen-5975WX as well. Here I get the following performance for `fp16` and unaffected quants:

For k-quants and `IQ4_XS` we have:
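The per-batch timing described above converts to the throughput numbers in the tables by dividing the batch size by the measured wall time; a minimal sketch (the 3.2 s timing is an illustrative value, not a measurement from the PR):

```python
def tokens_per_second(batch_tokens: int, batch_seconds: float) -> float:
    """Throughput implied by the wall time of one processed batch."""
    return batch_tokens / batch_seconds

# E.g. if the perplexity tool takes 3.2 s for a 512-token batch:
print(f"{tokens_per_second(512, 3.2):.0f} t/s")  # -> 160 t/s
```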