Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Performance improvements on Arm for legacy and k-quants #453

Closed · ikawrakow closed this 1 month ago

ikawrakow commented 1 month ago

This PR adds matrix multiplication implementations for legacy and k-quants on __aarch64__ that are significantly more performant.
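For readers who want a feel for what such kernels look like, here is a minimal, self-contained sketch of the general technique rather than the actual code in this PR: the activations are quantized into blocks of signed 8-bit values, and the dot product with a quantized weight row is accumulated with the Arm dot-product instruction (the `+dotprod` feature listed in the tables below). The block type and function names are illustrative assumptions, not ggml's real definitions.

```cpp
// Illustrative sketch only: a simplified 8-bit quantization block (the real
// ggml/llamafile block layouts differ, e.g. they store fp16 scales).
#include <arm_neon.h>
#include <cstdint>

struct block_i8 {      // hypothetical block: 32 signed 8-bit quants + a scale
    float  d;          // per-block scale
    int8_t qs[32];     // quantized values
};

// Dot product of a quantized weight row with a quantized activation row,
// both stored as n_blocks blocks of 32 values.
// Requires the Armv8.2 dot-product extension (-march=armv8.2-a+dotprod).
static float row_dot(const block_i8 *x, const block_i8 *y, int n_blocks) {
    float sum = 0.0f;
    for (int i = 0; i < n_blocks; ++i) {
        int32x4_t acc = vdupq_n_s32(0);
        // vdotq_s32 multiplies 16 int8 pairs and accumulates into 4 int32 lanes
        acc = vdotq_s32(acc, vld1q_s8(x[i].qs),      vld1q_s8(y[i].qs));
        acc = vdotq_s32(acc, vld1q_s8(x[i].qs + 16), vld1q_s8(y[i].qs + 16));
        sum += x[i].d * y[i].d * (float)vaddvq_s32(acc);
    }
    return sum;
}
```

The dot product itself is the easy part; the large prompt-processing gains typically come from blocking over several rows and columns at once so each quantized block is loaded from memory once and reused, which is presumably where most of the tuning in a change like this happens.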

The following table compares performance between the main branch and this PR for a 7B LLaMA model running on an M2 Max. We observe prompt processing speed improvements of up to a factor of 3.6, and even performance gains for token generation, despite this being a memory-bound problem. The performance gains for Q4_0 and Q8_0 are smaller because the main branch already uses tinyBLAS for these quants (i.e., the 1.6X/1.35X improvement comes on top of the ~2X improvement due to tinyBLAS).

| cpu_info | model_filename | size | test | t/s (main) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| Apple M2 Max (+fp16+dotprod) | q80 | 6.67 GiB | pp512 | 63.33 | 85.46 | 1.349 |
| Apple M2 Max (+fp16+dotprod) | q40 | 3.56 GiB | pp512 | 55.65 | 88.97 | 1.599 |
| Apple M2 Max (+fp16+dotprod) | q41 | 3.95 GiB | pp512 | 22.51 | 75.98 | 3.375 |
| Apple M2 Max (+fp16+dotprod) | q50 | 4.33 GiB | pp512 | 19.94 | 71.91 | 3.606 |
| Apple M2 Max (+fp16+dotprod) | q51 | 4.72 GiB | pp512 | 17.42 | 61.54 | 3.533 |
| Apple M2 Max (+fp16+dotprod) | q2ks | 2.16 GiB | pp512 | 23.01 | 69.15 | 3.001 |
| Apple M2 Max (+fp16+dotprod) | q3ks | 2.75 GiB | pp512 | 16.98 | 52.05 | 3.065 |
| Apple M2 Max (+fp16+dotprod) | q4ks | 3.59 GiB | pp512 | 25.88 | 74.59 | 2.882 |
| Apple M2 Max (+fp16+dotprod) | q5ks | 4.33 GiB | pp512 | 19.58 | 57.69 | 2.946 |
| Apple M2 Max (+fp16+dotprod) | q6k | 5.15 GiB | pp512 | 18.17 | 52.79 | 2.905 |
| Apple M2 Max (+fp16+dotprod) | iq4xs | 3.37 GiB | pp512 | 23.72 | 72.03 | 3.037 |
| Apple M2 Max (+fp16+dotprod) | q80 | 6.67 GiB | tg128 | 15.68 | 16.27 | 1.038 |
| Apple M2 Max (+fp16+dotprod) | q40 | 3.56 GiB | tg128 | 27.06 | 27.63 | 1.021 |
| Apple M2 Max (+fp16+dotprod) | q41 | 3.95 GiB | tg128 | 19.44 | 25.24 | 1.298 |
| Apple M2 Max (+fp16+dotprod) | q50 | 4.33 GiB | tg128 | 17.46 | 19.22 | 1.101 |
| Apple M2 Max (+fp16+dotprod) | q51 | 4.72 GiB | tg128 | 15.25 | 17.99 | 1.180 |
| Apple M2 Max (+fp16+dotprod) | q2ks | 2.16 GiB | tg128 | 19.64 | 26.14 | 1.331 |
| Apple M2 Max (+fp16+dotprod) | q3ks | 2.75 GiB | tg128 | 15.07 | 18.00 | 1.194 |
| Apple M2 Max (+fp16+dotprod) | q4ks | 3.59 GiB | tg128 | 21.59 | 26.93 | 1.247 |
| Apple M2 Max (+fp16+dotprod) | q5ks | 4.33 GiB | tg128 | 17.49 | 18.75 | 1.072 |
| Apple M2 Max (+fp16+dotprod) | q6k | 5.15 GiB | tg128 | 15.75 | 19.97 | 1.268 |
| Apple M2 Max (+fp16+dotprod) | iq4xs | 3.37 GiB | tg128 | 21.14 | 23.30 | 1.102 |

As llamafile performance on my M2 Max laptop is lower than mainline llama.cpp, I also integrated these changes into current llama.cpp (build 2980, commit hash dacfcebd) to compare performance. The following table summarizes the results. For an apples-to-apples comparison, the performance values for the master llama.cpp branch were obtained with the Accelerate framework disabled. Here, too, the performance gains are significant: up to 2.6X for Q2_K_S.

| model | size | params | test | t/s (master) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | pp512 | 78.17 ± 1.18 | 96.78 ± 0.25 | 1.238 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | pp512 | 68.04 ± 1.18 | 79.32 ± 0.76 | 1.166 |
| llama 7B Q4_1 | 3.95 GiB | 6.74 B | pp512 | 37.51 ± 0.61 | 67.96 ± 0.74 | 1.812 |
| llama 7B Q5_0 | 4.33 GiB | 6.74 B | pp512 | 30.24 ± 0.12 | 70.86 ± 0.03 | 2.343 |
| llama 7B Q5_1 | 4.72 GiB | 6.74 B | pp512 | 26.27 ± 0.09 | 60.84 ± 0.05 | 2.316 |
| llama 7B Q2_K_S | 2.16 GiB | 6.74 B | pp512 | 32.98 ± 1.47 | 85.53 ± 0.20 | 2.593 |
| llama 7B Q3_K_S | 2.75 GiB | 6.74 B | pp512 | 26.01 ± 0.02 | 62.02 ± 0.73 | 2.385 |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | pp512 | 44.62 ± 0.80 | 77.01 ± 1.22 | 1.726 |
| llama 7B Q5_K_S | 4.33 GiB | 6.74 B | pp512 | 29.31 ± 0.04 | 69.16 ± 1.17 | 2.360 |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | pp512 | 28.07 ± 0.03 | 62.85 ± 0.96 | 2.239 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | tg128 | 16.35 ± 0.10 | 16.74 ± 0.06 | 1.024 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | tg128 | 27.28 ± 0.10 | 29.59 ± 0.08 | 1.085 |
| llama 7B Q4_1 | 3.95 GiB | 6.74 B | tg128 | 25.15 ± 0.16 | 26.97 ± 0.13 | 1.072 |
| llama 7B Q5_0 | 4.33 GiB | 6.74 B | tg128 | 22.08 ± 0.83 | 24.18 ± 0.15 | 1.095 |
| llama 7B Q5_1 | 4.72 GiB | 6.74 B | tg128 | 20.45 ± 0.45 | 21.73 ± 0.26 | 1.063 |
| llama 7B Q2_K_S | 2.16 GiB | 6.74 B | tg128 | 28.34 ± 0.20 | 37.59 ± 0.32 | 1.326 |
| llama 7B Q3_K_S | 2.75 GiB | 6.74 B | tg128 | 22.73 ± 0.03 | 26.08 ± 0.09 | 1.146 |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | tg128 | 26.56 ± 0.10 | 27.82 ± 0.32 | 1.047 |
| llama 7B Q5_K_S | 4.33 GiB | 6.74 B | tg128 | 22.11 ± 0.18 | 23.73 ± 0.12 | 1.074 |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | tg128 | 19.45 ± 0.13 | 20.52 ± 0.06 | 1.055 |
ikawrakow commented 1 month ago

I forgot to add a Q8_0 implementation (required because of the reordering of the quantized activations), so converting to draft until I add it.
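For context on what the reordering refers to, here is a sketch of the general idea under assumed layouts (not the PR's actual code): when a kernel multiplies the weights against several activation columns at once, the activations are first quantized to 8 bits and their blocks interleaved across those columns, so one pass over the weights streams the activation data contiguously; that packing step is why a matching Q8 quantization routine is needed.

```cpp
// Illustrative only: interleave the quantization blocks of 4 activation
// columns so that a kernel processing 4 columns per pass reads them
// contiguously. Block type and layout are assumptions, not ggml's.
#include <cstdint>

struct block_i8 { float d; int8_t qs[32]; };  // hypothetical 32-value block

// src[c] points to the n_blocks blocks of activation column c.
// dst receives: block 0 of cols 0..3, block 1 of cols 0..3, and so on.
static void interleave_x4(const block_i8 *const src[4], block_i8 *dst, int n_blocks) {
    for (int i = 0; i < n_blocks; ++i)
        for (int c = 0; c < 4; ++c)
            *dst++ = src[c][i];
}
```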

jart commented 1 month ago

Here are the improvements on my Mac Studio. Enormous gains for Q5_K_M, Q6_K, and Q5_0!! I'm actually very pleased that you're optimizing the legacy quants too, given weird new models like IBM Granite 34b.

| cpu_info | model_filename | size | test | t/s before | t/s after | speedup |
| --- | --- | --- | --- | --- | --- | --- |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | pp512 | 693.92 | 883.96 | 1.27x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | tg16 | 70.39 | 103.10 | 1.46x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | pp512 | 222.32 | 617.74 | 2.78x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | tg16 | 96.01 | 96.93 | 1.01x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | pp512 | 244.09 | 658.62 | 2.70x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | tg16 | 93.74 | 103.06 | 1.10x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | pp512 | 245.62 | 809.91 | 3.30x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | tg16 | 96.11 | 106.78 | 1.11x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q4_0 | 606.53 MiB | pp512 | 625.47 | 943.14 | 1.51x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q4_0 | 606.53 MiB | tg16 | 129.34 | 124.60 | 0.96x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | pp512 | 249.27 | 694.66 | 2.79x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | tg16 | 108.34 | 105.45 | 0.97x |

The gains are also enormous on the Raspberry Pi. Going 2x to 3x faster is huge. I've gotten F16 to go as fast as 80 tok/sec (not sure why it doesn't anymore; it could potentially be due to cooling). However, I'm noticing that prediction is slowing down a bit on the RPI5. Did you do anything to change that? Once again, it could be cooling. If you have any ideas, send me a follow-up change. With tinyBLAS, in many cases it'll punt control back to GGML when n=1. The special codepaths should only run when they add value (a rough sketch of that guard follows the table below).

| cpu_info | model_filename | size | test | t/s before | t/s after | speedup |
| --- | --- | --- | --- | --- | --- | --- |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.F16 | 2.05 GiB | pp512 | 66.53 | 66.53 | 1.00x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.F16 | 2.05 GiB | tg16 | 4.26 | 4.26 | 1.00x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | pp512 | 44.92 | 55.41 | 1.23x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | tg16 | 8.38 | 7.90 | 0.94x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | pp512 | 18.20 | 37.59 | 2.07x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | tg16 | 11.48 | 9.66 | 0.84x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | pp512 | 19.38 | 41.25 | 2.13x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | tg16 | 13.41 | 10.22 | 0.76x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | pp512 | 17.64 | 46.45 | 2.63x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | tg16 | 11.83 | 11.12 | 0.94x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | pp512 | 18.80 | 44.74 | 2.38x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | tg16 | 14.54 | 14.79 | 1.02x |
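The n=1 guard mentioned above is, roughly, a shape check before dispatching the optimized kernel; the sketch below uses hypothetical names, not llamafile's actual functions.

```cpp
// Rough sketch of "only take the special codepath when it adds value".
// fast_tiled_kernel() and ggml_fallback() are hypothetical stand-ins.
static void mul_mat(int m, int n, int k /*, quantized operands... */) {
    if (n == 1) {
        // token generation: a matrix-vector product that is memory bound,
        // so the plain per-row dot-product path is already near optimal
        // ggml_fallback(...);
        return;
    }
    // prompt processing: n is large (e.g. 512), so blocking over rows and
    // columns amortizes loading the quantized weights and pays off
    // fast_tiled_kernel(...);
}
```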
ikawrakow commented 1 month ago

> However I'm noticing that prediction is slowing down a bit on RPI5. Did you do anything to change that?

TG is severely limited by memory bandwidth and hence extremely sensitive to memory access patterns. I had to experiment quite a bit to get good results for both PP and TG on the M2. I guess if the RPI5 is an important target, I would need to test on that as well.
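As a rough back-of-the-envelope bound (the bandwidth figure is an assumption, not a measurement): every generated token has to stream essentially the whole model through memory once, so with, say, ~10 GB/s of usable RPI5 bandwidth and the 729.84 MiB (~0.77 GB) Q5_0 model above, the ceiling would be around 10 / 0.77 ≈ 13 t/s, which is close to the ~11-12 t/s measured. At that point, any change in access pattern that wastes bandwidth shows up directly in TG.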

jart commented 1 month ago

We're only talking about ~15%, so chances are it's just noise. It feels like only yesterday that TG was 2-4 t/s, so I'm very pleased at how fast things have progressed over the last year with these $100 computers.

Janghou commented 6 days ago

FYI, an RPI5 won't throttle with an active cooler or case fan.

Anyhow, you can check whether an RPI5 has throttled:

```
> vcgencmd get_throttled
throttled=0x0
```

If the value is different from 0x0, there is a problem; a Pi can also throttle due to insufficient power.

https://www.raspberrypi.com/documentation/computers/os.html#get_throttled