Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Performance improvements on Arm for legacy and k-quants #453

Closed · ikawrakow closed this 1 month ago

ikawrakow commented 1 month ago

This PR adds matrix multiplication implementations for legacy and k-quants on __aarch64__ that are significantly more performant.
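For readers who want a feel for what such kernels look like, here is a minimal, self-contained sketch of the general technique rather than the actual code in this PR: the activations are quantized into blocks of signed 8-bit values, and the dot product with a quantized weight row is accumulated with the Arm dot-product instruction (the `+dotprod` feature listed in the tables below). The block type and function names are illustrative assumptions, not ggml's real definitions.

```cpp
// Illustrative sketch only: a simplified 8-bit quantization block (the real
// ggml/llamafile block layouts differ, e.g. they store fp16 scales).
#include <arm_neon.h>
#include <cstdint>

struct block_i8 {      // hypothetical block: 32 signed 8-bit quants + a scale
    float  d;          // per-block scale
    int8_t qs[32];     // quantized values
};

// Dot product of a quantized weight row with a quantized activation row,
// both stored as n_blocks blocks of 32 values.
// Requires the Armv8.2 dot-product extension (-march=armv8.2-a+dotprod).
static float row_dot(const block_i8 *x, const block_i8 *y, int n_blocks) {
    float sum = 0.0f;
    for (int i = 0; i < n_blocks; ++i) {
        int32x4_t acc = vdupq_n_s32(0);
        // vdotq_s32 multiplies 16 int8 pairs and accumulates into 4 int32 lanes
        acc = vdotq_s32(acc, vld1q_s8(x[i].qs),      vld1q_s8(y[i].qs));
        acc = vdotq_s32(acc, vld1q_s8(x[i].qs + 16), vld1q_s8(y[i].qs + 16));
        sum += x[i].d * y[i].d * (float)vaddvq_s32(acc);
    }
    return sum;
}
```

The dot product itself is the easy part; the large prompt-processing gains typically come from blocking over several rows and columns at once so each quantized block is loaded from memory once and reused, which is presumably where most of the tuning in a change like this happens.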

The following table compares performance between the main branch and this PR for a 7B LLaMA model running on an M2 Max. We observe prompt processing speed improvements of up to a factor of 3.6, and even performance gains for token generation, despite this being a memory-bound problem. The performance gains for Q4_0 and Q8_0 are smaller because the main branch already uses tinyBLAS for these quants (i.e., the 1.6X/1.35X improvement comes on top of the ~2X improvement due to tinyBLAS).

| cpu_info | model_filename | size | test | t/s (main) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| Apple M2 Max (+fp16+dotprod) | q80 | 6.67 GiB | pp512 | 63.33 | 85.46 | 1.349 |
| Apple M2 Max (+fp16+dotprod) | q40 | 3.56 GiB | pp512 | 55.65 | 88.97 | 1.599 |
| Apple M2 Max (+fp16+dotprod) | q41 | 3.95 GiB | pp512 | 22.51 | 75.98 | 3.375 |
| Apple M2 Max (+fp16+dotprod) | q50 | 4.33 GiB | pp512 | 19.94 | 71.91 | 3.606 |
| Apple M2 Max (+fp16+dotprod) | q51 | 4.72 GiB | pp512 | 17.42 | 61.54 | 3.533 |
| Apple M2 Max (+fp16+dotprod) | q2ks | 2.16 GiB | pp512 | 23.01 | 69.15 | 3.001 |
| Apple M2 Max (+fp16+dotprod) | q3ks | 2.75 GiB | pp512 | 16.98 | 52.05 | 3.065 |
| Apple M2 Max (+fp16+dotprod) | q4ks | 3.59 GiB | pp512 | 25.88 | 74.59 | 2.882 |
| Apple M2 Max (+fp16+dotprod) | q5ks | 4.33 GiB | pp512 | 19.58 | 57.69 | 2.946 |
| Apple M2 Max (+fp16+dotprod) | q6k | 5.15 GiB | pp512 | 18.17 | 52.79 | 2.905 |
| Apple M2 Max (+fp16+dotprod) | iq4xs | 3.37 GiB | pp512 | 23.72 | 72.03 | 3.037 |
| Apple M2 Max (+fp16+dotprod) | q80 | 6.67 GiB | tg128 | 15.68 | 16.27 | 1.038 |
| Apple M2 Max (+fp16+dotprod) | q40 | 3.56 GiB | tg128 | 27.06 | 27.63 | 1.021 |
| Apple M2 Max (+fp16+dotprod) | q41 | 3.95 GiB | tg128 | 19.44 | 25.24 | 1.298 |
| Apple M2 Max (+fp16+dotprod) | q50 | 4.33 GiB | tg128 | 17.46 | 19.22 | 1.101 |
| Apple M2 Max (+fp16+dotprod) | q51 | 4.72 GiB | tg128 | 15.25 | 17.99 | 1.180 |
| Apple M2 Max (+fp16+dotprod) | q2ks | 2.16 GiB | tg128 | 19.64 | 26.14 | 1.331 |
| Apple M2 Max (+fp16+dotprod) | q3ks | 2.75 GiB | tg128 | 15.07 | 18.00 | 1.194 |
| Apple M2 Max (+fp16+dotprod) | q4ks | 3.59 GiB | tg128 | 21.59 | 26.93 | 1.247 |
| Apple M2 Max (+fp16+dotprod) | q5ks | 4.33 GiB | tg128 | 17.49 | 18.75 | 1.072 |
| Apple M2 Max (+fp16+dotprod) | q6k | 5.15 GiB | tg128 | 15.75 | 19.97 | 1.268 |
| Apple M2 Max (+fp16+dotprod) | iq4xs | 3.37 GiB | tg128 | 21.14 | 23.30 | 1.102 |

As llamafile performance on my M2 Max laptop is lower than mainline llama.cpp, I also integrated these changes into current llama.cpp (build 2980, commit hash dacfcebd) to compare performance. The following table summarizes the results. For an apples-to-apples comparison, the performance values for the master llama.cpp branch were obtained with the Accelerate framework disabled. Here, too, the performance gains are significant: up to 2.6X for Q2_K_S.

| model | size | params | test | t/s (master) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | pp512 | 78.17 ± 1.18 | 96.78 ± 0.25 | 1.238 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | pp512 | 68.04 ± 1.18 | 79.32 ± 0.76 | 1.166 |
| llama 7B Q4_1 | 3.95 GiB | 6.74 B | pp512 | 37.51 ± 0.61 | 67.96 ± 0.74 | 1.812 |
| llama 7B Q5_0 | 4.33 GiB | 6.74 B | pp512 | 30.24 ± 0.12 | 70.86 ± 0.03 | 2.343 |
| llama 7B Q5_1 | 4.72 GiB | 6.74 B | pp512 | 26.27 ± 0.09 | 60.84 ± 0.05 | 2.316 |
| llama 7B Q2_K_S | 2.16 GiB | 6.74 B | pp512 | 32.98 ± 1.47 | 85.53 ± 0.20 | 2.593 |
| llama 7B Q3_K_S | 2.75 GiB | 6.74 B | pp512 | 26.01 ± 0.02 | 62.02 ± 0.73 | 2.385 |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | pp512 | 44.62 ± 0.80 | 77.01 ± 1.22 | 1.726 |
| llama 7B Q5_K_S | 4.33 GiB | 6.74 B | pp512 | 29.31 ± 0.04 | 69.16 ± 1.17 | 2.360 |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | pp512 | 28.07 ± 0.03 | 62.85 ± 0.96 | 2.239 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | tg128 | 16.35 ± 0.10 | 16.74 ± 0.06 | 1.024 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | tg128 | 27.28 ± 0.10 | 29.59 ± 0.08 | 1.085 |
| llama 7B Q4_1 | 3.95 GiB | 6.74 B | tg128 | 25.15 ± 0.16 | 26.97 ± 0.13 | 1.072 |
| llama 7B Q5_0 | 4.33 GiB | 6.74 B | tg128 | 22.08 ± 0.83 | 24.18 ± 0.15 | 1.095 |
| llama 7B Q5_1 | 4.72 GiB | 6.74 B | tg128 | 20.45 ± 0.45 | 21.73 ± 0.26 | 1.063 |
| llama 7B Q2_K_S | 2.16 GiB | 6.74 B | tg128 | 28.34 ± 0.20 | 37.59 ± 0.32 | 1.326 |
| llama 7B Q3_K_S | 2.75 GiB | 6.74 B | tg128 | 22.73 ± 0.03 | 26.08 ± 0.09 | 1.146 |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | tg128 | 26.56 ± 0.10 | 27.82 ± 0.32 | 1.047 |
| llama 7B Q5_K_S | 4.33 GiB | 6.74 B | tg128 | 22.11 ± 0.18 | 23.73 ± 0.12 | 1.074 |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | tg128 | 19.45 ± 0.13 | 20.52 ± 0.06 | 1.055 |
ikawrakow commented 1 month ago

I forgot to add a Q8_0 implementation (required because of the reordering of the quantized activations), so converting to draft until I add it.
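For context on what the reordering refers to, here is a sketch of the general idea under assumed layouts (not the PR's actual code): when a kernel multiplies the weights against several activation columns at once, the activations are first quantized to 8 bits and their blocks interleaved across those columns, so one pass over the weights streams the activation data contiguously; that packing step is why a matching Q8 quantization routine is needed.

```cpp
// Illustrative only: interleave the quantization blocks of 4 activation
// columns so that a kernel processing 4 columns per pass reads them
// contiguously. Block type and layout are assumptions, not ggml's.
#include <cstdint>

struct block_i8 { float d; int8_t qs[32]; };  // hypothetical 32-value block

// src[c] points to the n_blocks blocks of activation column c.
// dst receives: block 0 of cols 0..3, block 1 of cols 0..3, and so on.
static void interleave_x4(const block_i8 *const src[4], block_i8 *dst, int n_blocks) {
    for (int i = 0; i < n_blocks; ++i)
        for (int c = 0; c < 4; ++c)
            *dst++ = src[c][i];
}
```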

jart commented 1 month ago

Here are the improvements on my Mac Studio. Enormous gains for Q5_K_M, Q6_K, and Q5_0!! I'm actually very pleased that you're optimizing the legacy quants too, given weird new models like IBM Granite 34b.

| cpu_info | model_filename | size | test | t/s before | t/s after | speedup |
| --- | --- | --- | --- | --- | --- | --- |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | pp512 | 693.92 | 883.96 | 1.27x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | tg16 | 70.39 | 103.10 | 1.46x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | pp512 | 222.32 | 617.74 | 2.78x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | tg16 | 96.01 | 96.93 | 1.01x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | pp512 | 244.09 | 658.62 | 2.70x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | tg16 | 93.74 | 103.06 | 1.10x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | pp512 | 245.62 | 809.91 | 3.30x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | tg16 | 96.11 | 106.78 | 1.11x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q4_0 | 606.53 MiB | pp512 | 625.47 | 943.14 | 1.51x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q4_0 | 606.53 MiB | tg16 | 129.34 | 124.60 | 0.96x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | pp512 | 249.27 | 694.66 | 2.79x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | tg16 | 108.34 | 105.45 | 0.97x |

The gains are also enormous on the Raspberry Pi. Going 2x to 3x faster is huge. I've gotten F16 to go as fast as 80 tok/sec (not sure why it doesn't anymore; it could potentially be due to cooling). However, I'm noticing that prediction is slowing down a bit on the RPI5. Did you do anything to change that? Once again, it could be cooling. If you have any ideas, send me a follow-up change. With tinyBLAS, in many cases it'll punt control back to GGML when n=1. The special codepaths should only run when they add value (a rough sketch of that guard follows the table below).

| cpu_info | model_filename | size | test | t/s before | t/s after | speedup |
| --- | --- | --- | --- | --- | --- | --- |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.F16 | 2.05 GiB | pp512 | 66.53 | 66.53 | 1.00x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.F16 | 2.05 GiB | tg16 | 4.26 | 4.26 | 1.00x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | pp512 | 44.92 | 55.41 | 1.23x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | tg16 | 8.38 | 7.90 | 0.94x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | pp512 | 18.20 | 37.59 | 2.07x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | tg16 | 11.48 | 9.66 | 0.84x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | pp512 | 19.38 | 41.25 | 2.13x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | tg16 | 13.41 | 10.22 | 0.76x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | pp512 | 17.64 | 46.45 | 2.63x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | tg16 | 11.83 | 11.12 | 0.94x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | pp512 | 18.80 | 44.74 | 2.38x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | tg16 | 14.54 | 14.79 | 1.02x |
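The n=1 guard mentioned above is, roughly, a shape check before dispatching the optimized kernel; the sketch below uses hypothetical names, not llamafile's actual functions.

```cpp
// Rough sketch of "only take the special codepath when it adds value".
// fast_tiled_kernel() and ggml_fallback() are hypothetical stand-ins.
static void mul_mat(int m, int n, int k /*, quantized operands... */) {
    if (n == 1) {
        // token generation: a matrix-vector product that is memory bound,
        // so the plain per-row dot-product path is already near optimal
        // ggml_fallback(...);
        return;
    }
    // prompt processing: n is large (e.g. 512), so blocking over rows and
    // columns amortizes loading the quantized weights and pays off
    // fast_tiled_kernel(...);
}
```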
ikawrakow commented 1 month ago

> However I'm noticing that prediction is slowing down a bit on RPI5. Did you do anything to change that?

TG is severely limited by memory bandwidth and hence extremely sensitive to memory access patterns. I had to experiment quite a bit to get good results for both PP and TG on the M2. I guess if the RPI5 is an important target, I would need to test on that as well.
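As a rough back-of-the-envelope bound (the bandwidth figure is an assumption, not a measurement): every generated token has to stream essentially the whole model through memory once, so with, say, ~10 GB/s of usable RPI5 bandwidth and the 729.84 MiB (~0.77 GB) Q5_0 model above, the ceiling would be around 10 / 0.77 ≈ 13 t/s, which is close to the ~11-12 t/s measured. At that point, any change in access pattern that wastes bandwidth shows up directly in TG.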

jart commented 1 month ago

We're only talking about ~15%, so chances are it's just noise. It feels like only yesterday that TG was 2-4 t/s, so I'm very pleased at how fast things have progressed over the last year with these $100 computers.

Janghou commented 6 days ago

FYI, an RPI5 won't throttle with an active cooler or case fan.

Anyhow, you can check whether an RPI5 has throttled:

```
> vcgencmd get_throttled
throttled=0x0
```

If the value is different from 0x0, there is a problem; a Pi can also throttle due to insufficient power.

https://www.raspberrypi.com/documentation/computers/os.html#get_throttled