I forgot to add a `Q8_0` implementation (required because of the reordering of the quantized activations), so converting to draft until I add it.
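For context, here is a minimal sketch of what such a repack could look like, assuming the standard `ggml` `block_q8_0` layout (one fp16 scale plus 32 int8 values). The interleaving scheme shown is purely illustrative and not necessarily the layout used in this PR:

```cpp
#include <cstdint>

// Standard ggml Q8_0 block: one fp16 scale and 32 signed 8-bit quants.
#define QK8_0 32
typedef uint16_t ggml_half;            // fp16 stored as raw bits
typedef struct {
    ggml_half d;                       // per-block scale
    int8_t    qs[QK8_0];               // quantized values
} block_q8_0;

// Illustrative repack: interleave the blocks of 4 consecutive activation rows
// so a kernel can stream them with one linear pointer while producing
// 4 dot products at a time. dst must hold 4*nblk blocks.
static void repack_q8_0_x4(const block_q8_0 *rows[4], int nblk, block_q8_0 *dst) {
    for (int ib = 0; ib < nblk; ++ib) {
        for (int r = 0; r < 4; ++r) {
            dst[4*ib + r] = rows[r][ib];   // row-interleaved order
        }
    }
}
```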
Here are the improvements on my Mac Studio. Enormous gains for `Q5_K_M`, `Q6_K`, and `Q5_0`!! I'm actually very pleased that you're optimizing the legacy quants too, due to weird new models like IBM Granite 34b.
cpu_info | model_filename | size | test | t/s before | t/s after | t/s speedup |
---|---|---|---|---|---|---|
Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | pp512 | 693.92 | 883.96 | 1.27x |
Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | tg16 | 70.39 | 103.10 | 1.46x |
Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | pp512 | 222.32 | 617.74 | 2.78x |
Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | tg16 | 96.01 | 96.93 | 1.01x |
Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | pp512 | 244.09 | 658.62 | 2.70x |
Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | tg16 | 93.74 | 103.06 | 1.10x |
Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | pp512 | 245.62 | 809.91 | 3.30x |
Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | tg16 | 96.11 | 106.78 | 1.11x |
Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q4_0 | 606.53 MiB | pp512 | 625.47 | 943.14 | 1.51x |
Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q4_0 | 606.53 MiB | tg16 | 129.34 | 124.60 | 0.96x |
Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | pp512 | 249.27 | 694.66 | 2.79x |
Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | tg16 | 108.34 | 105.45 | 0.97x |
The gains are also enormous on Raspberry Pi. Going 2x to 3x faster is huge. I've gotten F16 to go as fast as 80 tok/sec (not sure why it doesn't anymore; it could be due to cooling). However, I'm noticing that prediction is slowing down a bit on RPI5. Did you do anything to change that? Once again, it could be cooling. If you have any ideas, send me a follow-up change. With tinyBLAS, in many cases it'll punt control back to GGML when `n=1`. The special codepaths should only run when they add value.
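A minimal sketch of what that kind of guard could look like; the function, thresholds, and parameter names here are hypothetical, not llamafile's actual API:

```cpp
// Hypothetical dispatch: take the optimized GEMM path only for batched
// multiplies (prompt processing, n > 1). For single-token generation (n == 1)
// fall back to GGML's existing vec_dot path, which is memory-bandwidth bound
// anyway, so the special codepath adds no value there.
bool maybe_use_fast_gemm(int m, int n, int k) {
    if (n <= 1)             return false;  // token generation: let GGML handle it
    if (m < 8 || k < 32)    return false;  // too small for a tiled kernel to pay off
    return true;                            // batched case: use the fast path
}
```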
cpu_info | model_filename | size | test | t/s before | t/s after | t/s speedup |
---|---|---|---|---|---|---|
+fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.F16 | 2.05 GiB | pp512 | 66.53 | 66.53 | 1.00x |
+fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.F16 | 2.05 GiB | tg16 | 4.26 | 4.26 | 1.00x |
+fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | pp512 | 44.92 | 55.41 | 1.23x |
+fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | tg16 | 8.38 | 7.90 | 0.94x |
+fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | pp512 | 18.20 | 37.59 | 2.07x |
+fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | tg16 | 11.48 | 9.66 | 0.84x |
+fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | pp512 | 19.38 | 41.25 | 2.13x |
+fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | tg16 | 13.41 | 10.22 | 0.76x |
+fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | pp512 | 17.64 | 46.45 | 2.63x |
+fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | tg16 | 11.83 | 11.12 | 0.94x |
+fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | pp512 | 18.80 | 44.74 | 2.38x |
+fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | tg16 | 14.54 | 14.79 | 1.02x |
> However I'm noticing that prediction is slowing down a bit on RPI5. Did you do anything to change that?
TG is severely limited by memory bandwidth and hence extremely sensitive to memory access patterns. I had to experiment quite a bit to get good results for PP and TG on the M2. I guess, if RPI5 is an important target, I would need to test on that as well.
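To make the bandwidth argument concrete: for token generation every weight is read at least once per token, so memory bandwidth sets a hard ceiling of roughly (bandwidth) / (model size) tokens per second. A back-of-the-envelope sketch, where the ~17 GB/s RPi 5 bandwidth figure is an assumption rather than a measurement:

```cpp
#include <cstdio>

int main() {
    // Assumed numbers, for illustration only.
    const double bandwidth_gb_s = 17.0;            // rough LPDDR4X peak on an RPi 5
    const double model_size_mib = 745.11;          // Q5_K_M model from the table above

    // TG reads (at least) the whole model once per generated token,
    // so bandwidth caps the achievable tokens/second.
    const double bytes_per_token = model_size_mib * 1024.0 * 1024.0;
    const double max_tps = bandwidth_gb_s * 1e9 / bytes_per_token;
    printf("upper bound: ~%.0f t/s\n", max_tps);   // ~22 t/s; measured ~10-13 t/s is in that ballpark
    return 0;
}
```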
We're only talking about ~15%, so chances are it's just noise. It feels like only yesterday that TG was 2-4 tok/s, so I'm very pleased at how fast things have progressed over the last year with these $100 computers.
FYI, an RPI5 won't throttle with an active cooler or case fan.
Anyhow, you can test whether an RPI5 has throttled:
```
$ vcgencmd get_throttled
throttled=0x0
```
If the value is different from 0x0 there is a problem; a Pi can also throttle due to insufficient power.
https://www.raspberrypi.com/documentation/computers/os.html#get_throttled
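If you want to decode the individual bits, here is a small helper; the bit meanings are taken from the Raspberry Pi documentation linked above, so double-check them against your firmware's docs:

```cpp
#include <cstdio>
#include <cstdlib>

// Decode the bitmask reported by `vcgencmd get_throttled`.
// Low bits are current conditions, bits 16+ are "has occurred since boot".
int main(int argc, char **argv) {
    unsigned long v = argc > 1 ? strtoul(argv[1], nullptr, 0) : 0;  // accepts "0x..." input
    struct { unsigned bit; const char *msg; } flags[] = {
        {0,  "under-voltage detected"},
        {1,  "arm frequency capped"},
        {2,  "currently throttled"},
        {3,  "soft temperature limit active"},
        {16, "under-voltage has occurred"},
        {17, "arm frequency capping has occurred"},
        {18, "throttling has occurred"},
        {19, "soft temperature limit has occurred"},
    };
    for (const auto &f : flags)
        if (v & (1ul << f.bit)) printf("%s\n", f.msg);
    return 0;
}
```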
This PR adds matrix multiplication implementations for legacy and k-quants on `__aarch64__` that are significantly more performant. The following table compares performance between the main branch and this PR for a 7B LLaMA model running on M2 Max. We observe prompt processing speed improvements of up to a factor of 3.6, and even performance gains for token generation despite this being a memory-bound problem. The performance gain for `Q4_0` and `Q8_0` is smaller because the main branch already uses tinyBLAS for these (i.e., the 1.6X/1.35X improvement is on top of the ~2X improvement due to tinyBLAS).

As llamafile performance on my M2 Max laptop is lower compared to mainline `llama.cpp`, I also integrated into current `llama.cpp` (build 2980, commit hash `dacfcebd`) to compare the performance. The following table summarizes the results. To have an apples-to-apples comparison, the performance values for the master `llama.cpp` branch were obtained with the Accelerate framework disabled. Here, too, the performance gains are significant, up to 2.6X for `Q2_K_S`.
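For readers unfamiliar with why the `+dotprod` feature matters for these kernels, below is a minimal, self-contained sketch of an int8 dot product using the Armv8.2-A `vdotq_s32` instruction. It only illustrates the basic building block such kernels rely on and is not the actual code in this PR:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Dot product of two int8 vectors of length n (n must be a multiple of 16).
// vdotq_s32 accumulates four groups of 4 int8*int8 products into four int32
// lanes in a single instruction (requires the Armv8.2 dot-product extension).
int32_t dot_i8(const int8_t *x, const int8_t *y, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        acc = vdotq_s32(acc, vld1q_s8(x + i), vld1q_s8(y + i));
    }
    return vaddvq_s32(acc);   // horizontal sum of the four lanes
}
```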