Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Optimized matrix multiplications for i-quants on __aarch64__ #464

Closed ikawrakow closed 3 weeks ago

ikawrakow commented 3 weeks ago

i-quants offer better quantization quality than k-quants in the 2- and 3-bpw range, but are notoriously slow on the CPU. This PR brings a significant speedup on Arm CPUs, particularly for prompt processing. Performance is still lower than k-quants, but the gap is now substantially smaller.

The following table compares performance between the main branch and this PR for a 7B LLaMA model on an M2 Max CPU.

| cpu_info | model_filename | size | threads | test | t/s (main) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | 8 | pp512 | 16.50 | 61.16 | 3.707 |
| M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | 8 | pp512 | 19.09 | 57.42 | 3.008 |
| M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | 8 | pp512 | 13.32 | 46.37 | 3.481 |
| M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | 8 | pp512 | 12.30 | 48.60 | 3.951 |
| M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | 8 | pp512 | 12.11 | 49.70 | 4.104 |
| M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | 4 | tg128 | 7.73 | 11.03 | 1.427 |
| M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | 8 | tg128 | 14.64 | 20.09 | 1.372 |
| M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | 4 | tg128 | 8.56 | 10.72 | 1.252 |
| M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | 8 | tg128 | 16.17 | 19.91 | 1.231 |
| M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | 4 | tg128 | 6.34 | 7.44 | 1.174 |
| M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | 8 | tg128 | 12.03 | 13.60 | 1.106 |
| M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | 4 | tg128 | 5.98 | 6.78 | 1.134 |
| M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | 8 | tg128 | 10.93 | 11.94 | 1.092 |
| M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | 4 | tg128 | 5.62 | 5.95 | 1.059 |
| M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | 8 | tg128 | 10.39 | 10.71 | 1.031 |
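For reference, the Speedup column is simply t/s (PR) divided by t/s (main). A minimal Python sketch reproducing the pp512 rows (values copied from the table above):

```python
# Recompute the Speedup column for the pp512 rows:
# speedup = tokens/sec with this PR / tokens/sec on the main branch.
pp512_rows = [
    # (quant type, t/s main, t/s PR)
    ("iq2xxs", 16.50, 61.16),
    ("iq2xs",  19.09, 57.42),
    ("iq2m",   13.32, 46.37),
    ("iq3xxs", 12.30, 48.60),
    ("iq3m",   12.11, 49.70),
]

for quant, main_tps, pr_tps in pp512_rows:
    print(f"{quant}: {pr_tps / main_tps:.3f}x")
```

Token-generation (tg128) speedups are much smaller because that workload is memory-bandwidth bound, so faster matrix multiplication helps less than it does for prompt processing.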