Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Faster AVX2 matrix multiplications for MoE models #428

Closed ikawrakow closed 1 month ago

ikawrakow commented 1 month ago

This PR is a follow-up to PRs #394 and #405: it enables the faster matrix multiplications for legacy and k-quants introduced there for MoE models as well.
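To make these kernels concrete, here is a minimal sketch of the underlying technique for AVX2+FMA: widen the int8 quants to int16, multiply-accumulate into int32 lanes with `_mm256_madd_epi16`, and apply the per-block scales in fp32. The `block_q8` layout below is a simplified, hypothetical stand-in (one float scale plus 32 int8 quants); the actual kernels in this PR cover the full legacy and k-quant formats and tile over multiple rows at once.

```c++
#include <immintrin.h>
#include <stdint.h>

// Hypothetical simplified block: one fp32 scale shared by 32 int8 quants.
// Real llama.cpp/llamafile block formats are more elaborate (fp16 scales,
// sub-block minima, packed 4/5/6-bit quants, etc.).
struct block_q8 {
    float d;        // per-block scale
    int8_t qs[32];  // quantized values
};

// Dot product of two rows of q8 blocks using AVX2 + FMA.
static float vec_dot_q8_avx2(int nblocks, const block_q8 *x, const block_q8 *y) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < nblocks; ++i) {
        __m256i xv = _mm256_loadu_si256((const __m256i *)x[i].qs);
        __m256i yv = _mm256_loadu_si256((const __m256i *)y[i].qs);
        // Widen int8 -> int16, multiply, and horizontally add pairs to int32.
        __m256i lo = _mm256_madd_epi16(
            _mm256_cvtepi8_epi16(_mm256_castsi256_si128(xv)),
            _mm256_cvtepi8_epi16(_mm256_castsi256_si128(yv)));
        __m256i hi = _mm256_madd_epi16(
            _mm256_cvtepi8_epi16(_mm256_extracti128_si256(xv, 1)),
            _mm256_cvtepi8_epi16(_mm256_extracti128_si256(yv, 1)));
        __m256i dot = _mm256_add_epi32(lo, hi);
        // Scale the integer dot product by both block scales and accumulate.
        acc = _mm256_fmadd_ps(_mm256_set1_ps(x[i].d * y[i].d),
                              _mm256_cvtepi32_ps(dot), acc);
    }
    // Horizontal sum of the 8 fp32 lanes.
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
```

Roughly speaking, an MoE layer runs this same row-wise kernel against each selected expert's weight matrix instead of one dense matrix, which is why the dense-layer speedups from #394 and #405 carry over once the MoE code path is wired up to them.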

The following table shows a prompt processing speed comparison between the main branch and this PR for Mixtral 8x7B on a Ryzen 7950X CPU (PP-512, tokens per second):

| Quantization | PP-512 (main) | PP-512 (PR) | Speedup |
|---|---:|---:|---:|
| Q4_0 | 59.2 | - | - |
| Q4_1 | 35.3 | 69.6 | 1.97 |
| Q5_0 | 30.6 | 65.4 | 2.14 |
| Q5_1 | 29.5 | 64.0 | 2.17 |
| Q2_K_S | 66.8 | 88.9 | 1.33 |
| Q3_K_S | 45.2 | 85.3 | 1.89 |
| Q4_K_S | 53.4 | 81.8 | 1.53 |
| Q5_K_S | 38.6 | 75.0 | 1.94 |
| Q6_K | 41.8 | 85.6 | 2.05 |
| IQ4_XS | 41.6 | 76.1 | 1.83 |
jart commented 1 month ago

On an AMD Ryzen Threadripper PRO 7995WX with Mixtral 8x7B, I'm seeing speedups for Q5_K_M as high as 2.6x. Some quick measurements on my end, for a context size of 20900 and a prompt of 1611 tokens:

| Quant | tok/sec (before) | tok/sec (after) | Speedup |
|---|---:|---:|---:|
| Q2_K | 153.95 | 195.15 | 1.27x |
| Q5_K_M | 121.16 | 314.70 | 2.60x |

So once again, outstanding work!

P.S. I'm going to be looking into integrating the llama-bench command sometime soon.

ikawrakow commented 1 month ago

@jart

Yes, having llama-bench available would be very useful. Thanks for the Ryzen 7995WX performance numbers. I'm curious to see how the latest version in PR #435 does on that CPU.

Btw, I have done an ARM_NEON implementation as well, out of curiosity to see what is possible. I'm getting roughly a 2X improvement over mainline llama.cpp on my M2 Max, but that is still not better than just using the Accelerate framework, so I'm wondering whether there is any benefit to adding it to llamafile.
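For reference, a NEON counterpart of the AVX2 sketch above follows the same pattern; it is again a hypothetical sketch, not the actual implementation being discussed: widening int8 multiplies accumulated into int32 lanes, with the per-block scales applied in fp32. On ARMv8.2+ CPUs with the dotprod extension, `vdotq_s32` would replace the widening multiply pairs.

```c++
#include <arm_neon.h>
#include <stdint.h>

// Same simplified, hypothetical block layout as in the AVX2 sketch.
struct block_q8 {
    float d;        // per-block scale
    int8_t qs[32];  // quantized values
};

// Baseline ARMv8 dot product: vmull_s8 widens int8*int8 to int16, and
// vpadalq_s16 pairwise-accumulates those products into int32 lanes.
static float vec_dot_q8_neon(int nblocks, const block_q8 *x, const block_q8 *y) {
    float sum = 0.0f;
    for (int i = 0; i < nblocks; ++i) {
        int32x4_t acc = vdupq_n_s32(0);
        for (int j = 0; j < 32; j += 16) {
            int8x16_t xv = vld1q_s8(x[i].qs + j);
            int8x16_t yv = vld1q_s8(y[i].qs + j);
            acc = vpadalq_s16(acc, vmull_s8(vget_low_s8(xv), vget_low_s8(yv)));
            acc = vpadalq_s16(acc, vmull_s8(vget_high_s8(xv), vget_high_s8(yv)));
        }
        // Horizontal add of the four int32 lanes, then apply both scales.
        sum += x[i].d * y[i].d * (float)vaddvq_s32(acc);
    }
    return sum;
}
```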

jart commented 1 month ago

We now have a llamafile-bench program. I've been running it with this script:

```sh
#!/bin/sh
# Build the benchmark tool, then run it against every TinyLlama and
# Mixtral gguf under /weights (ls -S: largest files first), passing
# any extra arguments through to llama-bench.
cd ~/llamafile
make -j16 o//llama.cpp/llama-bench/llama-bench || exit
o//llama.cpp/llama-bench/llama-bench \
  $(for f in $(ls -S /weights/TinyLlama-1.1B-Chat-v1.0.*.gguf \
                     /weights/mixtral-8x7b-instruct-v0.1.*.gguf); do
      echo -m $f
    done) \
  "$@"
```

I also wrote a script you can use to zip up two text file reports into a single report. https://github.com/Mozilla-Ocho/llamafile/blob/main/llama.cpp/llama-bench/bench-llamafile-zip.py

Super exciting to hear about ARM NEON too! Raspberry Pi and Asahi Linux users will certainly thank you. The offer is open for me to mail you a Raspberry Pi 5 off Amazon if it'll help you with development.

ikawrakow commented 1 month ago

> Super exciting to hear about ARM NEON too! Raspberry Pi and Asahi Linux users will certainly thank you. The offer is open for me to mail you a Raspberry Pi 5 off Amazon if it'll help you with development.

Thanks for the offer, but it would be easier for me to just do it on my M2 laptop. I assume it would run on the Pi once it builds and runs successfully on my laptop with ARMv8.2 settings?