Closed ikawrakow closed 1 month ago
On AMD Ryzen Threadripper PRO 7995WX with Mixtral 8x7b I'm seeing speedups for Q5_K_M as high as 2.6x. Some quick measurements on my end, for a context size of 20900 and a prompt size of 1611 tokens:
| quant | tok/sec before | tok/sec after | speedup |
|---|---|---|---|
| Q2_K | 153.95 | 195.15 | 1.27x |
| Q5_K_M | 121.16 | 314.70 | 2.60x |
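The speedup column is just the ratio of after to before throughput; a quick sketch to double-check the numbers in the table above:

```python
# Verify the reported speedup ratios (tok/sec values from the table).
measurements = {
    "Q2_K": (153.95, 195.15),
    "Q5_K_M": (121.16, 314.70),
}

for quant, (before, after) in measurements.items():
    speedup = after / before
    print(f"{quant}: {speedup:.2f}x")
# Q2_K: 1.27x
# Q5_K_M: 2.60x
```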
So once again, outstanding work!
P.S. I'm going to be looking into integrating the `llama-bench` command sometime soon.
@jart
Yes, having `llama-bench` available would be very useful. Thanks for the Ryzen 7995WX performance numbers. I'm curious to see how the latest version in PR #435 does on that CPU.
Btw, I have done an implementation for ARM_NEON as well. I did it out of curiosity to see what is possible. I'm getting in the range of 2X improvement compared to mainline `llama.cpp` on my M2 Max. But this is still not better than just using the Accelerate framework, so I'm wondering if there is a benefit to adding this to `llamafile`.
We now have a `llamafile-bench` program. I've been running it with this script:
```sh
#!/bin/sh
cd ~/llamafile
make -j16 o//llama.cpp/llama-bench/llama-bench || exit
o//llama.cpp/llama-bench/llama-bench \
  $(for f in $(ls -S /weights/TinyLlama-1.1B-Chat-v1.0.*.gguf \
               /weights/mixtral-8x7b-instruct-v0.1.*.gguf); do
      echo -m $f
    done) \
  "$@"
```
I also wrote a script you can use to zip up two text file reports into a single report. https://github.com/Mozilla-Ocho/llamafile/blob/main/llama.cpp/llama-bench/bench-llamafile-zip.py
Super exciting to hear about ARM NEON too! Raspberry Pi and Asahi Linux users will certainly thank you. The offer is open for me to mail you a Raspberry Pi 5 off Amazon if it'll help you with development.
Thanks for the offer, but it would be easier for me to just do it on my M2 laptop. I assume it would run on the Pi once it builds and runs successfully on my laptop with ARMv8.2 settings?
This PR is a follow-up to PRs #394 and #405, extending the faster matrix multiplications for legacy and k-quants introduced there to MoE models as well.
The following table shows a prompt processing speed comparison between the main branch and this PR for Mixtral-8x7B on a Ryzen 7950X CPU.