ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Severe Performance Degradation on Q4_0 CPU-only with MacOS / Apple Silicon M2, after PR#9921 / Version 4081 #10435

Open AndreasKunar opened 5 days ago

AndreasKunar commented 5 days ago

What happened?

Prior to PR #9921 / version 4081, the -ngl 0 Q4_0 llama performance was significantly higher (more than 10x) than afterwards. (hardware: Apple MacBook Air M2, 10-core GPU, 24 GB RAM)

before PR:

make clean
git checkout ae8de6d
make -j llama-bench
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 60.48 ± 0.49 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 14.89 ± 0.20 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 63.50 ± 2.47 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 11.93 ± 3.30 |

with -ngl 99:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 194.94 ± 0.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 11.81 ± 6.53 |

build: ae8de6d5 (4080)

versions after PR (including current):

make clean
git checkout 1607a5e
make -j llama-bench
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 4.11 ± 0.24 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 1.86 ± 0.01 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 62.81 ± 2.55 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 14.70 ± 1.97 |

with -ngl 99:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 186.02 ± 13.18 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 11.25 ± 3.42 |

build: 1607a5e5 (4081)

The variations in the other results (everything except the -ngl 0 / Q4_0 case) might be due to the MacBook Air's thermal throttling.

Name and Version

Apple clang version 16.0.0 (clang-1600.0.26.4)
Target: arm64-apple-darwin24.1.0
Thread model: posix
macOS Sequoia 15.1.1

What operating system are you seeing the problem on?

Mac

Relevant log output

No response

slaren commented 5 days ago

The problem seems to be that Q4_0 is being converted to Q4_0_4_8, which has bad performance on this CPU. If patched so that it is converted to Q4_0_4_4 instead, the performance is as expected, e.g.:

diff --git a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
index 96a16dfb..ff27f9cb 100644
--- a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
+++ b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
@@ -3549,7 +3549,7 @@ enum ggml_type ggml_aarch64_get_optimal_repack_type(const struct ggml_tensor * c
             return GGML_TYPE_Q4_0_8_8;
         }
         if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
-            return GGML_TYPE_Q4_0_4_8;
+            //return GGML_TYPE_Q4_0_4_8;
         }
         if (ggml_cpu_has_neon()) {
             return GGML_TYPE_Q4_0_4_4;

@chaxu01 does the logic for choosing the type to repack to need to be updated?

ggerganov commented 5 days ago

I can confirm that on Apple, ggml_arm_arch_features.has_i8mm ends up being true, but the defined(__ARM_FEATURE_MATMUL_INT8) check is false, so the logic decides to repack, but then does not use the optimized path and falls back to the slow scalar implementation.
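For illustration, here is a minimal self-contained C sketch of that mismatch (hypothetical code, not the actual ggml sources): the repack choice is driven by the runtime probe, while the optimized kernel only exists when the compile-time macro is set.

#include <stdbool.h>
#include <stdio.h>

/* stand-in for ggml's runtime detection (true on an M2, e.g. via sysctl) */
static bool cpu_has_matmul_int8(void) { return true; }

/* repack decision: driven purely by the runtime probe */
static const char * pick_repack_type(void) {
    return cpu_has_matmul_int8() ? "Q4_0_4_8" : "Q4_0_4_4";
}

/* kernel selection: gated by the compile-time macro */
static const char * kernel_path(void) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    return "optimized i8mm kernel";        /* only present when built with -march=...+i8mm */
#else
    return "scalar reference fallback";    /* what actually ran here, hence the slowdown */
#endif
}

int main(void) {
    printf("repacked to %s, executed with %s\n", pick_repack_type(), kernel_path());
    return 0;
}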

eddnjjn commented 4 days ago

@AndreasKunar : If you build with cmake, try adding something like -DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+dotprod -DCMAKE_CXX_FLAGS=-march=armv8.2a+i8mm+dotprod to the build command. This causes __ARM_FEATURE_MATMUL_INT8 to be defined.
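A quick way to check whether the flags took effect (a small hypothetical probe, not part of llama.cpp) is to compile something like the following with the same -march setting and see which ACLE feature macros the compiler defines:

#include <stdio.h>

int main(void) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    puts("__ARM_FEATURE_MATMUL_INT8 is defined - i8mm kernels will be compiled in");
#else
    puts("__ARM_FEATURE_MATMUL_INT8 is NOT defined");
#endif
#if defined(__ARM_FEATURE_DOTPROD)
    puts("__ARM_FEATURE_DOTPROD is defined");
#else
    puts("__ARM_FEATURE_DOTPROD is NOT defined");
#endif
    return 0;
}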

chaxu01 commented 4 days ago

@AndreasKunar I would like to confirm whether you built with the arch=...+i8mm option. Because runtime CPU feature detection reports that the CPU supports i8mm, we decided to repack into Q4_0_4_8. However, when executing the kernel, the reference implementation is used because the optimized kernel was not included in the build.

AndreasKunar commented 4 days ago

> @AndreasKunar : If you build with cmake, try adding something like -DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+dotprod -DCMAKE_CXX_FLAGS=-march=armv8.2a+i8mm+dotprod to the build command. This causes __ARM_FEATURE_MATMUL_INT8 to be defined.

@eddnjjn + @chaxu01: I normally do not build with cmake; as in my initial report, I built with "make -j llama-bench" and no switches. This is also in accordance with the llama.cpp performance comparison instructions (discussion #4167), which are essential and still current.

I tried a cmake build with your -march flags - it does not degrade performance:

cmake -DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+dotprod -DCMAKE_CXX_FLAGS=-march=armv8.2a+i8mm+dotprod -B build
cmake --build build --config Release --target llama-bench
./build/bin/llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 56.64 ± 1.37 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 13.76 ± 0.15 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 59.08 ± 1.13 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 12.86 ± 0.94 |

build: 1bb30bf2 (4149)

I understand that the problem is due to the repack-on-load code (which is great because it avoids having to manually convert Q4_0 to Q4_0_4_4 or _8). It's a great idea, thanks, and it also helps a lot on Macs running containers/VMs. This is why I initially included the Q4_0_4_4 performance numbers (without repacking).

But if special build flags, or using cmake instead of make, are now needed (especially on very common hardware like Apple silicon Macs), I think this needs to be documented thoroughly in the llama.cpp build instructions and also in the Mac performance comparison build instructions (discussion #4167).

I also understand that the build might in the future best be standardized on cmake, with make deprecated (see the ongoing discussion in issue #10268). But until now, and as documented in the build instructions, "make" always worked nicely for common hardware+OS combinations. If it no longer works correctly, that is a severe bug and needs to be documented thoroughly!

P.S.: please tell me if I can support or help you. I'm currently on vacation and a bit slow in responding, apologies.

chaxu01 commented 4 days ago

@AndreasKunar Thanks for the verification. I agree that the llama.cpp build instructions could be updated to better reflect the runtime repack case.

slaren commented 4 days ago

At least the ggml_cpu_has_xxx functions should also check if the build supports the feature and return false otherwise. If the ARM features are not automatically enabled with -march=native, it may also be good to enable them by default, since they are checked at runtime anyway.
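A sketch of what that could look like (an assumed shape, not an actual patch), combining the compile-time macro with the runtime field mentioned above; the struct declaration here is only a stand-in for ggml's real detection code:

#include <stdbool.h>

/* stand-in for ggml's runtime CPU detection state (filled via sysctl/HWCAP in ggml) */
struct arm_arch_features_t { bool has_i8mm; };
static struct arm_arch_features_t ggml_arm_arch_features = { true };

int ggml_cpu_has_matmul_int8(void) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    return ggml_arm_arch_features.has_i8mm;   /* feature compiled in AND present at runtime */
#else
    return 0;   /* optimized kernel not built, so never select Q4_0_4_8 repacking */
#endif
}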

slaren commented 2 days ago

I will reopen this since the issue with the ARM features not being automatically enabled with the default native builds is still not addressed.

chaxu01 commented 1 day ago

@slaren One option for fixing this could be to update the CMake script to conditionally add the appropriate compile options based on the detected CPU capabilities. This would ensure that the required ARM features, such as -march=armv8.2a and any specific flags like i8mm, are enabled when supported by the hardware.

slaren commented 1 day ago

Yes, that should work. Ideally it would work in the same way as x86, and automatically enable all the features of the local CPU when GGML_NATIVE is enabled, and otherwise individual features can be enabled manually.

chaxu01 commented 1 day ago

Thanks for confirming. I’ll work on updating the CMake script to ensure that, when GGML_NATIVE is enabled, all supported ARM features of the local CPU are automatically detected and enabled.

chaxu01 commented 19 hours ago

Created PR https://github.com/ggerganov/llama.cpp/pull/10487